A Time-Offset Interaction Application (TOIA) is a software system that allows people to engage in face-to-face dialogue with previously recorded videos of other people. There are two TOIA usage modes: (a) creation mode, where users pre-record video snippets of themselves representing their answers to possible questions someone may ask them, and (b) interaction mode, where other users of the system can choose to interact with the created avatars. This paper presents the HelloThere corpus, collected from two user studies involving several people who recorded avatars and many more who engaged in dialogues with them. The interactions with avatars are annotated: interrogators ask questions through three modes (card selection, text search, and voice input) and rate the appropriateness of each answer on a 1-to-5 scale. The corpus, made available to the research community, comprises 26 avatars' knowledge bases and 317 dialogues between 64 interrogators and the avatars, in text format.
The HelloThere Corpus is a collection of question-answer pairs and user interactions designed for research in conversational AI and time-offset interactive dialogue systems. The dataset captures real-world interactions between users and time-offset avatars, providing insight into user behavior, question patterns, and response effectiveness.
The dataset consists of three main CSV files: an interaction log, user ratings of the video responses, and the avatars' knowledge bases. The interaction log contains the following columns:
Column Name | Data Type | Description |
---|---|---|
interactor_id | Integer | Unique identifier for the user interacting with the system |
toia_id | Integer | Identifier for the TOIA avatar being interacted with |
timestamp | Integer | Unix timestamp of the interaction (milliseconds since epoch) |
filler | Boolean | Indicates whether the interaction was a filler (true) or not (false) |
question_asked | String | The question asked by the user (empty if filler) |
video_played | String | Filename of the video response played |
ada_similarity_score | Float | Similarity score calculated using the model and methodology described in the paper (empty if not applicable) |
mode | String | Mode of interaction: "CARD", "VOICE", or "SEARCH"; "UNKNOWN" when a filler is played or the system is idle while the user decides what to ask |
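As a minimal sketch of working with the interaction log, the snippet below parses rows in the schema above, filters out fillers, and decodes the epoch-millisecond timestamps. The sample rows and values are invented for illustration; only the column names come from this documentation.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample rows following the interaction-log schema above.
SAMPLE = """interactor_id,toia_id,timestamp,filler,question_asked,video_played,ada_similarity_score,mode
1,5,1700000000000,false,What is your name?,ans_001.mp4,0.91,VOICE
1,5,1700000005000,true,,filler_002.mp4,,UNKNOWN
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Keep only real questions (filler == "false") and convert the
# milliseconds-since-epoch timestamp to an aware datetime.
questions = []
for r in rows:
    if r["filler"] == "false":
        when = datetime.fromtimestamp(int(r["timestamp"]) / 1000, tz=timezone.utc)
        questions.append((when, r["question_asked"], r["mode"]))

print(questions)
```

Reading from the actual corpus file would only require swapping `io.StringIO(SAMPLE)` for an open file handle.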
The ratings file contains user feedback on the video responses:

Column Name | Data Type | Description |
---|---|---|
video_id | String | Identifier for the video response; video IDs can be used to link the different tables for cross-table analysis |
user_id | Integer | Unique identifier for the user providing feedback |
question | String | The question associated with the video response |
rating | Integer | User rating for the video response (on a scale of 1-5) |
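Since each video can be rated by multiple users, a common first step is aggregating the 1-5 ratings per video. The sketch below does this with the standard library; the sample rows are hypothetical, with only the column names taken from the schema above.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows following the ratings-file schema above.
SAMPLE = """video_id,user_id,question,rating
vid_001,10,What is your name?,5
vid_001,11,What is your name?,4
vid_002,10,Where are you from?,2
"""

# Group ratings by video, then compute the mean appropriateness score.
ratings = defaultdict(list)
for r in csv.DictReader(io.StringIO(SAMPLE)):
    ratings[r["video_id"]].append(int(r["rating"]))

avg_rating = {vid: mean(vals) for vid, vals in ratings.items()}
print(avg_rating)
```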
The knowledge-base file lists each avatar's pre-recorded questions and answers:

Column Name | Data Type | Description |
---|---|---|
stream_id_stream | String | Identifier for the stream (a sub-ID of a toia_id, indicating different variants of the same TOIA avatar, as described in the paper) |
type | String | Type of question/response (e.g., "answer", "filler", "greeting", ...) |
question | String | The question pre-recorded by the TOIA Avatar |
id_video | String | Unique identifier for the video response |
toia_id | Integer | Identifier for the TOIA avatar |
idx | Integer | Index |
private | Boolean | Indicates whether the question/response is private (0 for false, 1 for true). For the experimental setup described in the paper, all questions were marked non-private by the TOIA avatar makers |
answer | String | Transcription of the video response. These transcriptions are as produced by the Google Translate API version used at the time of recording; users often edited and corrected them, but not all may be accurate |
onboarding | Boolean | Indicates whether the question is part of a mandatory onboarding series of questions to record (0 for false, 1 for true) |
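As noted above, video IDs link the tables. The sketch below joins knowledge-base entries with their ratings via `id_video`/`video_id`. All sample rows and values are invented for illustration; only the column names follow the schemas documented here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical slices of the knowledge-base and ratings files,
# using the column names from the schemas above.
KB = """stream_id_stream,type,question,id_video,toia_id,idx,private,answer,onboarding
s5-main,answer,What is your name?,vid_001,5,0,0,My name is Sam.,1
s5-main,filler,,vid_099,5,1,0,Hmm let me think.,0
"""
RATINGS = """video_id,user_id,question,rating
vid_001,10,What is your name?,5
vid_001,11,What's your name?,4
"""

# Index the ratings by video_id.
by_video = defaultdict(list)
for r in csv.DictReader(io.StringIO(RATINGS)):
    by_video[r["video_id"]].append(int(r["rating"]))

# Attach each knowledge-base entry's ratings (empty list if unrated).
joined = []
for entry in csv.DictReader(io.StringIO(KB)):
    joined.append({
        "question": entry["question"],
        "answer": entry["answer"],
        "ratings": by_video.get(entry["id_video"], []),
    })

print(joined)
```

The same join generalizes to the interaction log through its `video_played` column, modulo any filename-to-ID mapping the corpus uses.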
All data files are provided in CSV (Comma-Separated Values) format. Each file can be easily imported into spreadsheet software or data analysis tools that support CSV.
The data was collected through interactions between users and TOIA (Time-Offset Interaction Application) avatars. Users asked questions to video-based AI avatars, and their interactions, including questions asked and videos played, were recorded. Users also provided feedback on the relevance and quality of the retrieved responses.
//////////////////////////////////////////////////////////////////////
// License for The HelloThere Corpus
//////////////////////////////////////////////////////////////////////

Copyright 2024 New York University Abu Dhabi. All Rights Reserved.

A license to use and copy this software, data and its documentation solely for your internal research and evaluation purposes, without fee and without a signed licensing agreement, is hereby granted upon your download of the software, through which you agree to the following: 1) the above copyright notice, this paragraph and the following three paragraphs will prominently appear in all internal copies and modifications; 2) no rights to sublicense or further distribute this software are granted; 3) no rights to modify this software are granted; and 4) no rights to assign this license are granted.

Please contact the Office of Industrial Liaison, New York University, One Park Avenue, 6th Floor, New York, NY 10016, (212) 263-8178, for commercial licensing opportunities, or for further distribution, modification or license rights.

Created by Alberto Chierici and Nizar Habash at the Computational Approaches to Modeling Language (CAMeL) Lab in New York University Abu Dhabi.

IN NO EVENT SHALL NYU, OR ITS EMPLOYEES, OFFICERS, AGENTS OR TRUSTEES (COLLECTIVELY "NYU PARTIES") BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM YOUR USE OF THIS SOFTWARE, DATA AND ITS DOCUMENTATION, EVEN IF ANY OF NYU PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.

NYU SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING THE SOFTWARE AND DATA, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE SOFTWARE.

THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". NYU HAS NO OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

If you use this resource, cite: Alberto Chierici and Nizar Habash. 2024. HelloThere: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

//////////////////////////////////////////////////////////////////////
If you use this dataset in your research, please cite it as follows:
@inproceedings{chierici-habash-2024-hellothere,
    title = "HelloThere: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars",
    author = "Chierici, Alberto and Habash, Nizar",
    booktitle = "Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    year = "2024",
    publisher = "Association for Computational Linguistics",
}