The HelloThere Corpus

Summary

Abstract

A Time-Offset Interaction Application (TOIA) is a software system that allows people to engage in face-to-face dialogue with previously recorded videos of other people. There are two TOIA usage modes: (a) creation mode, where users pre-record video snippets of themselves representing their answers to possible questions someone may ask them, and (b) interaction mode, where other users of the system can choose to interact with created avatars. This paper presents the HelloThere corpus that has been collected from two user studies involving several people who recorded avatars and many more who engaged in dialogues with them. The interactions with avatars are annotated by people asking them questions through three modes (card selection, text search, and voice input) and rating the appropriateness of their answers on a 1 to 5 scale. The corpus, made available to the research community, comprises 26 avatars' knowledge bases and 317 dialogues between 64 interrogators and the avatars in text format.

Data Schema

The HelloThere Corpus is a collection of question-answer pairs and user interactions designed for research in conversational AI and time-offset interactive dialogue systems. This dataset captures real-world interactions between users and time-offset interaction avatars, providing valuable insights into user behavior, question patterns, and response effectiveness in time-offset dialogue systems.

The dataset consists of three main CSV files:

1. conversations_log.csv

| Column Name | Data Type | Description |
| --- | --- | --- |
| interactor_id | Integer | Unique identifier for the user interacting with the system |
| toia_id | Integer | Identifier for the TOIA avatar being interacted with |
| timestamp | Integer | Unix timestamp of the interaction (milliseconds since epoch) |
| filler | Boolean | Indicates whether the interaction was a filler (true) or not (false) |
| question_asked | String | The question asked by the user (empty if filler) |
| video_played | String | Filename of the video response played |
| ada_similarity_score | Float | Similarity score calculated using the model and methodology referred to in the papers (if applicable) |
| mode | String | Mode of interaction: "CARD", "VOICE", or "SEARCH"; "UNKNOWN" when a filler is played or the system is idle while the user decides what to ask |
2. player_feedback.csv

| Column Name | Data Type | Description |
| --- | --- | --- |
| video_id | String | Identifier for the video response; video IDs can be used to link the tables for cross-table analysis |
| user_id | Integer | Unique identifier for the user providing feedback |
| question | String | The question associated with the video response |
| rating | Integer | User rating for the video response (on a scale of 1-5) |
3. questions.csv

| Column Name | Data Type | Description |
| --- | --- | --- |
| stream_id_stream | String | Identifier for the stream (a sub-id of a toia_id, indicating different variants of the same TOIA avatar, as described in the paper) |
| type | String | Type of question/response (e.g., "answer", "filler", "greeting", ...) |
| question | String | The question pre-recorded by the TOIA avatar maker |
| id_video | String | Unique identifier for the video response |
| toia_id | Integer | Identifier for the TOIA avatar |
| idx | Integer | Index |
| private | Boolean | Indicates whether the question/response is private (0 for false, 1 for true; in the experimental setup described in the paper, all questions were marked non-private by the avatar makers) |
| answer | String | Transcription of the video response; transcriptions were produced by the Google Translate API version in use at the time of recording, and avatar makers often edited and corrected them, but not all may be accurate |
| onboarding | Boolean | Indicates whether the question is part of the mandatory onboarding series of questions to record (0 for false, 1 for true) |
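As a minimal sketch of how the three files relate, the snippet below links them with pandas. The sample rows are invented stand-ins for the real files, and the join between video_played, video_id, and id_video is an assumption based on the linking note for video_id in the player_feedback.csv schema; replace the StringIO objects with the downloaded file paths.

```python
import io
import pandas as pd

# Tiny synthetic stand-ins mirroring the schema above; swap in the real
# CSV paths after downloading the corpus.
conversations = pd.read_csv(io.StringIO(
    "interactor_id,toia_id,timestamp,filler,question_asked,video_played,ada_similarity_score,mode\n"
    "1,7,1700000000000,False,Where did you grow up?,vid_001.mp4,0.91,VOICE\n"
))
feedback = pd.read_csv(io.StringIO(
    "video_id,user_id,question,rating\n"
    "vid_001.mp4,1,Where did you grow up?,5\n"
))
questions = pd.read_csv(io.StringIO(
    "stream_id_stream,type,question,id_video,toia_id,idx,private,answer,onboarding\n"
    "7_main,answer,Where did you grow up?,vid_001.mp4,7,0,0,I grew up in Rome.,1\n"
))

# Assumption: video_played, id_video, and video_id share one identifier space.
linked = (conversations
          .merge(questions, left_on="video_played", right_on="id_video",
                 suffixes=("", "_q"))
          .merge(feedback, left_on="video_played", right_on="video_id",
                 suffixes=("", "_fb")))
print(linked[["question_asked", "answer", "rating"]])
```

A join like this pairs each logged interaction with the transcription of the response that was played and the rating the interrogator gave it.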

File Format

All data files are provided in CSV (Comma-Separated Values) format. Each file can be easily imported into spreadsheet software or data analysis tools that support CSV.
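For programmatic access without extra dependencies, the standard library's csv module suffices; note that the Boolean columns in questions.csv are encoded as 0/1 (see the schema above). The inline sample row below is invented for illustration; point the reader at the downloaded file instead.

```python
import csv
import io

# Stand-in for questions.csv; replace with open("questions.csv") in practice.
sample = io.StringIO(
    "stream_id_stream,type,question,id_video,toia_id,idx,private,answer,onboarding\n"
    "7_main,filler,,vid_002.mp4,7,1,0,,0\n"
)

rows = []
for row in csv.DictReader(sample):
    # The corpus encodes Booleans as "0"/"1"; convert to Python bools.
    row["private"] = row["private"] == "1"
    row["onboarding"] = row["onboarding"] == "1"
    row["toia_id"] = int(row["toia_id"])
    rows.append(row)

print(rows[0]["private"], rows[0]["toia_id"])
```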

Data Collection

The data was collected through interactions between users and TOIA (Time-Offset Interaction Application) avatars. Users asked questions to video-based AI avatars, and their interactions, including questions asked and videos played, were recorded. Users also provided feedback on the relevance and quality of the retrieved responses.

Usage Notes

  1. Privacy: All user identifiers have been anonymized to protect user privacy. However, researchers should be cautious about potential indirect identification through question content. Where we identified personal names, we replaced the text with a `{{FIRST_NAME}}` token.
  2. Bias: The dataset may contain biases inherent in the user base or the design of the TOIA system. Researchers should consider these potential biases in their analyses.
  3. Missing Data: Some fields may contain empty values, especially in the conversations_log.csv file, where filler interactions are present.
  4. Timestamps: When working with timestamps in conversations_log.csv, ensure your tools correctly interpret them as milliseconds since epoch.
  5. Video Content: This dataset does not include the actual video files. Researchers will need to rely on the video identifiers and transcriptions provided in the questions.csv file.
  6. Feedback Interpretation: In player_feedback.csv, ratings are on a 1-5 scale; the interpretation of the scale is described in the paper.
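Regarding the timestamps note above: since conversations_log.csv stores milliseconds since the Unix epoch, divide by 1000 before handing values to standard datetime tools. A minimal sketch (the timestamp value is an example, not taken from the corpus):

```python
from datetime import datetime, timezone

# conversations_log.csv timestamps are milliseconds since the Unix epoch;
# divide by 1000 to get seconds before converting.
ts_ms = 1700000000000  # example value, not from the corpus
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())
```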

Team

  1. Alberto Chierici
  2. Nizar Habash

Publications

  1. Alberto Chierici and Nizar Habash. 2024. HelloThere: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
  2. Alberto Chierici and Nizar Habash. 2023. Tell me more, tell me more: AI-generated question suggestions for the creation of interactive video recordings. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 1725–1730. IEEE.
  3. Alberto Maria Chierici. 2023. Scalable, Human-Like Asynchronous Communication. Ph.D. thesis, New York University Tandon School of Engineering.

Download

By downloading the HelloThere Corpus dataset files from HERE, you agree to the terms of the license below.

//////////////////////////////////////////////////////////////////////
// License for The HelloThere Corpus
//////////////////////////////////////////////////////////////////////

Copyright 2024 New York University Abu Dhabi. All Rights Reserved.

A license to use and copy this software, data and its documentation solely for your internal research and evaluation purposes, without fee and without a signed licensing agreement, is hereby granted upon your download of the software, through which you agree to the following: 1) the above copyright notice, this paragraph and the following three paragraphs will prominently appear in all internal copies and modifications; 2) no rights to sublicense or further distribute this software are granted; 3) no rights to modify this software are granted; and 4) no rights to assign this license are granted.

Please contact the Office of Industrial Liaison, New York University, One Park Avenue, 6th Floor, New York, NY 10016, (212) 263-8178, for commercial licensing opportunities, or for further distribution, modification or license rights.

Created by Alberto Chierici and Nizar Habash at the Computational Approaches to Modeling Language (CAMeL) Lab in New York University Abu Dhabi.

IN NO EVENT SHALL NYU, OR ITS EMPLOYEES, OFFICERS, AGENTS OR TRUSTEES (COLLECTIVELY "NYU PARTIES") BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM YOUR USE OF THIS SOFTWARE, DATA AND ITS DOCUMENTATION, EVEN IF ANY OF NYU PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.

NYU SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING THE SOFTWARE AND DATA, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE SOFTWARE.

THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

If you use this resource, cite: Alberto Chierici and Nizar Habash. 2024. HelloThere: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

//////////////////////////////////////////////////////////////////////

Citation

If you use this dataset in your research, please cite it as follows:

@inproceedings{chierici-habash-2024-hellothere,
    title = "{HelloThere}: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars",
    author = "Chierici, Alberto and Habash, Nizar",
    booktitle = "Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    year = "2024",
    publisher = "Association for Computational Linguistics",
}
