Download Page
The Margarita Dialogue Corpus

Summary

The Margarita Dialogue Corpus is a collection of question-answer pairs defined both outside the context of a conversation and within dialogues between one person and several different interlocutors. The corpus is part of a methodology for building the knowledge base of time-offset interaction applications and unstructured dialogue systems. Its subject is Margarita Bicec, a student at New York University Abu Dhabi. She first defined a Knowledge Base (KB) of question-answer pairs by brainstorming, then expanded it by recording and transcribing multiple conversations with strangers to fill in other possible questions. She then recorded videos of her answers using the TOIA recorder developed by Abu Ali et al. (2018) and annotated twenty transcribed dialogues, indicating for each question which answers in the KB (if any) would be appropriate to play. Ten dialogues (named 'TRAIN') are used to expand the KB, and ten 'TEST' dialogues are held out to test answer selection models on unseen dialogues.
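To make the answer selection task concrete, here is a minimal sketch of a retrieval baseline: given a new question, it returns the answer of the most similar KB question, or nothing when no KB question is similar enough. It is not the method used by the corpus authors. The word-overlap (Jaccard) scoring, the function names, the threshold value, and the toy KB rows are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_answer(question, kb, threshold=0.2):
    """Return the answer of the best-matching KB question, or None.

    kb is a list of (question, answer) tuples. The threshold is an
    illustrative value, not one taken from the corpus or its paper.
    """
    query = tokenize(question)
    best_score, best_answer = 0.0, None
    for kb_question, kb_answer in kb:
        score = jaccard(query, tokenize(kb_question))
        if score > best_score:
            best_score, best_answer = score, kb_answer
    return best_answer if best_score >= threshold else None


# Hypothetical KB rows, invented for illustration only (not from the corpus).
toy_kb = [
    ("What is your name?", "My name is Margarita."),
    ("Where do you study?", "I study at New York University Abu Dhabi."),
]

print(select_answer("Where do you study now?", toy_kb))
print(select_answer("How old is the campus library?", toy_kb))
```

Returning `None` below the threshold mirrors the annotation scheme, in which a question may have no appropriate answer in the KB.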

This release contains three datasets:

Knowledge Base (KB): a set of 892 Question-Answer pairs. Attributes:

'Category': topic category subjectively defined by the annotator
'Q': question
'A': answer

Dialogues: a set of 659 Question-Answer pairs ordered by dialogue turns. Attributes:

'Mode': 'PER' or 'EDU', indicating a personal dialogue or a dialogue about the NYUAD campus, respectively
'Conversation': conversation (or dialogue) identifier
'Exchange': numeric order of the Q-A pair (or turn) within one conversation (or dialogue)
'Experiment': 'TRAIN' or 'TEST', indicating whether the dialogue can be used for training or testing purposes.
'Q': question
'A': ground truth answer
'BA1-6': first, second, ..., sixth most appropriate answer present in the KB data

Video clips: a set of 430 video clips recording the unique answers in the KB. The file KBvideos.csv links the KB's answers to the corresponding video clips.
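As a rough sketch of how the Dialogues attributes might be used, the code below groups turns by conversation, orders them by 'Exchange', and checks a predicted answer against the annotated appropriate answers. It assumes the data is available as CSV with the attribute names listed above, and that 'BA1-6' corresponds to six columns named BA1 through BA6. The exact file layout is not stated here, so those names, the helper functions, and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed column names; the released files may label these differently.
BA_COLUMNS = ["BA1", "BA2", "BA3", "BA4", "BA5", "BA6"]

# Hypothetical rows, invented for illustration only (not from the corpus).
SAMPLE_CSV = (
    "Mode,Conversation,Exchange,Experiment,Q,A,BA1,BA2,BA3,BA4,BA5,BA6\n"
    "PER,1,2,TEST,Where are you from?,I am from X.,I come from X.,,,,,\n"
    "PER,1,1,TEST,Hi!,Hello.,Hello!,Hi there.,,,,\n"
    "EDU,2,1,TRAIN,Is there a gym?,Yes.,,,,,,\n"
)


def load_dialogues(handle):
    """Group rows by (Experiment, Conversation), sorted by Exchange."""
    dialogues = defaultdict(list)
    for row in csv.DictReader(handle):
        dialogues[(row["Experiment"], row["Conversation"])].append(row)
    for turns in dialogues.values():
        turns.sort(key=lambda r: int(r["Exchange"]))
    return dict(dialogues)


def acceptable_answers(row):
    """Return the non-empty BA1..BA6 entries for one dialogue turn."""
    return [row[c] for c in BA_COLUMNS if (row.get(c) or "").strip()]


def is_correct(row, predicted):
    """True if the prediction is one of the annotated KB answers, or if
    nothing is predicted (None) and the annotator listed no KB answer."""
    answers = acceptable_answers(row)
    if predicted is None:
        return not answers
    return predicted in answers


dialogues = load_dialogues(io.StringIO(SAMPLE_CSV))
test_turns = [t for (split, _), turns in dialogues.items()
              if split == "TEST" for t in turns]
print(len(test_turns), "TEST turns")
```

Treating a turn with no listed BA answer as correct only when the model abstains is one possible scoring convention. Other choices, such as weighting BA1 above BA6, would fit the same data.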

Team

Alberto Chierici
Nizar Habash
Margarita Bicec

Publications

Abu Ali, Dana, Muaz Ahmad, Hayat Al Hassan, Paula Dozsa, Ming Hu, Jose Varias, and Nizar Habash. "A Bilingual Interactive Human Avatar Dialogue System." In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 241-244. 2018.
Chierici, Alberto, Nizar Habash, and Margarita Bicec. "The Margarita Dialogue Corpus: A Data Set for Time-Offset Interactions and Unstructured Dialogue Systems." In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 476-484. 2020.

Download

By downloading the Margarita Dialogue Corpus Dataset files from HERE you agree to the terms of the license below.

//////////////////////////////////////////////////////////////////////
// License for The Margarita Dialogue Corpus
//////////////////////////////////////////////////////////////////////

Copyright 2019 New York University Abu Dhabi. All Rights Reserved.

A license to use and copy this software, data and its documentation solely for your internal research and evaluation purposes, without fee and without a signed licensing agreement, is hereby granted upon your download of the software, through which you agree to the following: 1) the above copyright notice, this paragraph and the following three paragraphs will prominently appear in all internal copies and modifications; 2) no rights to sublicense or further distribute this software are granted; 3) no rights to modify this software are granted; and 4) no rights to assign this license are granted. Please Contact the Office of Industrial Liaison, New York University, One Park Avenue, 6th Floor, New York, NY 10016 (212) 263-8178, for commercial licensing opportunities, or for further distribution, modification or license rights.

Created by Alberto M. Chierici, Nizar Habash and Margarita Bicec at the Computational Approaches to Modeling Language (CAMeL) Lab in New York University Abu Dhabi.

IN NO EVENT SHALL NYU, OR ITS EMPLOYEES, OFFICERS, AGENTS OR TRUSTEES (COLLECTIVELY "NYU PARTIES") BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM YOUR USE OF THIS SOFTWARE, DATA AND ITS DOCUMENTATION, EVEN IF ANY OF NYU PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.

NYU SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING THE SOFTWARE and DATA, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE SOFTWARE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

If you use this resource, cite:

Chierici et al. (2020): Alberto M. Chierici, Nizar Habash, and Margarita Bicec (2020). The Margarita Dialogue Corpus: A Data Set for Time-Offset Interactions and Unstructured Dialogue Systems. Retrieved from http://resources.camel-lab.com.

//////////////////////////////////////////////////////////////////////
