Download Page
MADAR Parallel Corpus Dataset

Summary

The MADAR corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and Modern Standard Arabic (MSA). The corpus was created by translating selected sentences from the Basic Traveling Expression Corpus (BTEC) (Takezawa et al., 2007) into the different dialects. Details of the translation process and of the source and target languages are given in Bouamor et al. (2018).

The MADAR corpus covers the following 25 Arab cities: Aleppo, Alexandria, Algiers, Amman, Aswan, Baghdad, Basra, Beirut, Benghazi, Cairo, Damascus, Doha, Fes, Jeddah, Jerusalem, Khartoum, Mosul, Muscat, Rabat, Riyadh, Salt, Sanaa, Sfax, Tripoli, and Tunis.

This release contains two datasets:

Corpus-26: a set of 2,000 BTEC sentences translated into all 25 city dialects (each sentence has 25 corresponding parallel translations), in addition to MSA.
Corpus-6: a set of 12,000 sentences translated into the dialects of five selected cities (Doha, Beirut, Cairo, Tunis, and Rabat), in addition to MSA.
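
The sizes of the two datasets follow from the figures above. The short Python sketch below works out the arithmetic. It assumes that every source sentence has exactly one translation in each listed variant, and the function and variable names are illustrative only: they are not part of the release and say nothing about how the files are laid out.

```python
# The 25 city dialects listed on this page.
CORPUS_26_CITIES = [
    "Aleppo", "Alexandria", "Algiers", "Amman", "Aswan",
    "Baghdad", "Basra", "Beirut", "Benghazi", "Cairo",
    "Damascus", "Doha", "Fes", "Jeddah", "Jerusalem",
    "Khartoum", "Mosul", "Muscat", "Rabat", "Riyadh",
    "Salt", "Sanaa", "Sfax", "Tripoli", "Tunis",
]

# The five cities selected for Corpus-6.
CORPUS_6_CITIES = ["Doha", "Beirut", "Cairo", "Tunis", "Rabat"]


def variant_count(cities):
    """Number of Arabic variants: one per city dialect, plus MSA."""
    return len(cities) + 1


def total_sentences(source_sentences, cities):
    """Total Arabic sentences, assuming one translation per variant."""
    return source_sentences * variant_count(cities)


corpus_26_total = total_sentences(2_000, CORPUS_26_CITIES)   # 2,000 x 26
corpus_6_total = total_sentences(12_000, CORPUS_6_CITIES)    # 12,000 x 6

print(variant_count(CORPUS_26_CITIES), corpus_26_total)
print(variant_count(CORPUS_6_CITIES), corpus_6_total)
```

Under that assumption, the names Corpus-26 and Corpus-6 match the number of Arabic variants in each set (25 or 5 city dialects plus MSA).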

Note: We do not provide the English or French versions of the corpus because of copyright restrictions. To obtain the English and French versions, contact the U-Star Consortium (u-star-sec@ustarconsortium.com).

Team

Houda Bouamor
Nizar Habash
Mohammad Salameh
Wajdi Zaghouani
Owen Rambow
Dana Abdulrahim
Ossama Obeid
Salam Khalifa
Fadhl Eryani
Alexander Erdmann
Kemal Oflazer

Publications

Bouamor, Houda, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann and Kemal Oflazer. The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.

Download

By downloading the MADAR Parallel Corpus Dataset files from HERE you agree to the terms of the license below.

//////////////////////////////////////////////////////////////////////
// License for MADAR Corpus/Lexicon Dataset
//////////////////////////////////////////////////////////////////////

Copyright 2019 Carnegie Mellon University and New York University Abu Dhabi. All Rights Reserved.

A license to use and copy this dataset and its documentation solely for your internal research and evaluation purposes, without fee and without a signed licensing agreement, is hereby granted upon your download of the dataset, through which you agree to the following: 1) the above copyright notice, this paragraph and the following three paragraphs will prominently appear in all internal copies and modifications; 2) no rights to sublicense or further distribute this software are granted; 3) no rights to modify this dataset are granted; and 4) no rights to assign this license are granted. Please Contact the Carnegie Mellon University "CMU" Center for Technology Transfer and Enterprise Creation, 4615 Forbes Avenue, Suite 302, Pittsburgh, PA 15213 - phone 412.268.7393, for commercial licensing opportunities, or for further distribution, modification or license rights.

Created by Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann and Kemal Oflazer.

IN NO EVENT SHALL CMU OR NYU, OR THEIR EMPLOYEES, OFFICERS, AGENTS OR TRUSTEES ("COLLECTIVELY "CMU/NYU PARTIES") BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM YOUR USE OF THIS DATASET AND ITS DOCUMENTATION, EVEN IF ANY OF CMU/NYU PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.

CMU/NYU SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING THE DATASET, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE SOFTWARE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

If you use this resource, cite:

Bouamor, Houda, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann and Kemal Oflazer. The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.

//////////////////////////////////////////////////////////////////////

CAMeL Lab Resources
CAMeL Lab