The MADAR CODA Corpus contains 10,000 sentences corrected using the Conventional Orthography for Dialectal Arabic (CODA), along with their original raw form. The sentences were taken from the 5 Arab city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) that make up MADAR.Corpus6, a subset of the MADAR Parallel Corpora. The corpus was created by a human annotator with the aid of a bootstrapping technique, and validated manually. More details are provided in Eryani et al. (2020) (see references).
This release contains two directories, train and test, each containing 5 tsv files, one per city dialect. Each tsv contains 1,000 sentences in raw and CODA form. To ease usability between the different corpora, each tsv also includes the additional columns used in the MADAR Parallel Corpora.
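The directory layout described above (train and test, each holding one tsv per city dialect) can be read with a short script. The following is only a minimal Python sketch under stated assumptions: the root path passed in, the `.tsv` file extension, and the helper names `load_split` and `load_corpus` are all hypothetical and not part of the release. Because the column names are not given here, the sketch returns raw rows and does not assume a header line.

```python
import csv
from pathlib import Path


def load_split(root, split):
    """Read every .tsv file under root/split into a dict of row lists.

    Keys are file names without the extension (assumed to identify the
    city dialect); values are lists of rows, each row a list of fields.
    """
    tables = {}
    for path in sorted(Path(root, split).glob("*.tsv")):
        with path.open(encoding="utf-8", newline="") as handle:
            # QUOTE_NONE keeps quotation marks inside sentences as literal text.
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            tables[path.stem] = [row for row in reader]
    return tables


def load_corpus(root):
    """Load both splits; the returned dict has the keys 'train' and 'test'."""
    return {split: load_split(root, split) for split in ("train", "test")}
```

Calling `load_corpus` on the directory where the release was unpacked should give two entries, each with five tables. Comparing the row count of each table against the 1,000 sentences stated above is a quick sanity check, keeping in mind that a header row, if the files have one, adds one extra line.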
By downloading the MADAR CODA Corpus files, you agree to the terms of the license below.
//////////////////////////////////////////////////////////////////////////////
// License for MADAR CODA Corpus
//////////////////////////////////////////////////////////////////////////////
Copyright 2019 Carnegie Mellon University and New York University Abu
Dhabi. All Rights Reserved.
A license to use and copy this dataset and its documentation solely
for your internal research and evaluation purposes, without fee and
without a signed licensing agreement, is hereby granted upon your
download of the dataset, through which you agree to the following: 1)
the above copyright notice, this paragraph and the following three
paragraphs will prominently appear in all internal copies and
modifications; 2) no rights to sublicense or further distribute this
software are granted; 3) no rights to modify this dataset are granted;
and 4) no rights to assign this license are granted. Please Contact
the Carnegie Mellon University "CMU" Center for Technology Transfer
and Enterprise Creation, 4615 Forbes Avenue, Suite 302, Pittsburgh, PA
15213 - phone 412.268.7393, for commercial licensing opportunities, or
for further distribution, modification or license rights.
Created by Fadhl Eryani, Nizar Habash, Houda Bouamor, and Salam Khalifa.
IN NO EVENT SHALL CMU OR NYU, OR THEIR EMPLOYEES, OFFICERS, AGENTS OR
TRUSTEES ("COLLECTIVELY "CMU/NYU PARTIES") BE LIABLE TO ANY PARTY FOR
DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY
KIND, INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM
YOUR USE OF THIS DATASET AND ITS DOCUMENTATION, EVEN IF ANY OF CMU/NYU
PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.
CMU/NYU SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING
THE DATASET, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE
SOFTWARE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY,
PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". REGENTS HAS NO
OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT,
UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
Please cite Eryani et al. (2020) if you use the MADAR CODA Corpus in your research.
//////////////////////////////////////////////////////////////////////////////