MADAR (Multi-Arabic Dialect Applications and Resources) is a three-year joint project among the NLP Group at Carnegie Mellon University in Qatar (CMU-Q), the Computational Approaches to Modeling Language (CAMEL) Lab at New York University Abu Dhabi (NYUAD), and Columbia University. The project also involves collaborators from the University of Bahrain (UoB). The project aims at improving dialectal Arabic processing by:

The MADAR Project is the largest effort to date, in both scale and depth, on natural language processing of Arabic dialects.




MADAR Corpus

The MADAR Corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and Modern Standard Arabic (MSA). The corpus was created by translating selected sentences from the Basic Travel Expression Corpus (BTEC) [1] from English and French into the different dialects. The MADAR Corpus will be made available soon to the research community under a non-commercial license. While we only provide the Arabic portions of the corpus, the English parallel text can be acquired directly from the USTAR consortium.
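As a rough illustration of what "parallel" means here, the sketch below models a corpus as sentence IDs mapped to one version of the sentence per language variety, and pulls out aligned pairs for any two varieties. The layout, the `ParallelCorpus` type, the `aligned_pairs` function, and the sentence IDs are all assumptions made for this example; they do not describe the released file format, and the sentence texts are placeholders rather than corpus data.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical layout: sentence id -> {variety label -> sentence text}.
ParallelCorpus = Dict[str, Dict[str, str]]


def aligned_pairs(
    corpus: ParallelCorpus, source: str, target: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs for ids that have both varieties."""
    for sentence_id in sorted(corpus):
        versions = corpus[sentence_id]
        if source in versions and target in versions:
            yield versions[source], versions[target]


# Placeholder sentences only; these are not taken from the MADAR Corpus.
toy_corpus: ParallelCorpus = {
    "s1": {
        "MSA": "<MSA sentence 1>",
        "Cairo": "<Cairo sentence 1>",
        "Doha": "<Doha sentence 1>",
    },
    "s2": {
        "MSA": "<MSA sentence 2>",
        "Cairo": "<Cairo sentence 2>",
    },
}

# Only s1 has both a Cairo and a Doha version, so one pair is produced.
pairs = list(aligned_pairs(toy_corpus, "Cairo", "Doha"))
```

Keying every version by a shared sentence ID is what makes it possible to compare any two dialects directly, not only each dialect against MSA.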

MADAR Lexicon

The MADAR Lexicon is a collection of 1,045 concepts extracted from the MADAR Corpus. Each concept is defined by a triplet of words or phrases in English, French, and MSA, along with multiple equivalent dialectal forms covering 25 cities from the Arab World. Each dialectal form includes its CODA orthography and CAPHI phonology [2]. The MADAR Lexicon will be made available soon to the research community under a non-commercial license.
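One way to picture a lexicon entry as described above is as a concept keyed by its English, French, and MSA triplet, holding a list of city-specific forms. The following is a minimal sketch under that reading; the class names, field names, and the `forms_for` helper are hypothetical and are not the lexicon's actual schema, and the CODA and CAPHI values are placeholders rather than real entries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DialectalForm:
    city: str   # city whose dialect this form belongs to
    coda: str   # spelling in CODA orthography
    caphi: str  # pronunciation in CAPHI


@dataclass
class Concept:
    english: str
    french: str
    msa: str
    forms: List[DialectalForm] = field(default_factory=list)

    def forms_for(self, city: str) -> List[DialectalForm]:
        """Return every form recorded for one city (a city may have several)."""
        return [form for form in self.forms if form.city == city]


# The triplet is real vocabulary; the dialectal values are placeholders.
book = Concept(english="book", french="livre", msa="كتاب")
book.forms.append(DialectalForm("Cairo", "<CODA form A>", "<CAPHI form A>"))
book.forms.append(DialectalForm("Cairo", "<CODA form B>", "<CAPHI form B>"))
book.forms.append(DialectalForm("Beirut", "<CODA form C>", "<CAPHI form C>"))

cairo_forms = book.forms_for("Cairo")
```

Storing forms as a list instead of one value per city reflects the statement that a concept can have multiple equivalent dialectal forms.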


To help collect morphological data for the project, we designed the MADAR Annotation Interface (MADARi). MADARi is a web-based framework consisting of a management interface that allows the lead annotator to upload documents and assign tasks to annotators, and an annotation interface that allows annotators to efficiently add morphological data to their assigned documents.

MADAR Lexicon Viewer

If you have any suggestions or fixes for the lexicon, please take a moment to complete the suggestion form.

* Dialectal Arabic words are written in the Conventional Orthography for Dialectal Arabic (CODA). The guidelines for CODA are here.

* Pronunciation of Arabic words is provided in the CAMEL Arabic Phonetic Inventory (CAPHI), a simplified form of the International Phonetic Alphabet (IPA). The CAPHI guidelines are here.

MADAR Corpus Viewer

[1] Takezawa, T., Kikui, G., Mizushima, M., and Sumita, E. (2007). Multilingual Spoken Language Corpus Development for Communication Research. Computational Linguistics and Chinese Language Processing, 12(3):303–324.

[2] Habash, N., Eryani, F., Khalifa, S., Rambow, O., Abdulrahim, D., Erdmann, A., Faraj, R., Zaghouani, W., Bouamor, H., Zalmout, N., Hassan, S., Al-Shargi, F., Alkhereyf, S., Abdulkareem, B., Eskander, R., Salameh, M., and Saddiki, H. (2018). Unified Guidelines and Resources for Arabic Dialect Orthography. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.