MADAR (Multi-Arabic Dialect Applications and Resources) is a three-year joint project among the NLP Group at Carnegie Mellon University in Qatar (CMU-Q), the Computational Approaches to Modeling Language (CAMEL) Lab at New York University Abu Dhabi (NYUAD), and Columbia University. The project also involves collaborators from the University of Bahrain (UoB). The project aims at improving dialectal Arabic processing by:
@inproceedings{bouamor-etal-2019-madar, title = {The MADAR Shared Task on Arabic Fine-Grained Dialect Identification}, author = {Houda Bouamor and Sabit Hassan and Nizar Habash}, booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Workshop}, year = {2019}, address = {Florence, Italy}, language = {english} }
@inproceedings{erdmann-etal-2019-little, title = {A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance}, author = {Alexander Erdmann and Salam Khalifa and Mai Oudah and Nizar Habash and Houda Bouamor}, booktitle = {Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology}, year = {2019}, address = {Florence, Italy}, language = {english} }
@inproceedings{obeid-etal-2019-adida, author = {Ossama Obeid and Mohammad Salameh and Houda Bouamor and Nizar Habash}, title = {ADIDA: Automatic Dialect Identification for Arabic}, booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)}, year = {2019}, address = {Minneapolis, Minnesota}, language = {english} }
@inproceedings{Erdmann-etal-2018, author = {Alexander Erdmann and Nasser Zalmout and Nizar Habash}, title = {Addressing Noise in Multidialectal Word Embeddings}, booktitle = {Proceedings of the Conference of the Association for Computational Linguistics}, year = {2018}, address = {Melbourne, Australia}, language = {english} }
@inproceedings{Habash-etal-2018, author = {Nizar Habash and Salam Khalifa and Fadhl Eryani and Owen Rambow and Dana Abdulrahim and Alexander Erdmann and Reem Faraj and Wajdi Zaghouani and Houda Bouamor and Nasser Zalmout and Sara Hassan and Faisal Al Shargi and Sakhar Alkhereyf and Basma Abdulkareem and Ramy Eskander and Mohammad Salameh and Hind Saddiki}, title = {Unified Guidelines and Resources for Arabic Dialect Orthography}, booktitle = {The International Conference on Language Resources and Evaluation}, year = {2018}, address = {Miyazaki, Japan}, language = {english} }
@inproceedings{salameh-etal-2018-fine, title = {Fine-Grained Arabic Dialect Identification}, author = {Mohammad Salameh and Houda Bouamor and Nizar Habash}, booktitle = {Proceedings of the 27th International Conference on Computational Linguistics}, year = {2018}, address = {Santa Fe, New Mexico, USA}, language = {english} }
@inproceedings{Obeid-etal-2018, author = {Ossama Obeid and Salam Khalifa and Nizar Habash and Houda Bouamor and Wajdi Zaghouani and Kemal Oflazer}, title = {MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction}, booktitle = {The International Conference on Language Resources and Evaluation}, year = {2018}, address = {Miyazaki, Japan}, language = {english} }
@inproceedings{Bouamor-etal-2018, author = {Houda Bouamor and Nizar Habash and Mohammad Salameh and Wajdi Zaghouani and Owen Rambow and Dana Abdulrahim and Ossama Obeid and Salam Khalifa and Fadhl Eryani and Alexander Erdmann and Kemal Oflazer}, title = {The MADAR Arabic Dialect Corpus and Lexicon}, booktitle = {The International Conference on Language Resources and Evaluation}, year = {2018}, address = {Miyazaki, Japan}, language = {english} }
@inproceedings{Erdmann-etal-2017, author = {Alexander Erdmann and Nizar Habash and Dima Taji and Houda Bouamor}, title = {Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic}, booktitle = {Proceedings of the Machine Translation Summit}, year = {2017}, address = {Nagoya, Japan}, language = {english} }
The MADAR Corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and Modern Standard Arabic (MSA). The corpus was created by translating selected sentences from the Basic Traveling Expression Corpus (BTEC) [1], in French and English, into the different dialects. The MADAR Corpus will be made available soon to the research community under a non-commercial license. While we only provide the Arabic portions of the corpus, the English parallel text can be acquired directly from the USTAR consortium.
The MADAR Lexicon is a collection of 1,045 concepts extracted from the MADAR Corpus. Each concept is defined as a triplet of words or phrases in English, French, and MSA, along with multiple equivalent dialectal forms covering 25 cities from the Arab World. Each dialectal form includes its CODA orthography and CAPHI phonology [2]. The MADAR Lexicon will be made available soon to the research community under a non-commercial license.
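As an illustration only, the lexicon structure described above could be modeled along the following lines. This is a minimal sketch, not the released format: the class names, field names, and the `forms_by_city` helper are assumptions made for this example, and the values are placeholders rather than actual lexicon entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DialectalForm:
    """One dialectal realization of a concept (hypothetical field names)."""
    city: str   # one of the 25 cities covered by the lexicon
    coda: str   # spelling in CODA orthography
    caphi: str  # pronunciation in CAPHI


@dataclass
class Concept:
    """A concept identified by its English/French/MSA triplet."""
    english: str
    french: str
    msa: str
    forms: List[DialectalForm] = field(default_factory=list)

    def forms_by_city(self) -> Dict[str, List[DialectalForm]]:
        """Group forms by city; a city may have several equivalent forms."""
        grouped: Dict[str, List[DialectalForm]] = {}
        for form in self.forms:
            grouped.setdefault(form.city, []).append(form)
        return grouped


# Placeholder values only; these are not entries from the MADAR Lexicon.
concept = Concept(english="EN_WORD", french="FR_WORD", msa="MSA_WORD")
concept.forms.append(DialectalForm("CITY_A", "CODA_1", "CAPHI_1"))
concept.forms.append(DialectalForm("CITY_A", "CODA_2", "CAPHI_2"))
concept.forms.append(DialectalForm("CITY_B", "CODA_3", "CAPHI_3"))
grouped = concept.forms_by_city()
```

Keeping the forms as a flat list and grouping on demand reflects the fact that a single city can have more than one equivalent form for the same concept.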
To help collect morphological data for the project, we designed the MADAR Annotation Interface (MADARi). MADARi is a web-based framework consisting of a management interface that allows the lead annotator to upload documents and assign tasks to annotators, and an annotation interface that allows annotators to efficiently add morphological data to their assigned documents.
If you have any suggestions or fixes for the lexicon, please take a moment to complete the suggestion form.
* Dialectal Arabic words are written in the Conventional Orthography for Dialectal Arabic (CODA). The guidelines for CODA are here.
* Pronunciation of Arabic words is provided in the CAMEL Arabic Phonetic Inventory (CAPHI), a simplified form of the International Phonetic Alphabet. The CAPHI guidelines are here.
[1] Takezawa, T., Kikui, G., Mizushima, M., and Sumita, E. (2007). Multilingual Spoken Language Corpus Development for Communication Research. Computational Linguistics and Chinese Language Processing, 12(3):303–324.
[2] Habash, N., Eryani, F., Khalifa, S., Rambow, O., Abdulrahim, D., Erdmann, A., Faraj, R., Zaghouani, W., Bouamor, H., Zalmout, N., Hassan, S., Al Shargi, F., Alkhereyf, S., Abdulkareem, B., Eskander, R., Salameh, M., and Saddiki, H. (2018). Unified Guidelines and Resources for Arabic Dialect Orthography. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.