Download Page
Qatar Arabic Language Bank


The Qatar Arabic Language Bank (QALB) Project is a joint project between the NLP Group at Carnegie Mellon University in Qatar (CMU-Q) and the Columbia Arabic Dialect Modeling group (CADiM) at Columbia University, in collaboration with the University of Paris-Sud and the Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi. The QALB Project was funded by the Qatar National Research Fund (a member of the Qatar Foundation), grant NPRP-4-1058-1-168.

The project aims to build a large corpus of manually corrected Arabic text to support the development of automatic correction tools. It also includes research on statistical techniques for automatic correction of Arabic text.

The data in this package includes portions of the QALB Corpus intended for the shared task on Automatic Arabic Error Correction (QALB-2015). The shared task was part of the Workshop on Arabic Natural Language Processing at ACL 2015. QALB-2015 is an extension of QALB-2014, which was part of the Workshop on Arabic Natural Language Processing at EMNLP 2014.

This special release of QALB-2015 includes (1) commentaries written in response to Al Jazeera articles and (2) L2 essays (texts written by learners of Arabic as a Second Language). All of the data includes corrections of the language errors in these texts, made by native Arabic speakers at CMU-Q. The Al Jazeera data is the same data that was used for QALB-2014.

Update 0.9.0 on 04 May 2021: Added the QALB-2015 test set references (Alj-QALB2015-test and L2-QALB2015-test) and the shared task description papers.