The Qatar Arabic Language Bank (QALB) Project is a joint project between the NLP Group at Carnegie Mellon University in Qatar (CMU-Q) and the Columbia Arabic Dialect Modeling group (CADiM) at Columbia University, in collaboration with the University of Paris-Sud and the Computational Approaches to Modeling Language (CAMeL) Lab at New York University in Abu Dhabi. The QALB Project was funded by the Qatar National Research Fund (a member of the Qatar Foundation), grant NPRP-4-1058-1-168.
The project aims to build a large corpus of manually corrected Arabic text for building automatic correction tools. Furthermore, the project includes research on statistical techniques for automatic correction of Arabic text.
The data in this package includes portions of the QALB Corpus intended for the shared task on Automatic Arabic Error Correction (QALB-2015). The Shared Task was part of the Workshop on Arabic Natural Language Processing at ACL 2015. QALB-2015 is an extension of QALB-2014, which was part of the Workshop on Arabic Natural Language Processing at EMNLP 2014.
This special release of QALB-2015 includes (1) commentaries written in response to Al Jazeera articles and (2) L2 essays (texts written by learners of Arabic as a Second Language). All of the data includes corrections of language errors in these texts by native Arabic speakers at CMU-Q. The Al Jazeera data is the same data that was used for QALB-2014.
Update 0.9.0 on 04 May 2021: Added the QALB 2015 test set references (for Alj-QALB2015-test and L2-QALB2015-test) and the shared task description papers.
By downloading the QALB Shared Task files from HERE, you agree to the terms of the two licenses below.
//////////////////////////////////////////////////////////////////////
// License for QALB: Qatar Arabic Language Bank - WANLP-ACL 2015 Shared Task Data
Release 0.9.0 04 April 2021
//////////////////////////////////////////////////////////////////////
Copyright (c) 2015 Columbia University and the Carnegie Mellon University Qatar. All Rights Reserved.
A license to use and copy this dataset and its documentation solely for your internal research and evaluation purposes, without fee and without a signed licensing agreement, is hereby granted upon your download of the dataset, through which you agree to the following: 1) the above copyright notice, this paragraph and the following three paragraphs will prominently appear in all internal copies and modifications; 2) no rights to sublicense or further distribute this software are granted; 3) no rights to modify this dataset are granted; and 4) no rights to assign this license are granted. Please Contact the Carnegie Mellon University "CMU" Center for Technology Transfer and Enterprise Creation, 4615 Forbes Avenue, Suite 302, Pittsburgh, PA 15213 - phone 412.268.7393, for commercial licensing opportunities, or for further distribution, modification or license rights.
Created by Nizar Habash, Behrang Mohit, Wajdi Zaghouani, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani, Houda Bouamor and Kemal Oflazer.
IN NO EVENT SHALL CMU OR COLUMBIA, OR THEIR EMPLOYEES, OFFICERS, AGENTS OR TRUSTEES ("COLLECTIVELY "CMU/COLUMBIA PARTIES") BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM YOUR USE OF THIS DATASET AND ITS DOCUMENTATION, EVEN IF ANY OF CMU/COLUMBIA PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.
CMU/COLUMBIA SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING THE DATASET, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE SOFTWARE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
If you use this resource, cite the following papers:
Rozovskaya, Alla, Houda Bouamor, Nizar Habash, Wajdi Zaghouani, Ossama Obeid, and Behrang Mohit. "The second QALB shared task on automatic text correction for Arabic." In Proceedings of the Second Workshop on Arabic Natural Language Processing, pp. 26-35. 2015.
Zaghouani, Wajdi, Nizar Habash, Houda Bouamor, Alla Rozovskaya, Behrang Mohit, Abeer Heider, and Kemal Oflazer. "Correction annotation for non-native Arabic texts: Guidelines and corpus." In Proceedings of The 9th Linguistic Annotation Workshop, pp. 129-139. 2015.
Mohit, Behrang, Alla Rozovskaya, Nizar Habash, Wajdi Zaghouani, and Ossama Obeid. "The first QALB shared task on automatic text correction for Arabic." In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pp. 39-47. 2014.
Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani and Kemal Oflazer. "Large-scale Arabic Error Annotation: Guidelines and Framework." In Proceedings of the 9th Conference on Language Resources and Evaluation (LREC-2014).
Ossama Obeid, Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Kemal Oflazer and Nadi Tomeh, 2013. "A Web-based Annotation Framework For Large-Scale Text Correction." In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP-2013).
//////////////////////////////////////////////////////////////////////