The Annotated Gumar Corpus is a manually annotated corpus of Gulf Arabic, specifically Emirati Arabic. The corpus consists of 200,000 words drawn from eight novels in the Gumar Corpus. Each word is annotated in context for tokenization, part of speech, lemmatization, spelling adjustment, and English gloss, and each sentence is annotated for dialect identification.
This resource was developed at the Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi. It was supported by a New York University Abu Dhabi Research Enhancement Fund grant.
By downloading The Annotated Gumar Corpus files from HERE you agree to the terms of the license below.
//////////////////////////////////////////////////////////////////////////////
// License for The Annotated Gumar Corpus
//////////////////////////////////////////////////////////////////////////////
Copyright 2018-2020 New York University Abu Dhabi. All Rights Reserved. A license to
use and copy this software, data and its documentation solely for your
internal research and evaluation purposes, without fee and without a
signed licensing agreement, is hereby granted upon your download of
the software, through which you agree to the following: 1) the above
copyright notice, this paragraph and the following three paragraphs
will prominently appear in all internal copies and modifications; 2)
no rights to sublicense or further distribute this software are
granted; 3) no rights to modify this software are granted; and 4) no
rights to assign this license are granted. Please contact the Office
of Industrial Liaison, New York University, One Park Avenue, 6th
Floor, New York, NY 10016 (212) 263-8178, for commercial licensing
opportunities, or for further distribution, modification or license
rights.
Created by Salam Khalifa, Nizar Habash, Fadhl Eryani, and Ossama Obeid at the
Computational Approaches to Modeling Language (CAMeL) Lab at New York University
Abu Dhabi.
IN NO EVENT SHALL NYU, OR ITS EMPLOYEES, OFFICERS, AGENTS OR TRUSTEES
(COLLECTIVELY "NYU PARTIES") BE LIABLE TO ANY PARTY FOR DIRECT,
INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY KIND,
INCLUDING LOST PROFITS, ARISING OUT OF ANY CLAIM RESULTING FROM YOUR
USE OF THIS SOFTWARE, DATA AND ITS DOCUMENTATION, EVEN IF ANY OF NYU
PARTIES HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM OR DAMAGE.
NYU SPECIFICALLY DISCLAIMS ANY WARRANTIES OF ANY KIND REGARDING THE
SOFTWARE AND DATA, INCLUDING, BUT NOT LIMITED TO, NON-INFRINGEMENT,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE, OR THE ACCURACY OR USEFULNESS, OR COMPLETENESS OF THE
SOFTWARE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY,
PROVIDED HEREUNDER IS PROVIDED COMPLETELY "AS IS". REGENTS HAS NO
OBLIGATION TO PROVIDE FURTHER DOCUMENTATION, MAINTENANCE, SUPPORT,
UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
Please cite Khalifa et al. (2018, 2020) if you use The Annotated Gumar Corpus in your research:
Khalifa, Salam, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi.
A Morphologically Annotated Corpus of Emirati Arabic. In The Proceedings of LREC 2018.
Khalifa, Salam, Nasser Zalmout, and Nizar Habash. Morphological Analysis and Disambiguation for Gulf Arabic:
The Interplay between Resources and Methods. In The Proceedings of LREC 2020.
//////////////////////////////////////////////////////////////////////////////