Arabic Dialect Identification
With large streams of dialectal Arabic data coming from social media, researchers have shifted their focus from Modern Standard Arabic (MSA) to dialectal Arabic. Some researchers have left behind the rich library of text-mining tools tailored for MSA and started developing dialect-specific tools from scratch. Others have chosen to build on the existing MSA tools by extending their validity to dialects. Regardless of how a researcher decides to deal with the Arabic dialects, the first challenge remains the same: how to identify the Arabic variant(s) the data is written in? The dialect identification task is classically approached by hiring human annotators. Multiple annotators are commonly assigned to label each sentence in order to maintain good accuracy. The time and cost needed to finish the task are directly proportional to the size of the data. Bearing in mind the large size of online data, the classical method is not very practical. In this paper, a recent machine-based approach is explored. The dataset employed is an open-source dialectal dataset that is labeled using source information. Features are sub-word tokens extracted with a trained BPE-based segmentation model. A separate n-gram model is trained for each dialect that appears in the dataset. When a new sequence of features is passed to the system, each dialectal language model scores the sequence. The sequence is labeled with the dialect corresponding to the model with the highest probabilistic score. Incorporating segmentation in the dialect identification framework yielded an improvement of 5 points over the baseline results. Similar performance was maintained when the system was applied to out-of-domain test datasets. Therefore, for distinguishing between closely related languages with morphological differences, such as Arabic dialects, segmentation could help in extracting frequent language-specific sub-word features and reducing data sparsity.
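The scoring scheme described above (one n-gram model per dialect, label chosen by the highest score) can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the input is already segmented into sub-word tokens, uses a bigram model with add-one smoothing as a stand-in for whatever n-gram order and smoothing the author used, and the names `BigramLM`, `train_models`, `identify`, and the toy labels and tokens are all hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram model over sub-word tokens (illustrative only)."""

    def __init__(self, sequences):
        self.context_counts = Counter()  # how often each token appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        self.vocab = {"</s>"}
        for seq in sequences:
            padded = ["<s>"] + list(seq) + ["</s>"]
            self.vocab.update(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, seq, vocab_size):
        """Log-probability of a token sequence under this model."""
        padded = ["<s>"] + list(seq) + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def train_models(labeled_sequences):
    """Train one model per dialect; share one vocabulary size for smoothing."""
    models = {label: BigramLM(seqs) for label, seqs in labeled_sequences.items()}
    shared_vocab = set().union(*(m.vocab for m in models.values()))
    return models, len(shared_vocab) + 1  # +1 reserves mass for unseen tokens


def identify(seq, models, vocab_size):
    """Return the dialect label whose model gives the highest score."""
    return max(models, key=lambda label: models[label].log_prob(seq, vocab_size))


# Toy, made-up token sequences standing in for BPE output (assumption).
toy_data = {
    "dialect_a": [["ab", "cd", "ef"], ["ab", "cd", "gh"]],
    "dialect_b": [["xy", "cd", "zz"], ["xy", "zz", "ef"]],
}
models, vocab_size = train_models(toy_data)
predicted = identify(["ab", "cd"], models, vocab_size)
```

Sharing a single vocabulary size across the per-dialect models keeps the smoothed scores comparable, so the comparison in `identify` is not skewed by one dialect having a smaller training vocabulary.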
Language
- English
Publication Year
- 2018
License statement
© The author. The author has granted HBKU and Qatar Foundation a non-exclusive, worldwide, perpetual, irrevocable, royalty-free license to reproduce, display and distribute the manuscript in whole or in part in any form to be posted in digital or print format and made available to the public at no charge. Unless otherwise specified in the copyright statement or the metadata, all rights are reserved by the copyright holder. For permission to reuse content, please contact the author.
Institution affiliated with
- Hamad Bin Khalifa University
- College of Science and Engineering - HBKU
Degree Date
- 2018
Degree Type
- Master's