Manara - Qatar Research Repository

Neural POS tagging of Shahmukhi by using contextualized word representations

journal contribution
submitted on 2024-08-14, 05:25 and posted on 2024-08-14, 05:26 authored by Amina Tehseen, Toqeer Ehsan, Hannan Bin Liaqat, Amjad Ali, Ala Al-Fuqaha

Part-of-speech (POS) tagging is a preliminary step in building natural language processing applications. This paper presents the development and evaluation of the first POS-tagged corpus at this scale for Shahmukhi (Western Punjabi), along with a POS tagger based on a bidirectional long short-term memory (BiLSTM) network. A balanced corpus of 0.13 million words, drawn from 14 different text domains, has been annotated. A Shahmukhi POS tagset has been devised by studying the applicability of the CLE Urdu POS tagset, and tagging guidelines have been designed for annotation. A multi-step evaluation process has been applied to the tagged corpus, including grammar-based and n-gram-based consistency evaluations. The average inter-annotator agreement across all domains is 95.35%, with an average Kappa coefficient of 0.94. The performance of the BiLSTM POS tagger has been compared with the well-known language-independent TreeTagger and the Stanford POS tagger. The accuracy of the tagger has been further improved through transfer learning, by training context-free (Word2Vec) and contextualized (ELMo) word representations on a corpus of 14.9 million Shahmukhi words collected from the World Wide Web. The tagger achieved an F-score of 96.11 and an accuracy of 96.12%. For a morphologically rich, low-resource language, these POS tagging results are quite promising.
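The abstract reports inter-annotator agreement together with a Kappa coefficient. As a rough illustration of how two such figures relate, the sketch below computes raw agreement and Cohen's kappa for two annotators over the same tokens. It assumes the two-annotator (Cohen) form of kappa, which the abstract does not specify; the function name and the tags "NN", "VB", and "JJ" are hypothetical and are not taken from the Shahmukhi tagset.

```python
from collections import Counter


def agreement_and_kappa(tags_a, tags_b):
    """Return (observed agreement, Cohen's kappa) for two tag sequences.

    Assumption: both annotators tagged the same tokens in the same order.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and of equal length")

    n = len(tags_a)

    # Observed agreement: fraction of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: probability that both annotators pick the same tag
    # if each draws independently from their own tag distribution.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag throughout;
        # kappa is undefined (0/0), so report perfect agreement by convention.
        return observed, 1.0

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical tags, for illustration only.
    annotator_1 = ["NN", "VB", "NN", "JJ"]
    annotator_2 = ["NN", "VB", "JJ", "JJ"]
    agreement, kappa = agreement_and_kappa(annotator_1, annotator_2)
    print(f"agreement={agreement:.2%}, kappa={kappa:.3f}")
```

Kappa discounts the agreement that would be expected by chance, which is why it is usually lower than the raw agreement percentage when a few tags dominate the data.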

Other Information

Published in: Journal of King Saud University - Computer and Information Sciences
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
See article on publisher's website: https://dx.doi.org/10.1016/j.jksuci.2022.12.004

Funding

Open Access funding provided by the Qatar National Library.

Language

  • English

Publisher

Elsevier

Publication Year

  • 2023

License statement

This Item is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Institution affiliated with

  • Hamad Bin Khalifa University
  • College of Science and Engineering - HBKU
