Manara - Qatar Research Repository

Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining

journal contribution
Submitted on 2024-09-19, 06:46 and posted on 2024-09-19, 06:46. Authored by Hassan Sajjad, Helmut Schmid, Alexander Fraser, Hinrich Schütze.

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
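
The interpolation and posterior-based disambiguation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each word pair already has a score from a transliteration sub-model and from a non-transliteration sub-model, holds both scores fixed, and uses EM only to re-estimate the interpolation weight (here called `lam`, the prior probability of noise). In the paper the transliteration sub-model itself is also learned during EM, which this sketch omits. All function names, parameter names, and the 0.5 threshold are hypothetical.

```python
def em_mixture_weight(p_translit, p_noise, lam=0.5, iterations=50):
    """Estimate the noise prior `lam` by EM for the two-component mixture
    p(pair) = (1 - lam) * p_translit(pair) + lam * p_noise(pair).

    Simplifying assumption: both sub-model scores are fixed inputs, and at
    least one of the two scores is non-zero for every pair.
    """
    for _ in range(iterations):
        # E-step: posterior probability that each pair was generated as noise.
        post_noise = [
            lam * pn / ((1.0 - lam) * pt + lam * pn)
            for pt, pn in zip(p_translit, p_noise)
        ]
        # M-step: the new prior is the average noise posterior.
        lam = sum(post_noise) / len(post_noise)
    return lam


def posterior_translit(pt, pn, lam):
    """Posterior probability that a pair is a transliteration."""
    return (1.0 - lam) * pt / ((1.0 - lam) * pt + lam * pn)


def mine(pairs, p_translit, p_noise, lam, threshold=0.5):
    """Keep the pairs whose transliteration posterior exceeds the threshold."""
    return [
        pair
        for pair, pt, pn in zip(pairs, p_translit, p_noise)
        if posterior_translit(pt, pn, lam) > threshold
    ]
```

Given made-up scores for a handful of pairs, `em_mixture_weight` returns the estimated noise prior, and `mine` returns the pairs that the posterior assigns to the transliteration sub-model. Comparing the two posteriors, rather than thresholding the transliteration score alone, is what lets the mixture absorb noisy pairs without labeled data.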

Other Information

Published in: Computational Linguistics
License: https://creativecommons.org/licenses/by-nc-nd/4.0/ 
See article on publisher's website: https://dx.doi.org/10.1162/coli_a_00286

Funding

Open Access funding provided by the Qatar National Library.

Language

  • English

Publisher

MIT Press

Publication Year

  • 2017

License statement

This Item is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Institution affiliated with

  • Hamad Bin Khalifa University
  • Qatar Computing Research Institute - HBKU
