Manara - Qatar Research Repository
Browse

An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification

Download (603.33 kB)
journal contribution
submitted on 2024-09-22, 09:11 and posted on 2024-09-22, 09:12 authored by Cristina Espana-Bonet, Adam Csaba Varga, Alberto Barron-Cedeno, Josef van Genabith

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, specially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words-or sentences-which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F 1 =98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures F 1 reaches 98.9%.

Other Information

Published in: IEEE Journal of Selected Topics in Signal Processing
License: https://creativecommons.org/licenses/by/4.0/
See article on publisher's website: https://dx.doi.org/10.1109/jstsp.2017.2764273

Funding

Open Access funding provided by the Qatar National Library.

History

Language

  • English

Publisher

IEEE

Publication Year

  • 2017

License statement

This Item is licensed under the Creative Commons Attribution 4.0 International License.

Institution affiliated with

  • Hamad Bin Khalifa University
  • Qatar Computing Research Institute - HBKU

Usage metrics

    Qatar Computing Research Institute - HBKU

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC