On the Limitations of Unsupervised Bilingual Dictionary Induction

Anders Søgaard, Sebastian Ruder, Ivan Vulić

    59 Citations (Scopus)

    Abstract

    Unsupervised machine translation (i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora) seems impossible, but nevertheless, Lample et al. (2018a) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.
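
    The weak supervision signal from identical words mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `identical_word_seeds`, the `min_len` parameter, and the toy vocabularies are hypothetical, and the sketch only covers collecting the seed pairs, not the alignment of embedding spaces that would consume them.

```python
def identical_word_seeds(src_vocab, tgt_vocab, min_len=1):
    """Collect seed translation pairs from identically spelled words.

    Hypothetical helper for illustration: any string that occurs in both
    vocabularies is treated as a translation of itself.
    """
    tgt_words = set(tgt_vocab)
    seen = set()
    seeds = []
    for word in src_vocab:
        # Keep source-vocabulary order and skip duplicates.
        if word in tgt_words and word not in seen and len(word) >= min_len:
            seen.add(word)
            seeds.append((word, word))
    return seeds


if __name__ == "__main__":
    # Toy vocabularies, invented for this example.
    english = ["the", "hotel", "taxi", "dog", "berlin"]
    german = ["der", "hotel", "taxi", "hund", "berlin"]
    print(identical_word_seeds(english, german))
```

    Such pairs are noisy (false friends and coincidental string matches are included), which is why the abstract describes the signal as weak; `min_len` is one assumed way to drop very short strings.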
    Original language: English
    Title: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long papers)
    Publisher: Association for Computational Linguistics
    Publication date: 2018
    Pages: 778–788
    Status: Published - 2018
    Event: 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations - Melbourne, Australia
    Duration: 15 Jul 2018 to 20 Jul 2018

    Conference

    Conference: 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations
    Country/Territory: Australia
    City: Melbourne
    Period: 15/07/2018 to 20/07/2018
