On the Limitations of Unsupervised Bilingual Dictionary Induction

Anders Søgaard; Sebastian Ruder; Ivan Vulic

59 Citations (Scopus)

Abstract

Unsupervised machine translation (i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora) seems impossible, but nevertheless, Lample et al. (2018a) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.

Original language	English
Title	Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: (Long papers)
Publisher	Association for Computational Linguistics
Publication date	2018
Pages	778–788
Status	Published - 2018
Event	56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations - Melbourne, Australia; Duration: 15 Jul 2018 → 20 Jul 2018

Conference

Conference	56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations
Country/Territory	Australia
City	Melbourne
Period	15/07/2018 → 20/07/2018

Citation formats

On the Limitations of Unsupervised Bilingual Dictionary Induction. / Søgaard, Anders; Ruder, Sebastian; Vulic, Ivan.
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: (Long papers). Association for Computational Linguistics, 2018. pp. 778–788.

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Søgaard, A, Ruder, S & Vulic, I 2018, On the Limitations of Unsupervised Bilingual Dictionary Induction. in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: (Long papers). Association for Computational Linguistics, pp. 778–788, 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations, Melbourne, Australia, 15/07/2018.

@inproceedings{f82497c434ec4f1381db65541dd85098,

title = "On the Limitations of Unsupervised Bilingual Dictionary Induction",

abstract = "Unsupervised machine translation (i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora) seems impossible, but nevertheless, Lample et al. (2018a) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.",

author = "Anders S{\o}gaard and Sebastian Ruder and Ivan Vulic",

year = "2018",

language = "English",

pages = "778–788",

booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics",

publisher = "Association for Computational Linguistics",

note = " 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations ; Conference date: 15-07-2018 Through 20-07-2018",

}

TY - GEN

T1 - On the Limitations of Unsupervised Bilingual Dictionary Induction

AU - Søgaard, Anders

AU - Ruder, Sebastian

AU - Vulic, Ivan

PY - 2018

Y1 - 2018

N2 - Unsupervised machine translation (i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora) seems impossible, but nevertheless, Lample et al. (2018a) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.

AB - Unsupervised machine translation (i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora) seems impossible, but nevertheless, Lample et al. (2018a) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.

M3 - Article in proceedings

SP - 778

EP - 788

BT - Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

PB - Association for Computational Linguistics

T2 - 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations

Y2 - 15 July 2018 through 20 July 2018

ER -