Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing

Maria Jung Barrett; Ana Valeria Gonzalez; Lea Frermann; Anders Søgaard

doi:10.18653/v1/n18-1184

Abstract

When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on OntoNotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On the Penn Treebank, the best model achieves a 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.

Original language	English
Title of host publication	Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers)
Editors	Silvio Ricardo Cordeiro, Shereen Oraby, Umashanthi Pavalanathan, Kyeongmin Rim
Number of pages	11
Volume	1
Publisher	Association for Computational Linguistics
Publication date	2018
Pages	2028-2038
ISBN (Print)	978-1-948087-27-8
DOIs	https://doi.org/10.18653/v1/n18-1184
Publication status	Published - 2018
Event	16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - New Orleans, United States; Duration: 1 Jun 2018 → 6 Jun 2018

Conference

Conference	16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Country/Territory	United States
City	New Orleans
Period	01/06/2018 → 06/06/2018

Access to Document

10.18653/v1/n18-1184

N18-1184.pdf (final published version, 275 KB; licence: CC BY)

http://www.aclweb.org/anthology/N18-1184 (licence: GNU LGPL)
https://www.aclweb.org/anthology/N18-1184.pdf (licence: CC BY)

Cite this

Barrett, M. J., Gonzalez, A. V., Frermann, L., & Søgaard, A. (2018). Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. In S. R. Cordeiro, S. Oraby, U. Pavalanathan, & K. Rim (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers) (Vol. 1, pp. 2028-2038). Association for Computational Linguistics. https://doi.org/10.18653/v1/n18-1184

Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. / Barrett, Maria Jung; Gonzalez, Ana Valeria; Frermann, Lea et al.
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). ed. / Silvio Ricardo Cordeiro; Shereen Oraby; Umashanthi Pavalanathan; Kyeongmin Rim. Vol. 1. Association for Computational Linguistics, 2018. p. 2028-2038.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Barrett, MJ, Gonzalez, AV, Frermann, L & Søgaard, A 2018, Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. in SR Cordeiro, S Oraby, U Pavalanathan & K Rim (eds), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). vol. 1, Association for Computational Linguistics, pp. 2028-2038, 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, United States, 01/06/2018. https://doi.org/10.18653/v1/n18-1184

Barrett MJ, Gonzalez AV, Frermann L, Søgaard A. Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. In Cordeiro SR, Oraby S, Pavalanathan U, Rim K, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). Vol. 1. Association for Computational Linguistics. 2018. p. 2028-2038. doi: 10.18653/v1/n18-1184

Barrett, Maria Jung; Gonzalez, Ana Valeria; Frermann, Lea et al. / Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). editor / Silvio Ricardo Cordeiro; Shereen Oraby; Umashanthi Pavalanathan; Kyeongmin Rim. Vol. 1. Association for Computational Linguistics, 2018. pp. 2028-2038.

@inproceedings{2f629969d6ca4f4dada21de93ed6b658,

title = "Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing",

abstract = "When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on OntoNotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On the Penn Treebank, the best model achieves a 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.",

author = "Barrett, {Maria Jung} and Gonzalez, {Ana Valeria} and Lea Frermann and Anders S{\o}gaard",

year = "2018",

doi = "10.18653/v1/n18-1184",

language = "English",

isbn = "978-1-948087-27-8",

volume = "1",

pages = "2028--2038",

editor = "Cordeiro, {Silvio Ricardo} and Oraby, {Shereen} and Pavalanathan, {Umashanthi} and Kyeongmin Rim",

booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)",

publisher = "Association for Computational Linguistics",

note = "16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2018 ; Conference date: 01-06-2018 Through 06-06-2018",

}

TY - GEN

T1 - Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing

AU - Barrett, Maria Jung

AU - Gonzalez, Ana Valeria

AU - Frermann, Lea

AU - Søgaard, Anders

PY - 2018

Y1 - 2018

N2 - When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on OntoNotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On the Penn Treebank, the best model achieves a 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.

U2 - 10.18653/v1/n18-1184

DO - 10.18653/v1/n18-1184

M3 - Article in proceedings

SN - 978-1-948087-27-8

VL - 1

SP - 2028

EP - 2038

BT - Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

A2 - Cordeiro, Silvio Ricardo

A2 - Oraby, Shereen

A2 - Pavalanathan, Umashanthi

A2 - Rim, Kyeongmin

PB - Association for Computational Linguistics

T2 - 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Y2 - 1 June 2018 through 6 June 2018

ER -
