Abstract
Several works in Natural Language Processing have recently looked into part-of-speech (POS) annotation of Twitter data, typically using their own data sets. Since conventions on Twitter change rapidly, models often show sample bias. Training on a combination of the existing data sets should help overcome this bias and produce more robust models than any trained on the individual corpora. Unfortunately, combining the existing corpora proves difficult: many of the corpora use proprietary tag sets that have little or no overlap. Even when mapped to a common tag set, the different corpora systematically differ in their treatment of various tags and tokens. This includes both preprocessing decisions and default labels for frequent tokens, thus exhibiting data bias and label bias, respectively. Only if we address these biases can we combine the existing data sets to also overcome sample bias. We present a systematic study of several Twitter POS data sets and the problems of label and data bias, discuss their effects on model performance, and show how to overcome them to learn models that perform well on various test sets, achieving a relative error reduction of up to 21%.
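To make the tag-set mapping step concrete, the following is a minimal sketch of how corpus-specific tags might be remapped onto a shared coarse tag set before concatenating corpora for training. The mapping tables, tag names, and the fallback behaviour for unseen tags are purely illustrative assumptions, not the actual tag sets or procedure used in the paper; the fallback also shows one place where label bias can silently enter.

```python
# Hypothetical sketch: remap corpus-specific POS tags onto a shared
# coarse tag set so several Twitter corpora can be combined for training.
# Tag names below are illustrative only.

# Corpus-specific tag -> coarse tag (e.g. NOUN, VERB, X, ...)
CORPUS_A_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "VB": "VERB", "VBD": "VERB",
    "USR": "X", "HT": "X", "URL": "X",
}
CORPUS_B_TO_COARSE = {
    "N": "NOUN", "V": "VERB", "@": "X", "#": "X", "U": "X",
}

def remap(sentences, mapping, default="X"):
    """Map each (token, tag) pair onto the coarse tag set.

    Tags missing from the mapping fall back to `default`; such silent
    defaults are a potential source of label bias, so they are reported.
    """
    unmapped = set()
    remapped = []
    for sent in sentences:
        new_sent = []
        for token, tag in sent:
            coarse = mapping.get(tag)
            if coarse is None:
                unmapped.add(tag)
                coarse = default
            new_sent.append((token, coarse))
        remapped.append(new_sent)
    return remapped, unmapped

if __name__ == "__main__":
    corpus_a_like = [[("yolo", "UH"), ("@user", "USR"), ("lol", "UH")]]
    combined, missing = remap(corpus_a_like, CORPUS_A_TO_COARSE)
    print(combined)
    print("tags without a mapping:", missing)
```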
Original language | English |
---|---|
Title | Proceedings of the 9th International Conference on Language Resources and Evaluation: LREC 2014 |
Place of publication | Reykjavik, Iceland |
Publisher | European Language Resources Association |
Publication date | 2014 |
ISBN (electronic) | 978-2-9517408-8-4 |
Status | Published - 2014 |