Abstract
We present two new NER datasets for Twitter: a manually annotated set of 1,467 tweets (κ = 0.942) and a set of 2,975 expert-corrected, crowdsourced NER annotated tweets from the dataset described in Finin et al. (2010). In our experiments with these datasets, we observe two important points: (a) language drift on Twitter is significant, and while off-the-shelf systems have been reported to perform well on in-sample data, they often perform poorly on new samples of tweets; (b) state-of-the-art performance across various datasets can be obtained from crowdsourced annotations, making it more feasible to "catch up" with language drift.
Original language | English |
---|---|
Title | Proceedings of the 9th International Conference on Language Resources and Evaluation: LREC2014 |
Publisher | European Language Resources Association |
Publication date | 2014 |
Status | Published - 2014 |