EMNLP@CPH: Is frequency all there is to simplicity?

Anders Trærup Johannsen; Hector Martinez Alonso; Sigrid Klerke; Anders Søgaard

CLOSED: Center for Sprogteknologi

5 Citations (Scopus)

Abstract

Our system breaks down the problem of ranking a list of lexical substitutions by how simple they are in a given context into a series of pairwise comparisons between candidates, for which we learn a binary classifier. Because very little training data is provided, we describe a procedure for generating artificial unlabeled data from WordNet and a corpus, and we approach the classification task as a semi-supervised machine learning problem. We use a co-training procedure that lets each classifier increase the other classifier's training set with selected instances from an unlabeled data set. Our features include n-gram probabilities of the candidate and context in a web corpus, distributional differences of the candidate between a corpus of "easy" sentences and a corpus of normal sentences, syntactic complexity of documents that are similar to the given context, candidate length, and letter-wise recognizability of the candidate as measured by a trigram character language model.
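The reduction from ranking to pairwise comparisons described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how pairwise decisions are combined into a ranking, so counting wins is an assumption made here. The function names rank_by_pairwise_wins and shorter_is_simpler are hypothetical, and the length-based comparator only stands in for the learned binary classifier (candidate length is one of the listed features).

```python
from itertools import combinations
from typing import Callable, List, Sequence


def rank_by_pairwise_wins(
    candidates: Sequence[str],
    is_simpler: Callable[[str, str], bool],
) -> List[str]:
    """Order candidates from simplest to least simple by counting pairwise wins.

    Assumes the candidates are unique strings. Aggregating by win count is an
    illustrative choice, not something stated in the abstract.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        # One binary decision per unordered pair of candidates.
        if is_simpler(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    # sorted() is stable, so tied candidates keep their input order.
    return sorted(candidates, key=lambda c: -wins[c])


def shorter_is_simpler(a: str, b: str) -> bool:
    # Hypothetical stand-in for the learned classifier: uses length only.
    return len(a) <= len(b)


ranking = rank_by_pairwise_wins(["utilize", "use", "employ"], shorter_is_simpler)
print(ranking)
```

In a real system, is_simpler would wrap a trained classifier that also sees the context and the other features named above; the ranking step itself would stay the same.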

Original language	English
Title	Proceedings of SemEval-2012, 1st Joint Conference on Lexical and Computational Semantics
Publisher	Association for Computational Linguistics
Publication date	2012
Status	Published - 2012

Citation formats

EMNLP@CPH: Is frequency all there is to simplicity? / Johannsen, Anders Trærup; Martinez Alonso, Hector; Klerke, Sigrid et al.

Proceedings of SemEval-2012, 1st Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 2012.

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

@inproceedings{2dc2e56a5b4849d9a2c5fbbe00e7bbac,

title = "EMNLP@CPH: Is frequency all there is to simplicity?",

abstract = "Our system breaks down the problem of ranking a list of lexical substitutions by how simple they are in a given context into a series of pairwise comparisons between candidates, for which we learn a binary classifier. Because very little training data is provided, we describe a procedure for generating artificial unlabeled data from WordNet and a corpus, and we approach the classification task as a semi-supervised machine learning problem. We use a co-training procedure that lets each classifier increase the other classifier's training set with selected instances from an unlabeled data set. Our features include n-gram probabilities of the candidate and context in a web corpus, distributional differences of the candidate between a corpus of {"}easy{"} sentences and a corpus of normal sentences, syntactic complexity of documents that are similar to the given context, candidate length, and letter-wise recognizability of the candidate as measured by a trigram character language model.",

author = "Johannsen, {Anders Tr{\ae}rup} and {Martinez Alonso}, Hector and Sigrid Klerke and Anders S{\o}gaard",

year = "2012",

language = "English",

booktitle = "Proceedings of SemEval-2012, 1st Joint Conference on Lexical and Computational Semantics",