EMNLP@CPH: Is frequency all there is to simplicity?

Anders Trærup Johannsen, Hector Martinez Alonso, Sigrid Klerke, Anders Søgaard

5 Citations (Scopus)

Abstract

Our system breaks down the problem of ranking a list of lexical substitutions by how simple they are in a given context into a series of pairwise comparisons between candidates, for which we learn a binary classifier. Because very little training data is provided, we describe a procedure for generating artificial unlabeled data from WordNet and a corpus, and we approach the classification task as a semi-supervised machine learning problem. We use a co-training procedure that lets each classifier extend the other classifier's training set with selected instances from an unlabeled data set. Our features include n-gram probabilities of the candidate and its context in a web corpus, distributional differences of the candidate between a corpus of "easy" sentences and a corpus of normal sentences, the syntactic complexity of documents that are similar to the given context, candidate length, and the letter-wise recognizability of the candidate as measured by a trigram character language model.
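The reduction from ranking to pairwise comparisons can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the aggregation by counting pairwise wins is an assumption (the abstract does not say how comparisons are combined into a ranking), and `shorter_is_simpler` is a toy stand-in for the learned binary classifier that uses only candidate length and ignores context.

```python
from itertools import combinations


def rank_by_pairwise_wins(candidates, is_simpler):
    """Order distinct candidates by the number of pairwise comparisons they win.

    `is_simpler(a, b)` should return True when `a` is judged simpler than `b`.
    Counting wins is an assumed aggregation scheme, chosen for illustration.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        if is_simpler(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    # Most wins first; sorted() is stable, so ties keep their input order.
    return sorted(candidates, key=lambda c: -wins[c])


def shorter_is_simpler(a, b):
    # Hypothetical stand-in for the learned binary classifier:
    # it looks only at candidate length, one of the features listed above.
    return len(a) <= len(b)


ranking = rank_by_pairwise_wins(
    ["intelligent", "bright", "clever", "smart"], shorter_is_simpler
)
print(ranking)
```

In a real system the comparator would be a trained classifier over the full feature set, and it would also take the sentence context as input.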

Original language: English
Title: Proceedings of SemEval-2012, 1st Joint Conference on Lexical and Computational Semantics
Publisher: Association for Computational Linguistics
Publication date: 2012
Status: Published - 2012
