Abstract
Our system decomposes the problem of ranking a list of lexical substitutions by their simplicity in a given context into a series of pairwise comparisons between candidates, for which we learn a binary classifier. Because very little training data is provided, we describe a procedure for generating artificial unlabeled data from WordNet and a corpus, and we approach the classification task as a semi-supervised machine learning problem. We use a co-training procedure in which each classifier enlarges the other classifier's training set with selected instances from an unlabeled data set. Our features include n-gram probabilities of the candidate and its context in a web corpus, distributional differences of the candidate between a corpus of "easy" sentences and a corpus of normal sentences, the syntactic complexity of documents similar to the given context, candidate length, and the letter-wise recognizability of the candidate as measured by a character trigram language model.
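The pairwise decomposition can be illustrated with a minimal sketch. The abstract does not say how the pairwise decisions are combined into a final ranking, so counting wins per candidate is an assumption here, and the names `rank_by_pairwise_wins` and `shorter_is_simpler` are hypothetical. The length-based comparator is only a stand-in for the learned binary classifier, and candidates are assumed to be distinct strings.

```python
from itertools import combinations
from typing import Callable, List

# A pairwise decision function: True if `a` is judged simpler than `b` in `context`.
PairwiseClassifier = Callable[[str, str, str], bool]


def rank_by_pairwise_wins(
    candidates: List[str],
    is_simpler: PairwiseClassifier,
    context: str,
) -> List[str]:
    """Rank candidates from simplest to hardest by counting pairwise wins.

    Assumption: win counting is one plausible aggregation scheme; it is
    not stated in the abstract.
    """
    wins = {candidate: 0 for candidate in candidates}
    # Compare every unordered pair exactly once.
    for a, b in combinations(candidates, 2):
        if is_simpler(a, b, context):
            wins[a] += 1
        else:
            wins[b] += 1
    # sorted() is stable, so ties keep their input order.
    return sorted(candidates, key=lambda c: -wins[c])


def shorter_is_simpler(a: str, b: str, context: str) -> bool:
    """Hypothetical stand-in for the trained classifier: uses length only."""
    return len(a) <= len(b)


ranking = rank_by_pairwise_wins(
    ["utilize", "use", "employ"],
    shorter_is_simpler,
    "We ___ a binary classifier.",
)
print(ranking)
```

In a real system the comparator would be replaced by a classifier trained on the features listed above; the ranking function itself would not need to change.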
Original language | English |
---|---|
Title | Proceedings of SemEval-2012, 1st Joint Conference on Lexical and Computational Semantics |
Publisher | Association for Computational Linguistics |
Publication date | 2012 |
Status | Published - 2012 |