Abstract
Most NLP systems use tokenization as part of preprocessing. Tokenizers are generally based on simple heuristics and do not recognize multi-word units (MWUs) such as "hot dog" or "black hole" unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because, unlike other work on MWUs, tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
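
To make the baseline concrete, the following is a minimal sketch of a heuristic tokenizer that recognizes MWUs only when they appear in a precompiled list. It is an illustration of the limitation described above, not the cascaded model proposed in the paper; the function name `tokenize`, the `mwus` set, and the `max_len` cutoff are assumptions introduced for this example.

```python
import re


def tokenize(text, mwus=frozenset(), max_len=4):
    """Split text into tokens, merging any multi-word unit found in `mwus`.

    Hypothetical baseline: greedy longest-match lookup against a
    precompiled list. This is NOT the paper's cascaded classifier.
    """
    # Simple heuristic: lowercase, then split into word runs and punctuation marks.
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate span first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in mwus:
                tokens.append(candidate)
                i += n
                break
        else:
            # No listed MWU starts here; emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


sentence = "A black hole is not a hot dog."
print(tokenize(sentence))                             # no list: every word is its own token
print(tokenize(sentence, {"black hole", "hot dog"}))  # listed MWUs become single tokens
```

Without the list, "black hole" is split into two tokens; with it, the unit is kept whole. Any MWU missing from the list is still split, which is the gap an automatic recognizer is meant to close.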
Original language | English |
---|---|
Title of host publication | EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference |
Number of pages | 11 |
Publication date | 1 Jan 2011 |
Pages | 793-803 |
ISBN (Print) | 9781937284114 |
Publication status | Published - 1 Jan 2011 |
Externally published | Yes |