Abstract
Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) such as "hot dog" or "black hole" unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because tokenization, unlike other work on MWUs, requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
Original language | English |
---|---|
Title | EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference |
Number of pages | 11 |
Publication date | 1 Jan 2011 |
Pages | 793-803 |
ISBN (Print) | 9781937284114 |
Status | Published - 1 Jan 2011 |
Published externally | Yes |