Abstract
Most NLP tools are applied to text that differs from the kind of text they were evaluated on. Common evaluation practice prescribes significance testing across data points in available test data, but typically we only have a single test sample. This short paper argues that in order to assess the robustness of NLP tools we need to evaluate them on diverse samples, and we consider the problem of finding the most appropriate way to estimate the true effect size of our systems over their baselines across datasets. We apply meta-analysis and show experimentally, by comparing estimated error reduction to observed error reduction on held-out datasets, that this method is significantly more predictive of success than the usual practice of using macro- or micro-averages. Finally, we present a new parametric meta-analysis based on non-standard assumptions that seems superior to standard parametric meta-analysis.
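The abstract does not spell out the estimator used in the paper, so the sketch below should not be read as the authors' method; it only illustrates the general idea of meta-analytic pooling that the abstract contrasts with macro- and micro-averaging. It applies a standard random-effects (DerSimonian-Laird) pooling of per-dataset error reductions, weighting each dataset by the inverse of its (within- plus between-dataset) variance rather than averaging uniformly; the effect sizes and variances in the example are invented for illustration.

```python
def dersimonian_laird(effects, variances):
    """Pool per-dataset effect sizes with a random-effects model.

    effects:   per-dataset effect estimates (e.g. error reductions over a baseline)
    variances: within-dataset sampling variances of those estimates
    Returns the pooled effect and the between-dataset variance tau^2.
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate
    w = [1.0 / v for v in variances]
    theta_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q heterogeneity statistic
    q = sum(wi * (yi - theta_fe) ** 2 for wi, yi in zip(w, effects))
    # DerSimonian-Laird estimate of between-dataset variance
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights incorporate the between-dataset variance
    w_re = [1.0 / (v + tau2) for v in variances]
    theta_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return theta_re, tau2


# Hypothetical error reductions on four datasets and their variances
effects = [0.12, 0.05, 0.20, 0.08]
variances = [0.004, 0.002, 0.010, 0.003]

pooled, tau2 = dersimonian_laird(effects, variances)
macro_average = sum(effects) / len(effects)
print(f"macro-average:                  {macro_average:.3f}")
print(f"random-effects pooled estimate: {pooled:.3f} (tau^2 = {tau2:.4f})")
```

Compared with the macro-average, the pooled estimate down-weights datasets whose error-reduction estimates are noisy, which is the kind of cross-dataset effect-size estimation the paper evaluates against held-out datasets.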
| Original language | English |
| --- | --- |
| Title of host publication | The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL |
| Publisher | Association for Computational Linguistics |
| Publication date | 2013 |
| Pages | 607-611 |
| ISBN (Electronic) | 978-1-937284-47-3 |
| Publication status | Published - 2013 |