Estimating effect size across datasets


Abstract

Most NLP tools are applied to text that is different from the kind of text they were evaluated on. Common evaluation practice prescribes significance testing across data points in available test data, but typically we only have a single test sample. This short paper argues that in order to assess the robustness of NLP tools we need to evaluate them on diverse samples, and we consider the problem of finding the most appropriate way to estimate the true effect size of our systems over their baselines across datasets. We apply meta-analysis and show experimentally, by comparing estimated error reduction with observed error reduction on held-out datasets, that this method is significantly more predictive of success than the usual practice of using macro- or micro-averages. Finally, we present a new parametric meta-analysis based on non-standard assumptions that seems superior to standard parametric meta-analysis.
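
To make the contrast concrete, below is a minimal Python sketch of estimating an overall effect size across datasets with a random-effects meta-analysis (inverse-variance weighting with a DerSimonian-Laird between-dataset variance term) versus a plain macro-average. The toy numbers, variable names, and the particular estimator are illustrative assumptions, not necessarily the exact procedure used in the paper.

```python
# Sketch: pooled effect size across datasets via random-effects meta-analysis,
# contrasted with an unweighted macro-average. Illustrative assumptions only.
import numpy as np

# Hypothetical per-dataset error reductions of a system over its baseline,
# with their standard errors (one entry per evaluation dataset).
effects = np.array([0.12, 0.05, 0.20, 0.03])   # observed effect sizes
ses     = np.array([0.04, 0.02, 0.08, 0.03])   # standard errors

def macro_average(y):
    """Unweighted mean across datasets (the common practice)."""
    return y.mean()

def random_effects_estimate(y, se):
    """Inverse-variance weighted estimate with a DerSimonian-Laird
    between-dataset variance component (tau^2)."""
    w = 1.0 / se**2                       # fixed-effect weights
    fixed = np.sum(w * y) / np.sum(w)     # fixed-effect pooled estimate
    q = np.sum(w * (y - fixed)**2)        # Cochran's Q heterogeneity statistic
    df = len(y) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)         # between-dataset variance estimate
    w_star = 1.0 / (se**2 + tau2)         # random-effects weights
    return np.sum(w_star * y) / np.sum(w_star)

print("macro-average:       %.3f" % macro_average(effects))
print("random-effects (DL): %.3f" % random_effects_estimate(effects, ses))
```

Precisely measured datasets pull the pooled estimate toward their effect sizes, while the tau^2 term keeps any single dataset from dominating; a macro-average ignores both sources of information.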

Original language: English
Title of host publication: The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL
Publisher: Association for Computational Linguistics
Publication date: 2013
Pages: 607-611
ISBN (Electronic): 978-1-937284-47-3
Publication status: Published - 2013
