Abstract
In NLP, we need to document that our proposed methods perform significantly better than previous approaches with respect to standard metrics, typically by reporting p-values obtained from rank- or randomization-based tests. We show that significance results following current research standards are unreliable and, in addition, very sensitive to sample size, to covariates such as sentence length, and to the use of multiple metrics. We estimate that, under the assumption of perfect metrics and unbiased data, we need a significance cut-off at ~0.0025 to reduce the risk of false positive results to <5%. Since in practice we often have considerable selection bias and poor metrics, however, a stricter cut-off alone will not suffice.
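As a rough illustration of the kind of randomization-based test the abstract refers to, below is a minimal sketch of a paired approximate randomization (sign-flip) test comparing two systems' per-sentence scores. The function name, the example scores, and the number of shuffles are illustrative assumptions, not taken from the paper.

```python
"""Minimal sketch of a paired approximate randomization test, one of the
randomization-based significance tests mentioned in the abstract.
All names and data here are illustrative assumptions."""
import random


def approximate_randomization_test(scores_a, scores_b, num_shuffles=10_000, seed=0):
    """Two-sided p-value for the difference in mean metric score between
    system A and system B, evaluated on the same sentences."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        # Randomly swap the two systems' scores on each sentence, which is
        # valid under the null hypothesis that the systems are interchangeable.
        swapped_a, swapped_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(sum(swapped_a) / len(swapped_a) - sum(swapped_b) / len(swapped_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (num_shuffles + 1)


if __name__ == "__main__":
    # Hypothetical per-sentence scores for two systems on the same test set.
    system_a = [0.71, 0.64, 0.80, 0.55, 0.68, 0.73, 0.60, 0.77]
    system_b = [0.69, 0.61, 0.78, 0.57, 0.65, 0.70, 0.59, 0.74]
    p = approximate_randomization_test(system_a, system_b)
    # With the stricter cut-off discussed in the abstract, significance would
    # require p < 0.0025 rather than the conventional p < 0.05.
    print(f"p = {p:.4f}, significant at the 0.0025 cut-off: {p < 0.0025}")
```

The stricter 0.0025 cut-off in the abstract only changes the threshold against which such a p-value is compared; the test procedure itself is unchanged.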
| Field | Value |
|---|---|
| Original language | English |
| Title of host publication | Eighteenth Conference on Computational Natural Language Learning: CoNLL-2014 |
| Place of publication | Baltimore, Maryland, USA |
| Publisher | Association for Computational Linguistics |
| Publication date | 2014 |
| Pages | 1-10 |
| Publication status | Published - 2014 |