What’s in a p-value in NLP?

Anders Søgaard, Anders Trærup Johannsen, Barbara Plank, Dirk Hovy, Hector Martinez Alonso


Abstract

In NLP, we need to document that our proposed methods perform significantly better with respect to standard metrics than previous approaches, typically by reporting p-values obtained by rank- or randomization-based tests. We show that significance results following current research standards are unreliable and, in addition, very sensitive to sample size, covariates such as sentence length, as well as to the existence of multiple metrics. We estimate that under the assumption of perfect metrics and unbiased data, we need a significance cut-off at ~0.0025 to reduce the risk of false positive results to <5%. Since in practice we often have considerable selection bias and poor metrics, this, however, will not do alone.
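The randomization-based tests the abstract refers to can be illustrated with a paired permutation (approximate randomization) test over per-sentence scores. The function below is a minimal sketch, not the authors' implementation; the function name, trial count, and smoothing are illustrative choices.

```python
import random

def paired_permutation_test(scores_a, scores_b, trials=10_000, seed=0):
    """Approximate randomization test for comparing two systems.

    scores_a, scores_b: per-sentence scores (e.g. 0/1 correctness)
    for systems A and B on the same test set. Returns a p-value for
    the null hypothesis that the two systems are interchangeable.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        diff = 0.0
        # Under the null, swapping A's and B's score on any sentence
        # is equally likely, so flip each pair with probability 0.5.
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    # Add-one smoothing avoids reporting an exact zero p-value.
    return (hits + 1) / (trials + 1)
```

Under the paper's estimate, a p-value from such a test would need to fall below ~0.0025, not the conventional 0.05, before the improvement is reported as significant.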

Original language: English
Title of host publication: Eighteenth Conference on Computational Natural Language Learning: CoNLL-2014
Place of publication: Baltimore, Maryland, USA
Publisher: Association for Computational Linguistics
Publication date: 2014
Pages: 1-10
Publication status: Published - 2014

