Do dependency parsing metrics correlate with human judgments?

Barbara Plank; Hector Martinez Alonso; Zeljko Agic; Danijela Merkler; Anders Søgaard

Centre for Language Technology

Abstract

Using automatic measures such as labeled and unlabeled attachment scores is common practice in dependency parser evaluation. In this paper, we examine whether these measures correlate with human judgments of overall parse quality. We ask linguists with experience in dependency annotation to judge system outputs. We measure the correlation between their judgments and a range of parse evaluation metrics across five languages. The human-metric correlation is lower for dependency parsing than for other NLP tasks. Also, inter-annotator agreement is sometimes higher than the agreement between judgments and metrics, indicating that the standard metrics fail to capture certain aspects of parse quality, such as the relevance of root attachment or the relative importance of the different parts of speech.
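For readers unfamiliar with the attachment scores named in the abstract, the following is a minimal sketch of how unlabeled (UAS) and labeled (LAS) attachment scores are commonly computed for a single sentence. The function name `attachment_scores`, the `Arc` type, and the toy sentence are illustrative assumptions, not taken from the paper, and the paper's own evaluation setup may differ (for example in how punctuation is treated).

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
# Heads are 1-indexed; 0 denotes the artificial root. (Illustrative convention.)
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence.

    UAS: fraction of tokens whose predicted head matches the gold head.
    LAS: fraction of tokens whose predicted head AND label both match.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / n, labeled_hits / n


# Hypothetical toy parse of "She reads books": all heads are right,
# but the label of the last token is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Both scores weight every token equally, which is one reason such metrics may miss the aspects the abstract mentions, such as the relevance of root attachment.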

Original language	English
Title of host publication	The 19th Conference on Computational Natural Language Learning (CoNLL)
Number of pages	6
Publisher	Association for Computational Linguistics
Publication date	2015
Pages	315-320
ISBN (Print)	978-1-941643-77-8
Publication status	Published - 2015

Access to Document

https://aclweb.org/anthology/K/K15/K15-1033.pdf

Cite this

Do dependency parsing metrics correlate with human judgments? / Plank, Barbara; Martinez Alonso, Hector; Agic, Zeljko et al.
The 19th Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, 2015. p. 315-320.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

@inproceedings{52af33d2c5b246808a6078481a96014b,
title = "Do dependency parsing metrics correlate with human judgments?",
abstract = "Using automatic measures such as labeled and unlabeled attachment scores is common practice in dependency parser evaluation. In this paper, we examine whether these measures correlate with human judgments of overall parse quality. We ask linguists with experience in dependency annotation to judge system outputs. We measure the correlation between their judgments and a range of parse evaluation metrics across five languages. The human-metric correlation is lower for dependency parsing than for other NLP tasks. Also, inter-annotator agreement is sometimes higher than the agreement between judgments and metrics, indicating that the standard metrics fail to capture certain aspects of parse quality, such as the relevance of root attachment or the relative importance of the different parts of speech.",
author = "Barbara Plank and {Martinez Alonso}, Hector and Zeljko Agic and Danijela Merkler and Anders S{\o}gaard",
year = "2015",
language = "English",
isbn = "978-1-941643-77-8",
pages = "315--320",
booktitle = "The 19th Conference on Computational Natural Language Learning (CoNLL)",