The SemDaX Corpus - sense annotations with scalable sense inventories

Bolette Sandford Pedersen; Anna Braasch; Anders Trærup Johannsen; Hector Martinez Alonso; Sanni Nimb; Sussi Olsen; Anders Søgaard; Nicolai Sørensen

Department of Nordic Studies and Linguistics

Abstract

We launch the SemDaX corpus, a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish, measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is normally seen: 60% of the material for the all-words task and 100% for the lexical sample task. We include in the corpus not only the adjudicated files but also the diverging annotations. In other words, we do not consider all disagreement to be noise; some of it contains valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.

Original language	English
Title of host publication	Proceedings of the 10th conference of the Language Resources and Evaluation Conference
Number of pages	6
Place of Publication	Portorož, Slovenia.
Publisher	European Language Resources Association
Publication date	2016
Pages	842-847
ISBN (Print)	978-2-9517408-9-1
Publication status	Published - 2016

Access to Document

http://www.lrec-conf.org/proceedings/lrec2016/pdf/306_Paper.pdf

Licence: CC BY-NC

Cite this

Pedersen, B. S., Braasch, A., Johannsen, A. T., Martinez Alonso, H., Nimb, S., Olsen, S., Søgaard, A., & Sørensen, N. (2016). The SemDaX Corpus - sense annotations with scalable sense inventories. In Proceedings of the 10th conference of the Language Resources and Evaluation Conference (pp. 842-847). European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2016/pdf/306_Paper.pdf

The SemDaX Corpus - sense annotations with scalable sense inventories. / Pedersen, Bolette Sandford; Braasch, Anna; Johannsen, Anders Trærup et al.
Proceedings of the 10th conference of the Language Resources and Evaluation Conference. Portorož, Slovenia: European Language Resources Association, 2016. p. 842-847.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Pedersen, BS, Braasch, A, Johannsen, AT, Martinez Alonso, H, Nimb, S, Olsen, S, Søgaard, A & Sørensen, N 2016, The SemDaX Corpus - sense annotations with scalable sense inventories. in Proceedings of the 10th conference of the Language Resources and Evaluation Conference. European Language Resources Association, Portorož, Slovenia, pp. 842-847. <http://www.lrec-conf.org/proceedings/lrec2016/pdf/306_Paper.pdf>

@inproceedings{2849181b77f84f99877bc224a750a349,

title = "The SemDaX Corpus - sense annotations with scalable sense inventories",

abstract = "We launch the SemDaX corpus, a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish, measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is normally seen: 60% of the material for the all-words task and 100% for the lexical sample task. We include in the corpus not only the adjudicated files but also the diverging annotations. In other words, we do not consider all disagreement to be noise; some of it contains valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.",

author = "Pedersen, {Bolette Sandford} and Anna Braasch and Johannsen, {Anders Tr{\ae}rup} and {Martinez Alonso}, Hector and Sanni Nimb and Sussi Olsen and Anders S{\o}gaard and Nicolai S{\o}rensen",

year = "2016",

language = "English",

isbn = "978-2-9517408-9-1",

pages = "842--847",

booktitle = "Proceedings of the 10th conference of the Language Resources and Evaluation Conference",

publisher = "European Language Resources Association",

}

TY - GEN

T1 - The SemDaX Corpus - sense annotations with scalable sense inventories

AU - Pedersen, Bolette Sandford

AU - Braasch, Anna

AU - Johannsen, Anders Trærup

AU - Martinez Alonso, Hector

AU - Nimb, Sanni

AU - Olsen, Sussi

AU - Søgaard, Anders

AU - Sørensen, Nicolai

PY - 2016

Y1 - 2016

N2 - We launch the SemDaX corpus, a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish, measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is normally seen: 60% of the material for the all-words task and 100% for the lexical sample task. We include in the corpus not only the adjudicated files but also the diverging annotations. In other words, we do not consider all disagreement to be noise; some of it contains valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.

M3 - Article in proceedings

SN - 978-2-9517408-9-1

SP - 842

EP - 847

BT - Proceedings of the 10th conference of the Language Resources and Evaluation Conference

PB - European Language Resources Association

CY - Portorož, Slovenia.

ER -
