Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

Bart Jongejan; Hercules Dalianis

CLOSED: Center for Sprogteknologi

36 Citations (Scopus)

5053 Downloads (Pure)

Abstract

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively. We obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish compared to plain suffix lemmatization using a suffix-only lemmatizer. Icelandic deteriorated with 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.

Original language	English
Title	Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
Number of pages	9
Volume	1
Publisher	Association for Computational Linguistics
Publication date	2009
Pages	145-153
ISBN (Print)	978-1-932432-61-9
ISBN (Electronic)	1-932432-61-2
Status	Published - 2009
Event	ACL-IJCNLP 2009 - Singapore, Singapore. Duration: 2 Aug 2009 → 7 Aug 2009. Conference number: 47

Conference

Conference	ACL-IJCNLP 2009
Number	47
Country/Territory	Singapore
City	Singapore
Period	02/08/2009 → 07/08/2009

Keywords

Faculty of Humanities
lemmatization morphology affix

Access to the document

ACLIJCNLP017 Publisher's published version, 193 KB

Citation formats

Jongejan, B., & Dalianis, H. (2009). Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (Vol. 1, pp. 145-153). Association for Computational Linguistics.

Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. / Jongejan, Bart; Dalianis, Hercules.
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Vol. 1. Association for Computational Linguistics, 2009. pp. 145-153.

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Jongejan, B & Dalianis, H 2009, Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. vol. 1, Association for Computational Linguistics, pp. 145-153, ACL-IJCNLP 2009, Singapore, Singapore, 02/08/2009.

Jongejan, Bart ; Dalianis, Hercules. / Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Vol. 1. Association for Computational Linguistics, 2009. pp. 145-153.

@inproceedings{16b1ab50960c11de8bc9000ea68e967b,

title = "Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike",

abstract = "We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively. We obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish compared to plain suffix lemmatization using a suffix-only lemmatizer. Icelandic deteriorated with 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.",

keywords = "Faculty of Humanities, lemmatisering morfologi affiks, lemmatization morphology affix",

author = "Bart Jongejan and Hercules Dalianis",

year = "2009",

language = "English",

isbn = "978-1-932432-61-9",

volume = "1",

pages = "145--153",

booktitle = "Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP",

publisher = "Association for Computational Linguistics",

note = "ACL-IJCNLP 2009 ; Conference date: 02-08-2009 Through 07-08-2009",

}

TY - GEN

T1 - Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

AU - Jongejan, Bart

AU - Dalianis, Hercules

N1 - Conference code: 47

PY - 2009

Y1 - 2009

N2 - We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively. We obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish compared to plain suffix lemmatization using a suffix-only lemmatizer. Icelandic deteriorated with 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.

AB - We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively. We obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish compared to plain suffix lemmatization using a suffix-only lemmatizer. Icelandic deteriorated with 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.

KW - Faculty of Humanities

KW - lemmatisering morfologi affiks

KW - lemmatization morphology affix

M3 - Article in proceedings

SN - 978-1-932432-61-9

VL - 1

SP - 145

EP - 153

BT - Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

PB - Association for Computational Linguistics

T2 - ACL-IJCNLP 2009

Y2 - 2 August 2009 through 7 August 2009

ER -
