Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

Bart Jongejan, Hercules Dalianis


Abstract

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs. Compared to a suffix-only lemmatizer, we obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish, while results for Icelandic deteriorated by 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.
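
To make the idea of rules that change prefixes, infixes and suffixes concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical rule format in which each `*` in a pattern captures part of the full form and is copied into the matching `*` slot of the lemma template. The three rules are hand-written for illustration rather than trained, and rule selection is simplified to "first match wins".

```python
import re

# Hypothetical, hand-written rules (not taken from the paper):
# (pattern over the full form, template for the lemma).
RULES = [
    ("ge*t", "*en"),   # prefix and suffix change: "gemacht" -> "machen"
    ("*ies", "*y"),    # suffix-only change: "ponies" -> "pony"
    ("*ee*", "*oo*"),  # infix change: "feet" -> "foot"
]


def compile_rule(pattern):
    """Turn a wildcard pattern into a regex; each '*' captures one or more characters."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "(.+)".join(parts) + "$")


def apply_rule(word, pattern, template):
    """Return the lemma if the pattern matches the whole word, else None."""
    match = compile_rule(pattern).match(word)
    if match is None:
        return None
    captured = iter(match.groups())
    # Copy captured stretches into the '*' slots of the template, in order.
    return "".join(next(captured) if ch == "*" else ch for ch in template)


def lemmatize(word, rules=RULES):
    """Apply the first matching rule; leave the word unchanged if none matches."""
    for pattern, template in rules:
        lemma = apply_rule(word, pattern, template)
        if lemma is not None:
            return lemma
    return word


if __name__ == "__main__":
    for form in ["gemacht", "ponies", "feet", "walk"]:
        print(form, "->", lemmatize(form))
```

A suffix-only lemmatizer corresponds to restricting every pattern to the form `*suffix`; allowing literal text before and between the wildcards is what lets a single rule cover a prefix or infix change as well.
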
Original language: English
Title of host publication: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
Number of pages: 9
Volume: 1
Publisher: Association for Computational Linguistics
Publication date: 2009
Pages: 145-153
ISBN (Print): 978-1-932432-61-9
ISBN (Electronic): 1-932432-61-2
Publication status: Published - 2009
Event: ACL-IJCNLP 2009 - Singapore, Singapore
Duration: 2 Aug 2009 to 7 Aug 2009
Conference number: 47

Conference

Conference: ACL-IJCNLP 2009
Number: 47
Country/Territory: Singapore
City: Singapore
Period: 02/08/2009 to 07/08/2009

Keywords

  • Faculty of Humanities
  • lemmatization morphology affix
