Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Marcel Bollmann; Anders Søgaard

Department of Computer Science

19 Citations (Scopus)

59 Downloads (Pure)

Abstract

Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.

Original language	English
Title	The 26th International Conference on Computational Linguistics: Proceedings of COLING 2016: Technical Papers
Number of pages	9
Publication date	2016
Pages	131-139
ISBN (Electronic)	978-4-87974-702-0
Status	Published - 2016
Event	The 26th International Conference on Computational Linguistics - Osaka, Japan. Duration: 11 Dec 2016 → 16 Dec 2016. Conference number: 26

Conference

Conference	The 26th International Conference on Computational Linguistics
Number	26
Country/Territory	Japan
City	Osaka
Period	11 Dec 2016 → 16 Dec 2016

Access to Document

Bollmann_2016_Improving_historical (Final published version, 481 KB, License: CC BY)

http://coling2016.anlp.jp/doc/main.pdf (License: CC BY)

Citation formats

Bollmann, M & Søgaard, A 2016, Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. in The 26th International Conference on Computational Linguistics: Proceedings of COLING 2016: Technical Papers. pp. 131-139, The 26th International Conference on Computational Linguistics, Osaka, Japan, 11 Dec 2016. <http://coling2016.anlp.jp/doc/main.pdf>

@inproceedings{d9aefd5378c6495982ca85951b377b32,
title = "Improving historical spelling normalization with bi-directional LSTMs and multi-task learning",
abstract = "Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.",
author = "Marcel Bollmann and Anders S{\o}gaard",
year = "2016",
language = "English",
pages = "131--139",
booktitle = "The 26th International Conference on Computational Linguistics",
note = "The 26th International Conference on Computational Linguistics ; Conference date: 11-12-2016 Through 16-12-2016",
}

TY - GEN
T1 - Improving historical spelling normalization with bi-directional LSTMs and multi-task learning
AU - Bollmann, Marcel
AU - Søgaard, Anders
N1 - Conference code: 26
PY - 2016
Y1 - 2016
N2 - Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.
AB - Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.
M3 - Article in proceedings
SP - 131
EP - 139
BT - The 26th International Conference on Computational Linguistics
T2 - The 26th International Conference on Computational Linguistics
Y2 - 11 December 2016 through 16 December 2016
ER -
