Quality indicators of LSP texts – selection and measurements: Measuring the terminological usefulness of documents for an LSP corpus

Jakob Halskov; Dorte Haltrup Hansen; Anna Braasch; Sussi Olsen

Centre for Language Technology

Abstract

This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and "poor" LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of 'best practice' for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.
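The threshold-based filtering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two indicators are taken from the examples named in the abstract (word length and out-of-vocabulary items), while the threshold values, the function names, the tokenizer, and the assumption that higher values signal a better LSP text are all hypothetical choices made for illustration.

```python
import re

# Hypothetical thresholds. The paper induces its thresholds from a
# reference corpus; these numbers are placeholders for illustration only.
MIN_MEAN_WORD_LENGTH = 5.0   # readability/formality indicator
MIN_OOV_RATIO = 0.10         # density-of-specialized-knowledge indicator


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def mean_word_length(tokens):
    """Average token length in characters (0.0 for an empty text)."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def oov_ratio(tokens, vocabulary):
    """Share of tokens not found in a general-language vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def passes_filter(text, vocabulary):
    """Accept a candidate text only if every indicator meets its threshold."""
    tokens = tokenize(text)
    return (mean_word_length(tokens) >= MIN_MEAN_WORD_LENGTH
            and oov_ratio(tokens, vocabulary) >= MIN_OOV_RATIO)


def precision_recall(predicted, relevant):
    """Precision and recall of a predicted set against a gold-standard set."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

With a small general-language vocabulary, `passes_filter` rejects an everyday sentence and accepts a term-dense one, and `precision_recall` compares the system's selection with a manually labelled set in the way the reported precision and recall figures are defined.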

Original language	English
Title of host publication	Proceedings of the Seventh International Conference on Language Resources and Evaluation : LREC 2010
Number of pages	7
Place of Publication	Valletta, Malta
Publisher	European Language Resources Association
Publication date	2010
Pages	2614-2620
ISBN (Electronic)	2-9517408-6-7
Publication status	Published - 2010
Event	7th Language Resources and Evaluation Conference - Valletta, Malta Duration: 17 May 2010 → 23 May 2010

Conference

Conference	7th Language Resources and Evaluation Conference
Country/Territory	Malta
City	Valletta
Period	17/05/2010 → 23/05/2010

Cite this

Quality indicators of LSP texts – selection and measurements: Measuring the terminological usefulness of documents for an LSP corpus. / Halskov, Jakob; Hansen, Dorte Haltrup; Braasch, Anna et al.
Proceedings of the Seventh International Conference on Language Resources and Evaluation: LREC 2010. Valletta, Malta: European Language Resources Association, 2010. p. 2614-2620.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Halskov, J, Hansen, DH, Braasch, A & Olsen, S 2010, Quality indicators of LSP texts – selection and measurements: Measuring the terminological usefulness of documents for an LSP corpus. in Proceedings of the Seventh International Conference on Language Resources and Evaluation: LREC 2010. European Language Resources Association, Valletta, Malta, pp. 2614-2620, 7th Language Resources and Evaluation Conference, Valletta, Malta, 17/05/2010.

Halskov, Jakob ; Hansen, Dorte Haltrup ; Braasch, Anna et al. / Quality indicators of LSP texts – selection and measurements : Measuring the terminological usefulness of documents for an LSP corpus. Proceedings of the Seventh International Conference on Language Resources and Evaluation: LREC 2010. Valletta, Malta : European Language Resources Association, 2010. pp. 2614-2620

@inproceedings{f4c451de2916481191c4ab7ca130d83d,

title = "Quality indicators of LSP texts – selection and measurements: Measuring the terminological usefulness of documents for an LSP corpus",

abstract = "This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and {"}poor{"} LSP texts with a precision of 100\% and a recall of 55\%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of 'best practice' for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.",

author = "Jakob Halskov and Hansen, {Dorte Haltrup} and Anna Braasch and Sussi Olsen",

year = "2010",

language = "English",

pages = "2614--2620",

booktitle = "Proceedings of the Seventh International Conference on Language Resources and Evaluation",

publisher = "European Language Resources Association",

note = "7th Language Resources and Evaluation Conference ; Conference date: 17-05-2010 Through 23-05-2010",

}

TY - GEN

T1 - Quality indicators of LSP texts – selection and measurements

T2 - 7th Language Resources and Evaluation Conference

AU - Halskov, Jakob

AU - Hansen, Dorte Haltrup

AU - Braasch, Anna

AU - Olsen, Sussi

PY - 2010

Y1 - 2010

N2 - This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and "poor" LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of 'best practice' for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.

M3 - Article in proceedings

SP - 2614

EP - 2620

BT - Proceedings of the Seventh International Conference on Language Resources and Evaluation

PB - European Language Resources Association

CY - Valletta, Malta

Y2 - 17 May 2010 through 23 May 2010

ER -
