The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America

Karoline Kühl; Jan Heegård; Gert Foget Hansen

doi:10.1007/s10579-019-09473-5

1 citation (Scopus)

Abstract

The paper describes a newly established corpus of spoken immigrant Danish in North and South America, the Corpus of American Danish (CoAmDa). In its current state, the CoAmDa amounts to approx. 1.7 million tokens, which makes it one of the largest heritage language corpora at present. With regard to text type, the CoAmDa can be characterized as non-standard multilingual spoken language, as American English, Canadian English or Argentine Spanish, respectively, are present in the audio data and transcriptions.
The aim of this paper is to document relevant aspects and specifications of the CoAmDa, viz. the audio data combined with sociodemographic metadata on the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures that have been applied. In doing so, we share our experience and best practices for achieving a high-quality spoken language resource with the interested public, in particular other researchers working on and with multilingual speech corpora.

Original language	English
Journal	Language Resources and Evaluation
Pages (from-to)	1
Number of pages	19
ISSN	1574-020X
DOI	https://doi.org/10.1007/s10579-019-09473-5
Status	Published - 1 Sep 2020

Keywords

Faculty of Humanities
spoken language resource
language contact
multilingual spoken language
Danish language
heritage language
Corpus (creation, annotation, etc.)

Access to document

10.1007/s10579-019-09473-5

Other files and links

https://rdcu.be/bPp7y

Citation formats

@article{30f23da9873247c697b4e7bd11e77e0b,

title = "The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America",

abstract = "This paper describes the {\textquoteleft}Corpus of American Danish{\textquoteright} (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDA, viz. the audio data, the sociodemographic metadata of the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures. In so doing, we wish to share our experience and best practices with regard to achieving a spoken language resource of high quality with the interested public, in particular other researchers working on and with multilingual speech corpora.",

keywords = "Faculty of Humanities, spoken language resource, language contact, multilingual spoken language, Danish language, heritage language, Corpus (creation, annotation, etc.), Corpus documentation, Spoken language resource, Validation procedures, Heritage language, Danish, Multilingual spoken language, Language contact",

author = "K{\"u}hl, Karoline and Heeg{\aa}rd, Jan and Hansen, {Gert Foget}",

year = "2020",

month = sep,

day = "1",

doi = "10.1007/s10579-019-09473-5",

language = "English",

pages = "1",

journal = "Language Resources and Evaluation",

issn = "1574-020X",

publisher = "Springer",

}

TY - JOUR

T1 - The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America

AU - Kühl, Karoline

AU - Heegård, Jan

AU - Hansen, Gert Foget

PY - 2020/9/1

Y1 - 2020/9/1

N2 - This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDA, viz. the audio data, the sociodemographic metadata of the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures. In so doing, we wish to share our experience and best practices with regard to achieving a spoken language resource of high quality with the interested public, in particular other researchers working on and with multilingual speech corpora.

AB - This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDA, viz. the audio data, the sociodemographic metadata of the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures. In so doing, we wish to share our experience and best practices with regard to achieving a spoken language resource of high quality with the interested public, in particular other researchers working on and with multilingual speech corpora.

KW - Faculty of Humanities

KW - spoken language resource

KW - language contact

KW - multilingual spoken language

KW - Danish language

KW - heritage language

KW - Corpus (creation, annotation, etc.)

KW - Corpus documentation

KW - Spoken language resource

KW - Validation procedures

KW - Heritage language

KW - Danish

KW - Multilingual spoken language

KW - Language contact

UR - https://rdcu.be/bPp7y

U2 - 10.1007/s10579-019-09473-5

DO - 10.1007/s10579-019-09473-5

M3 - Journal article

SN - 1574-020X

SP - 1

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

ER -
