TY - JOUR
T1 - The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America
AU - Kühl, Karoline
AU - Heegård, Jan
AU - Hansen, Gert Foget
PY - 2020/9/1
Y1 - 2020/9/1
N2 - This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDa, viz. the audio data, the sociodemographic metadata of the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures. In so doing, we wish to share our experience and best practices with regard to achieving a spoken language resource of high quality with the interested public, in particular other researchers working on and with multilingual speech corpora.
AB - This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDa, viz. the audio data, the sociodemographic metadata of the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures. In so doing, we wish to share our experience and best practices with regard to achieving a spoken language resource of high quality with the interested public, in particular other researchers working on and with multilingual speech corpora.
KW - Faculty of Humanities
KW - spoken language resource
KW - language contact
KW - multilingual spoken language
KW - Danish language
KW - heritage language
KW - Corpus (creation, annotation, etc.)
KW - Corpus documentation
KW - Validation procedures
KW - Danish
UR - https://rdcu.be/bPp7y
U2 - 10.1007/s10579-019-09473-5
DO - 10.1007/s10579-019-09473-5
M3 - Journal article
SN - 1574-020X
SP - 1
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
ER -