Challenges of Studying and Processing Dialects in Social Media

Anna Jørgensen, Dirk Hovy, Anders Søgaard

Abstract

Dialect features typically do not make it into formal writing, but flourish in social media. This enables large-scale variational studies. We focus on three phonological features of African American Vernacular English and their manifestation as spelling variations on Twitter. We discuss to what extent our data can be used to falsify eight sociolinguistic hypotheses. To go beyond the spelling level, we require automatic analysis such as POS tagging, but social media language still challenges language technologies. We show how both newswire- and Twitter-adapted state-of-the-art POS taggers perform significantly worse on AAVE tweets, suggesting that large-scale dialect studies of language variation beyond the surface level are not feasible with out-of-the-box NLP tools.

Original language: English
Title: ACL 2015 Workshop on Noisy User-generated Text
Number of pages: 10
Place of publication: Red Hook, NY
Publisher: Association for Computational Linguistics
Publication date: 2015
Pages: 9-18
ISBN (Print): 978-1-941643-69-3
Status: Published - 2015
