Challenges of Studying and Processing Dialects in Social Media

Anna Jørgensen, Dirk Hovy, Anders Søgaard

Abstract

Dialect features typically do not make it into formal writing, but flourish in social media. This enables large-scale variational studies. We focus on three phonological features of African American Vernacular English and their manifestation as spelling variations on Twitter. We discuss to what extent our data can be used to falsify eight sociolinguistic hypotheses. To go beyond the spelling level, we require automatic analysis such as POS tagging, but social media language still challenges language technologies. We show how both newswire- and Twitter-adapted state-of-the-art POS taggers perform significantly worse on AAVE tweets, suggesting that large-scale dialect studies of language variation beyond the surface level are not feasible with out-of-the-box NLP tools.

Original language: English
Title of host publication: ACL 2015 Workshop on Noisy User-generated Text
Number of pages: 10
Place of Publication: Red Hook, NY
Publisher: Association for Computational Linguistics
Publication date: 2015
Pages: 9-18
ISBN (Print): 978-1-941643-69-3
Publication status: Published - 2015
