CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision

Alexander Junge; Lars Juhl Jensen

doi:10.1093/bioinformatics/btz490

Disease Systems Biology Program

3 Citations (Scopus)

3 Downloads (Pure)

Abstract

Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.
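The distant-supervision labeling step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `label_co_mentions`, the tuple layout, and the example entities and sentences are all hypothetical, chosen only to show how co-mentions might be labeled by their presence in or absence from a gold standard.

```python
# Minimal sketch of distant-supervision labeling (hypothetical names and data,
# not taken from the CoCoScore code base).

def label_co_mentions(co_mentions, gold_standard):
    """Label each co-mentioning sentence 1 if its entity pair is in the
    gold standard and 0 otherwise. Pairs are treated as unordered."""
    labeled = []
    for entity_a, entity_b, sentence in co_mentions:
        pair = frozenset((entity_a, entity_b))
        labeled.append((entity_a, entity_b, sentence, int(pair in gold_standard)))
    return labeled


# Toy gold standard: a set of unordered entity pairs assumed to be associated.
gold_standard = {frozenset(("BRCA1", "breast cancer"))}

# Toy co-mentions: (entity, entity, sentence) triples, invented for illustration.
co_mentions = [
    ("BRCA1", "breast cancer", "Mutations in BRCA1 increase the risk of breast cancer."),
    ("breast cancer", "BRCA1", "Breast cancer cohorts were screened for BRCA1."),
    ("TP53", "asthma", "TP53 and asthma were both discussed in the review."),
]

labels = [row[3] for row in label_co_mentions(co_mentions, gold_standard)]
print(labels)
```

Labels produced this way are noisy at the sentence level, since a sentence can co-mention a gold-standard pair without stating the association; the abstract's point is that no manually annotated corpus is needed.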

Original language	English
Journal	Bioinformatics
Number of pages	8
ISSN	1367-4803
DOI	https://doi.org/10.1093/bioinformatics/btz490
Status	Published - 1 Jan 2020

Access to document

10.1093/bioinformatics/btz490 (License: CC BY)

CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision (Publisher's published version, 1.09 MB, License: CC BY)

Citation formats

@article{eb7bccc429744c10897bca2590bbe4b0,

title = "{CoCoScore}: Context-aware co-occurrence scoring for text mining applications using distant supervision",

abstract = "Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.",

author = "Junge, Alexander and Jensen, {Lars Juhl}",

year = "2020",

month = jan,

day = "1",

doi = "10.1093/bioinformatics/btz490",

language = "English",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

}

TY - JOUR

T1 - CoCoScore

T2 - Context-aware co-occurrence scoring for text mining applications using distant supervision

AU - Junge, Alexander

AU - Jensen, Lars Juhl

PY - 2020/1/1

Y1 - 2020/1/1

N2 - Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.

AB - Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.

U2 - 10.1093/bioinformatics/btz490

DO - 10.1093/bioinformatics/btz490

M3 - Journal article

C2 - 31199464

SN - 1367-4803

JO - Bioinformatics

JF - Bioinformatics

ER -
