LocText: relation extraction of protein localizations to assist database curation

Juan Miguel Cejuela; Shrikant Vinchurkar; Tatyana Goldberg; Madhukar Sollepura Prabhu Shankar; Ashish Baghudana; Aleksandar Bojchevski; Carsten Uhlig; André Ofner; Pandu Raharja-Liu; Lars Juhl Jensen; Burkhard Rost

doi:10.1186/s12859-018-2021-9

Disease Systems Biology Program

8 Citations (Scopus)

55 Downloads (Pure)

Abstract

BACKGROUND: The subcellular localization of a protein is an important aspect of its function. However, the experimental annotation of locations is incomplete even for well-studied model organisms. Text mining might help database curators add experimental annotations from the scientific literature. Existing extraction methods have difficulty distinguishing relationships between proteins and cellular locations co-mentioned in the same sentence.

RESULTS: LocText was created as a new method to extract protein locations from abstracts and full texts. LocText learned patterns from syntax parse trees and was trained and evaluated on a newly improved LocTextCorpus. Combined with an automatic named-entity recognizer, LocText achieved high precision (P = 86%±4). After completing development, we mined the latest research publications for three organisms: human (Homo sapiens), budding yeast (Saccharomyces cerevisiae), and thale cress (Arabidopsis thaliana). Examining 60 novel, text-mined annotations, we found that 65% (human), 85% (yeast), and 80% (cress) were correct. Of all validated annotations, 40% were completely novel, i.e. they appeared neither in the annotations nor in the text descriptions of Swiss-Prot.

CONCLUSIONS: LocText provides a cost-effective, semi-automated workflow to assist database curators in identifying novel protein localization annotations. The annotations suggested through text mining would be verified by experts to guarantee the high quality standards of manually curated databases such as Swiss-Prot.

Original language	English
Article number	15
Journal	BMC Bioinformatics
Volume	19
Pages (from-to)	1-11
ISSN	1471-2105
DOI	https://doi.org/10.1186/s12859-018-2021-9
Status	Published - 17 Jan 2018

Access to Document

10.1186/s12859-018-2021-9

LocText: relation extraction of protein localizations to assist database curation (publisher's published version, 814 KB)

Citation formats

@article{c0ba217e3b3a4b889b5e8ed828ff4e2a,

title = "LocText: relation extraction of protein localizations to assist database curation",

author = "Cejuela, {Juan Miguel} and Shrikant Vinchurkar and Tatyana Goldberg and {Prabhu Shankar}, {Madhukar Sollepura} and Ashish Baghudana and Aleksandar Bojchevski and Carsten Uhlig and Andr{\'e} Ofner and Pandu Raharja-Liu and Jensen, {Lars Juhl} and Burkhard Rost",

year = "2018",

month = jan,

day = "17",

doi = "10.1186/s12859-018-2021-9",

language = "English",

volume = "19",

pages = "1--11",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central Ltd.",

}

TY - JOUR

T1 - LocText

T2 - relation extraction of protein localizations to assist database curation

AU - Cejuela, Juan Miguel

AU - Vinchurkar, Shrikant

AU - Goldberg, Tatyana

AU - Prabhu Shankar, Madhukar Sollepura

AU - Baghudana, Ashish

AU - Bojchevski, Aleksandar

AU - Uhlig, Carsten

AU - Ofner, André

AU - Raharja-Liu, Pandu

AU - Jensen, Lars Juhl

AU - Rost, Burkhard

PY - 2018/1/17

Y1 - 2018/1/17

U2 - 10.1186/s12859-018-2021-9

DO - 10.1186/s12859-018-2021-9

M3 - Journal article

C2 - 29343218

SN - 1471-2105

VL - 19

SP - 1

EP - 11

JO - BMC Bioinformatics

JF - BMC Bioinformatics

M1 - 15

ER -
