A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

David Westergaard; Hans-Henrik Stærfeldt; Christian Tønsberg; Lars Juhl Jensen; Søren Brunak

doi:10.1371/journal.pcbi.1005962

Disease Systems Biology Program

Abstract

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

Original language	English
Article number	e1005962
Journal	PLoS Computational Biology
Volume	14
Issue number	2
Number of pages	16
ISSN	1553-7358
DOIs	https://doi.org/10.1371/journal.pcbi.1005962
Publication status	Published - 2018

Keywords

Abstracting and Indexing as Topic
Area Under Curve
Computational Biology/methods
Data Mining/methods
False Positive Reactions
Genes
Information Storage and Retrieval
MEDLINE
Periodicals as Topic
Proteins/genetics
ROC Curve
Software
Terminology as Topic

Access to Document

10.1371/journal.pcbi.1005962 (Licence: CC BY)

A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts: final published version, 3.97 MB (Licence: CC BY)

Cite this

@article{1d40ef32b0fb48c8989c3ba2f1a5cee6,

title = "A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts",

abstract = "Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.",

keywords = "Abstracting and Indexing as Topic, Area Under Curve, Computational Biology/methods, Data Mining/methods, False Positive Reactions, Genes, Information Storage and Retrieval, MEDLINE, Periodicals as Topic, Proteins/genetics, ROC Curve, Software, Terminology as Topic",

author = "Westergaard, David and St{\ae}rfeldt, Hans-Henrik and T{\o}nsberg, Christian and Jensen, {Lars Juhl} and Brunak, S{\o}ren",

year = "2018",

doi = "10.1371/journal.pcbi.1005962",

language = "English",

volume = "14",

journal = "PLoS Computational Biology",

issn = "1553-7358",

publisher = "Public Library of Science",

number = "2",

}

TY - JOUR

T1 - A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

AU - Westergaard, David

AU - Stærfeldt, Hans-Henrik

AU - Tønsberg, Christian

AU - Jensen, Lars Juhl

AU - Brunak, Søren

PY - 2018

Y1 - 2018

N2 - Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

AB - Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

KW - Abstracting and Indexing as Topic

KW - Area Under Curve

KW - Computational Biology/methods

KW - Data Mining/methods

KW - False Positive Reactions

KW - Genes

KW - Information Storage and Retrieval

KW - MEDLINE

KW - Periodicals as Topic

KW - Proteins/genetics

KW - ROC Curve

KW - Software

KW - Terminology as Topic

U2 - 10.1371/journal.pcbi.1005962

DO - 10.1371/journal.pcbi.1005962

M3 - Journal article

C2 - 29447159

SN - 1553-7358

VL - 14

JO - PLoS Computational Biology

JF - PLoS Computational Biology

IS - 2

M1 - e1005962

ER -
