A Guide to Dictionary-Based Text Mining

Helen V. Cook; Lars Juhl Jensen

doi:10.1007/978-1-4939-9089-4_5

Disease Systems Biology Program

6 Citations (Scopus)

Abstract

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.

Original language	English
Title of host publication	Bioinformatics and Drug Discovery
Editors	Richard S. Larson, Tudor I. Oprea
Number of pages	17
Volume	1939
Publisher	Humana Press
Publication date	2019
Edition	3
Pages	73-89
ISBN (Print)	978-1-4939-9088-7
ISBN (Electronic)	978-1-4939-9089-4
DOIs	https://doi.org/10.1007/978-1-4939-9089-4_5
Publication status	Published - 2019

Series	Methods in Molecular Biology
ISSN	1064-3745

Cite this

@inbook{908e1ce58ec24448b100deaa69e50bd7,
  title = "A Guide to Dictionary-Based Text Mining",
  abstract = "PubMed contains more than 27 million documents, and this number is growing at an estimated 4\% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.",
  author = "Cook, {Helen V.} and Jensen, {Lars Juhl}",
  year = "2019",
  doi = "10.1007/978-1-4939-9089-4_5",
  language = "English",
  isbn = "978-1-4939-9088-7",
  volume = "1939",
  series = "Methods in Molecular Biology",
  publisher = "Humana Press",
  pages = "73--89",
  editor = "Larson, {Richard S.} and Oprea, {Tudor I.}",
  booktitle = "Bioinformatics and Drug Discovery",
  address = "United States",
  edition = "3",
}

TY  - CHAP
T1  - A Guide to Dictionary-Based Text Mining
AU  - Cook, Helen V.
AU  - Jensen, Lars Juhl
PY  - 2019
Y1  - 2019
AB  - PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
U2  - 10.1007/978-1-4939-9089-4_5
DO  - 10.1007/978-1-4939-9089-4_5
M3  - Book chapter
C2  - 30848457
SN  - 978-1-4939-9088-7
VL  - 1939
T3  - Methods in Molecular Biology
SP  - 73
EP  - 89
BT  - Bioinformatics and Drug Discovery
A2  - Larson, Richard S.
A2  - Oprea, Tudor I.
PB  - Humana Press
ER  -
