A Guide to Dictionary-Based Text Mining

Helen V. Cook; Lars Juhl Jensen

doi:10.1007/978-1-4939-9089-4_5

Disease Systems Biology Program

6 Citations (Scopus)

Abstract

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.

Original language	English
Book title	Bioinformatics and Drug Discovery
Editors	Richard S. Larson, Tudor I. Oprea
Number of pages	17
Volume	1939
Publisher	Humana Press
Publication date	2019
Edition	3
Pages	73-89
ISBN (Print)	978-1-4939-9088-7
ISBN (Electronic)	978-1-4939-9089-4
DOI	https://doi.org/10.1007/978-1-4939-9089-4_5
Status	Published - 2019

Series name	Methods in Molecular Biology
ISSN	1064-3745

Citation formats

@inbook{908e1ce58ec24448b100deaa69e50bd7,
title = "A Guide to Dictionary-Based Text Mining",
abstract = "PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.",

author = "Cook, {Helen V.} and Jensen, {Lars Juhl}",

year = "2019",

doi = "10.1007/978-1-4939-9089-4_5",

language = "English",

isbn = " 978-1-4939-9088-7",

volume = "1939",

series = "Methods in Molecular Biology",

publisher = "Humana Press",

pages = "73--89",

editor = "Larson, {Richard S.} and Oprea, {Tudor I.}",

booktitle = "Bioinformatics and Drug Discovery",

address = "United States",

edition = "3",

}

TY - CHAP
T1 - A Guide to Dictionary-Based Text Mining
AU - Cook, Helen V.
AU - Jensen, Lars Juhl
PY - 2019
Y1 - 2019
AB - PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
U2 - 10.1007/978-1-4939-9089-4_5
DO - 10.1007/978-1-4939-9089-4_5
M3 - Book chapter
C2 - 30848457
SN - 978-1-4939-9088-7
VL - 1939
T3 - Methods in Molecular Biology
SP - 73
EP - 89
BT - Bioinformatics and Drug Discovery
A2 - Larson, Richard S.
A2 - Oprea, Tudor I.
PB - Humana Press
ER -
