Large scale identification and categorization of protein sequences using structured logistic regression

Bjørn Panella Pedersen; Georgiana Ifrim; Poul Liboriussen; Kristian B Axelsen; Michael Broberg Palmgren; Poul Nissen; Carsten Henrik Wiuf; Christian N S Pedersen

doi:10.1371/journal.pone.0085139

Abstract

Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem.

Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 predefined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.

Conclusions: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.

Original language	English
Article number	e85139
Journal	PLOS ONE
Volume	9
Issue number	1
Number of pages	11
ISSN	1932-6203
DOIs	https://doi.org/10.1371/journal.pone.0085139
Publication status	Published - 20 Jan 2014

Cite this

@article{3e2e68d14b43459790d4efd03708868d,
  title = "Large scale identification and categorization of protein sequences using structured logistic regression",
  abstract = "Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 predefined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. Conclusions: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.",
  author = "Pedersen, {Bj{\o}rn Panella} and Georgiana Ifrim and Poul Liboriussen and Axelsen, {Kristian B} and Palmgren, {Michael Broberg} and Poul Nissen and Wiuf, {Carsten Henrik} and Pedersen, {Christian N S}",
  year = "2014",
  month = jan,
  day = "20",
  doi = "10.1371/journal.pone.0085139",
  language = "English",
  volume = "9",
  journal = "PLOS ONE",
  issn = "1932-6203",
  publisher = "Public Library of Science",
  number = "1",
}

TY  - JOUR
T1  - Large scale identification and categorization of protein sequences using structured logistic regression
AU  - Pedersen, Bjørn Panella
AU  - Ifrim, Georgiana
AU  - Liboriussen, Poul
AU  - Axelsen, Kristian B
AU  - Palmgren, Michael Broberg
AU  - Nissen, Poul
AU  - Wiuf, Carsten Henrik
AU  - Pedersen, Christian N S
PY  - 2014/1/20
Y1  - 2014/1/20
AB  - Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 predefined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. Conclusions: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
U2  - 10.1371/journal.pone.0085139
DO  - 10.1371/journal.pone.0085139
M3  - Journal article
C2  - 24465495
SN  - 1932-6203
VL  - 9
JO  - PLOS ONE
JF  - PLOS ONE
IS  - 1
M1  - e85139
ER  -
