Bounded coordinate-descent for biological sequence classification in high dimensional predictor space

Georgiana Ifrim; Carsten Wiuf

doi:10.1145/2020408.2020519

Georgiana Ifrim^*, Carsten Wiuf

^*Corresponding author for this work

20 citations (Scopus)

Abstract

We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem - a crucial requirement for the bioinformatics and medical communities.
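
To make the idea of a model that is "simply a list of weighted discriminative subsequences" concrete, here is a minimal Python sketch of greedy coordinate descent on the logistic (binomial log-likelihood) loss with binary substring features. It is an illustration under stated assumptions, not the authors' implementation: the function names (`substrings`, `fit_greedy_cd`, `score`), the `max_len` cap, and the conservative step size are all hypothetical choices. Most importantly, the sketch enumerates every candidate substring explicitly, which is exactly the expansion the paper's gradient bound is meant to avoid.

```python
import math
from collections import defaultdict


def substrings(seq, max_len):
    """Return the set of contiguous substrings of seq with 1..max_len characters."""
    return {seq[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}


def fit_greedy_cd(seqs, labels, max_len=3, iters=50):
    """Greedy coordinate descent on the summed logistic loss.

    Features are binary indicators "substring occurs in sequence".
    labels must be +1 or -1. Returns {substring: weight} for the
    substrings that were selected at least once.
    ASSUMPTION: exhaustive enumeration up to max_len (no gradient bound).
    """
    # Inverted index: substring -> indices of the sequences that contain it.
    index = defaultdict(list)
    for i, s in enumerate(seqs):
        for sub in substrings(s, max_len):
            index[sub].append(i)

    weights = defaultdict(float)
    margins = [0.0] * len(seqs)  # margins[i] = labels[i] * score of sequence i

    for _ in range(iters):
        # Per-sequence factor of the gradient of log(1 + exp(-margin)).
        g = [-y / (1.0 + math.exp(m)) for y, m in zip(labels, margins)]

        # Greedy step: pick the coordinate with the largest |gradient|.
        best, best_grad = None, 0.0
        for sub, idxs in index.items():
            grad = sum(g[i] for i in idxs)
            if abs(grad) > abs(best_grad):
                best, best_grad = sub, grad
        if best is None:
            break  # every gradient is exactly zero

        # Conservative step 1/L: the loss curvature along this coordinate
        # is at most (number of sequences containing the substring) / 4.
        delta = -4.0 * best_grad / len(index[best])
        weights[best] += delta
        for i in index[best]:
            margins[i] += labels[i] * delta

    return dict(weights)


def score(model, seq):
    """Sum the weights of all model substrings that occur in seq."""
    return sum(w for sub, w in model.items() if sub in seq)


if __name__ == "__main__":
    toy_seqs = ["AAGT", "CAGA", "TTCC", "GTCC"]
    toy_labels = [1, 1, -1, -1]
    model = fit_greedy_cd(toy_seqs, toy_labels)
    for sub, w in sorted(model.items(), key=lambda kv: -abs(kv[1])):
        print(sub, round(w, 3))
```

The returned dictionary can be read directly: positive weights mark substrings associated with the positive class, negative weights the opposite. Swapping in another differentiable loss, such as the squared hinge loss, would only change how the per-sequence gradient factor and the step size are computed.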

Original language	English
Title	Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11
Number of pages	9
Publication date	16 Sep 2011
Pages	708-716
ISBN (Print)	9781450308137
DOI	https://doi.org/10.1145/2020408.2020519
Status	Published - 16 Sep 2011
Published externally	Yes
Event	17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11 - San Diego, CA, USA. Duration: 21 Aug 2011 → 24 Aug 2011

Conference

Conference	17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11
Country/Territory	USA
City	San Diego, CA
Period	21/08/2011 → 24/08/2011
Sponsor	ACM Spec. Interest Group Knowl. Discov. Data (SIGKDD), ACM SIGMOD

Access to the document

10.1145/2020408.2020519

Other files and links

Link to publication in Scopus

Citation formats

Bounded coordinate-descent for biological sequence classification in high dimensional predictor space. / Ifrim, Georgiana; Wiuf, Carsten.
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11. 2011. pp. 708-716.

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Ifrim, G & Wiuf, C 2011, Bounded coordinate-descent for biological sequence classification in high dimensional predictor space. in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11. pp. 708-716, 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11, San Diego, CA, USA, 21/08/2011. https://doi.org/10.1145/2020408.2020519

@inproceedings{6ebabeaa01234698bdbcc6c3b330d1d0,

title = "Bounded coordinate-descent for biological sequence classification in high dimensional predictor space",

abstract = "We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem - a crucial requirement for the bioinformatics and medical communities.",

keywords = "Greedy coordinate-descent, Logistic regression, Sequence classification, String classification, Support vector machines",

author = "Georgiana Ifrim and Carsten Wiuf",

year = "2011",

month = sep,

day = "16",

doi = "10.1145/2020408.2020519",

language = "English",

isbn = "9781450308137",

pages = "708--716",

booktitle = "Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11",

note = "17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11 ; Conference date: 21-08-2011 Through 24-08-2011",

}

TY - GEN

T1 - Bounded coordinate-descent for biological sequence classification in high dimensional predictor space

AU - Ifrim, Georgiana

AU - Wiuf, Carsten

PY - 2011/9/16

Y1 - 2011/9/16

N2 - We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem - a crucial requirement for the bioinformatics and medical communities.

AB - We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem - a crucial requirement for the bioinformatics and medical communities.

KW - Greedy coordinate-descent

KW - Logistic regression

KW - Sequence classification

KW - String classification

KW - Support vector machines

UR - http://www.scopus.com/inward/record.url?scp=80052661040&partnerID=8YFLogxK

U2 - 10.1145/2020408.2020519

DO - 10.1145/2020408.2020519

M3 - Article in proceedings

AN - SCOPUS:80052661040

SN - 9781450308137

SP - 708

EP - 716

BT - Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11

T2 - 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11

Y2 - 21 August 2011 through 24 August 2011

ER -
