Selecting informative data for developing peptide-MHC binding predictors using a query by committee approach

Jens Kaae Christensen; Kasper Lamberth; Morten Nielsen; Claus Lundegaard; Peder Worning; Sanne Lise Lauemøller; Søren Buus; Søren Brunak; Ole Lund

doi:10.1162/089976603322518803

Department of Immunology and Microbiology

13 Citations (Scopus)

Abstract

Strategies for selecting informative data points for training prediction algorithms are important, particularly when data points are difficult and costly to obtain. A Query by Committee (QBC) training strategy for selecting new data points uses the disagreement between a committee of different algorithms to suggest the new data points that most rationally complement the existing data, that is, the most informative data points. In order to evaluate this QBC approach on a real-world problem, we compared strategies for selecting new data points. We trained neural network algorithms to obtain methods to predict the binding affinity of peptides binding to the MHC class I molecule, HLA-A2. We show that the QBC strategy leads to a higher performance than a baseline strategy where new data points are selected at random from a pool of available data. Most peptides bind HLA-A2 with low affinity, and, as expected, a strategy of selecting peptides that are predicted to have high binding affinities also leads to more accurate predictors than the baseline strategy. The QBC value is shown to correlate with the measured binding affinity. This demonstrates that the different predictors can easily learn whether a peptide will fail to bind, but often conflict in predicting whether a peptide binds. Using a carefully constructed computational setup, we demonstrate that selecting peptides with a high QBC value performs better than selecting peptides with a low QBC value, independently of binding affinity. When predictors are trained on a very limited set of data, they cannot be expected to disagree in a meaningful way, and we find a data limit below which the QBC strategy fails. Finally, it should be noted that data selection strategies similar to those used here might be of use in other settings in which generation of more data is a costly process.
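The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the disagreement measure (population variance of the committee's predictions), the function names `qbc_score` and `select_by_qbc`, and the three toy scoring rules standing in for trained neural networks are all assumptions made for illustration.

```python
import statistics
from typing import Callable, List, Sequence

# A committee member maps a peptide sequence to a predicted affinity score.
Predictor = Callable[[str], float]


def qbc_score(peptide: str, committee: Sequence[Predictor]) -> float:
    """Committee disagreement for one peptide.

    Assumption: disagreement is measured as the population variance of
    the members' predictions; the abstract does not specify the measure.
    """
    predictions = [predict(peptide) for predict in committee]
    return statistics.pvariance(predictions)


def select_by_qbc(
    pool: Sequence[str], committee: Sequence[Predictor], n_select: int
) -> List[str]:
    """Return the n_select pool peptides the committee disagrees on most."""
    ranked = sorted(pool, key=lambda p: qbc_score(p, committee), reverse=True)
    return ranked[:n_select]


# Hypothetical toy members (not trained networks): each looks at a
# different part of the peptide, so they can disagree.
def member_a(peptide: str) -> float:
    return 0.8 if peptide[1] == "L" else 0.1


def member_b(peptide: str) -> float:
    return 0.7 if peptide[-1] == "V" else 0.1


def member_c(peptide: str) -> float:
    return 0.9 if peptide[1] == "L" and peptide[-1] == "V" else 0.1


committee = [member_a, member_b, member_c]
pool = ["ALAKAAAAV", "GGGGGGGGG", "ALGGGGGGG", "GGGGGGGGV"]

# Peptides with the highest disagreement would be measured next and
# added to the training data.
selected = select_by_qbc(pool, committee, n_select=2)
print(selected)
```

In this toy pool the members agree on the peptide that none of them scores highly and nearly agree on the one they all score highly, so the two peptides on which they split are selected. A random baseline would instead draw `n_select` peptides uniformly from `pool`.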

Original language	English
Journal	Neural Computation
Volume	15
Issue number	12
Pages (from-to)	2931-2942
Number of pages	12
ISSN	0899-7667
DOIs	https://doi.org/10.1162/089976603322518803
Publication status	Published - 2003

Cite this

@article{1151e0d0ebcb11ddbf70000ea68e967b,
title = "Selecting informative data for developing peptide-MHC binding predictors using a query by committee approach",
abstract = "Strategies for selecting informative data points for training prediction algorithms are important, particularly when data points are difficult and costly to obtain. A Query by Committee (QBC) training strategy for selecting new data points uses the disagreement between a committee of different algorithms to suggest the new data points that most rationally complement the existing data, that is, the most informative data points. In order to evaluate this QBC approach on a real-world problem, we compared strategies for selecting new data points. We trained neural network algorithms to obtain methods to predict the binding affinity of peptides binding to the MHC class I molecule, HLA-A2. We show that the QBC strategy leads to a higher performance than a baseline strategy where new data points are selected at random from a pool of available data. Most peptides bind HLA-A2 with low affinity, and, as expected, a strategy of selecting peptides that are predicted to have high binding affinities also leads to more accurate predictors than the baseline strategy. The QBC value is shown to correlate with the measured binding affinity. This demonstrates that the different predictors can easily learn whether a peptide will fail to bind, but often conflict in predicting whether a peptide binds. Using a carefully constructed computational setup, we demonstrate that selecting peptides with a high QBC value performs better than selecting peptides with a low QBC value, independently of binding affinity. When predictors are trained on a very limited set of data, they cannot be expected to disagree in a meaningful way, and we find a data limit below which the QBC strategy fails. Finally, it should be noted that data selection strategies similar to those used here might be of use in other settings in which generation of more data is a costly process.",
author = "Christensen, {Jens Kaae} and Kasper Lamberth and Morten Nielsen and Claus Lundegaard and Peder Worning and Lauem{\o}ller, {Sanne Lise} and S{\o}ren Buus and S{\o}ren Brunak and Ole Lund",
note = "Keywords: Algorithms; Animals; Binding Sites; Drug Design; Epitopes; HLA-A2 Antigen; Histocompatibility Antigens Class I; Humans; Neural Networks (Computer); Peptides; Predictive Value of Tests; Protein Binding; Statistics as Topic; Vaccines",
year = "2003",
doi = "10.1162/089976603322518803",
language = "English",
volume = "15",
pages = "2931--2942",
journal = "Neural Computation",
issn = "0899-7667",
publisher = "MIT Press",
number = "12",
}

TY  - JOUR
T1  - Selecting informative data for developing peptide-MHC binding predictors using a query by committee approach
AU  - Christensen, Jens Kaae
AU  - Lamberth, Kasper
AU  - Nielsen, Morten
AU  - Lundegaard, Claus
AU  - Worning, Peder
AU  - Lauemøller, Sanne Lise
AU  - Buus, Søren
AU  - Brunak, Søren
AU  - Lund, Ole
N1  - Keywords: Algorithms; Animals; Binding Sites; Drug Design; Epitopes; HLA-A2 Antigen; Histocompatibility Antigens Class I; Humans; Neural Networks (Computer); Peptides; Predictive Value of Tests; Protein Binding; Statistics as Topic; Vaccines
PY  - 2003
Y1  - 2003
AB  - Strategies for selecting informative data points for training prediction algorithms are important, particularly when data points are difficult and costly to obtain. A Query by Committee (QBC) training strategy for selecting new data points uses the disagreement between a committee of different algorithms to suggest the new data points that most rationally complement the existing data, that is, the most informative data points. In order to evaluate this QBC approach on a real-world problem, we compared strategies for selecting new data points. We trained neural network algorithms to obtain methods to predict the binding affinity of peptides binding to the MHC class I molecule, HLA-A2. We show that the QBC strategy leads to a higher performance than a baseline strategy where new data points are selected at random from a pool of available data. Most peptides bind HLA-A2 with low affinity, and, as expected, a strategy of selecting peptides that are predicted to have high binding affinities also leads to more accurate predictors than the baseline strategy. The QBC value is shown to correlate with the measured binding affinity. This demonstrates that the different predictors can easily learn whether a peptide will fail to bind, but often conflict in predicting whether a peptide binds. Using a carefully constructed computational setup, we demonstrate that selecting peptides with a high QBC value performs better than selecting peptides with a low QBC value, independently of binding affinity. When predictors are trained on a very limited set of data, they cannot be expected to disagree in a meaningful way, and we find a data limit below which the QBC strategy fails. Finally, it should be noted that data selection strategies similar to those used here might be of use in other settings in which generation of more data is a costly process.
U2  - 10.1162/089976603322518803
DO  - 10.1162/089976603322518803
M3  - Journal article
C2  - 14629874
SN  - 0899-7667
VL  - 15
SP  - 2931
EP  - 2942
JO  - Neural Computation
JF  - Neural Computation
IS  - 12
ER  -
