Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

Yohan Kim; John Sidney; Søren Buus; Alessandro Sette; Morten Nielsen; Bjoern Peters

doi:10.1186/1471-2105-15-241

46 Citations (Scopus)

Abstract

BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set.

RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.

CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
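The two evaluation schemes contrasted in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: peptides are assumed to carry binary binder/non-binder labels, performance is assumed to be summarized as an AUC pooled over all cross-validation folds, and the names `k_fold_splits`, `auc`, `cross_validated_auc`, `blind_auc` and the `fit` callback interface are hypothetical.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical interfaces: a fitted model scores peptides; `fit` trains one.
Model = Callable[[Sequence[str]], List[float]]
Fit = Callable[[Sequence[str], Sequence[int]], Model]


def k_fold_splits(n: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random binder outscores a random non-binder (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(fit: Fit, peptides: Sequence[str],
                        labels: Sequence[int], k: int = 5) -> float:
    """Score every peptide with a model that never saw it, then pool the scores."""
    scores = [0.0] * len(peptides)
    for train, test in k_fold_splits(len(peptides), k):
        model = fit([peptides[i] for i in train], [labels[i] for i in train])
        for i, s in zip(test, model([peptides[i] for i in test])):
            scores[i] = s
    return auc(scores, labels)


def blind_auc(fit: Fit, peptides: Sequence[str], labels: Sequence[int],
              blind_peptides: Sequence[str], blind_labels: Sequence[int]) -> float:
    """Train on the full benchmark set, evaluate on separately generated data."""
    model = fit(peptides, labels)
    return auc(model(blind_peptides), blind_labels)
```

Comparing `cross_validated_auc` on a benchmark set with `blind_auc` on later data gives the kind of gap the study reports; how large that gap is depends on the size and diversity of both sets.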

Original language	English
Journal	BMC Bioinformatics
Volume	15
Pages (from-to)	241
Number of pages	1
ISSN	1471-2105
DOIs	https://doi.org/10.1186/1471-2105-15-241
Publication status	Published - 14 Jul 2014

Keywords

Alleles
Animals
Benchmarking
Computational Biology
Epitopes
HLA Antigens
Humans
Mice
Oligopeptides
Protein Binding
Reproducibility of Results

Access to Document

10.1186/1471-2105-15-241

Cite this

@article{c7b5e5dcf1c04079bfa53a339a6da771,

title = "Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions",

abstract = "BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set. RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates. CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.",

keywords = "Alleles, Animals, Benchmarking, Computational Biology, Epitopes, HLA Antigens, Humans, Mice, Oligopeptides, Protein Binding, Reproducibility of Results",

author = "Yohan Kim and John Sidney and S{\o}ren Buus and Alessandro Sette and Morten Nielsen and Bjoern Peters",

year = "2014",

month = jul,

day = "14",

doi = "10.1186/1471-2105-15-241",

language = "English",

volume = "15",

pages = "241",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central Ltd.",

}

TY - JOUR

T1 - Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

AU - Kim, Yohan

AU - Sidney, John

AU - Buus, Søren

AU - Sette, Alessandro

AU - Nielsen, Morten

AU - Peters, Bjoern

PY - 2014/7/14

Y1 - 2014/7/14

N2 - BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set. RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates. CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

AB - BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set. RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates. CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

KW - Alleles

KW - Animals

KW - Benchmarking

KW - Computational Biology

KW - Epitopes

KW - HLA Antigens

KW - Humans

KW - Mice

KW - Oligopeptides

KW - Protein Binding

KW - Reproducibility of Results

U2 - 10.1186/1471-2105-15-241

DO - 10.1186/1471-2105-15-241

M3 - Journal article

C2 - 25017736

SN - 1471-2105

VL - 15

SP - 241

JO - BMC Bioinformatics

JF - BMC Bioinformatics

ER -
