The validation and assessment of machine learning: a game of prediction from high-dimensional data

Tune H Pers; Anders Albrechtsen; Claus Holst; Thorkild I A Sørensen; Thomas A Gerds

doi:10.1371/journal.pone.0006287

21 Citations (Scopus)

Abstract

In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.
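
The comparison scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a binary response, uses the Brier score as the strictly proper scoring rule, and evaluates every strategy on the same bootstrap splits (train on a sample drawn with replacement, test on the subjects left out). The simulated data, the two toy strategies, and all function names are hypothetical stand-ins for the support vector machine, LASSO, and random forest players.

```python
import random
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def fit_prevalence(train):
    """Null strategy (hypothetical): ignore the predictor, predict the training prevalence."""
    prevalence = mean(y for _, y in train)
    return lambda x: prevalence


def fit_threshold(train):
    """Toy strategy (hypothetical): split at the training mean of x, predict the outcome rate per side."""
    cut = mean(x for x, _ in train)
    overall = mean(y for _, y in train)
    low = [y for x, y in train if x <= cut]
    high = [y for x, y in train if x > cut]
    p_low = mean(low) if low else overall
    p_high = mean(high) if high else overall
    return lambda x: p_high if x > cut else p_low


def bootstrap_cv(data, strategies, n_boot=200, seed=1):
    """Average out-of-bag Brier score per strategy, using identical splits for all strategies."""
    rng = random.Random(seed)
    n = len(data)
    scores = {name: [] for name in strategies}
    for _ in range(n_boot):
        drawn = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
        in_bag = set(drawn)
        test = [data[i] for i in range(n) if i not in in_bag]  # subjects not drawn
        if not test:
            continue
        train = [data[i] for i in drawn]
        outcomes = [y for _, y in test]
        for name, fit in strategies.items():
            predict = fit(train)
            probs = [predict(x) for x, _ in test]
            scores[name].append(brier_score(probs, outcomes))
    return {name: mean(values) for name, values in scores.items()}


# Simulated data (an assumption for illustration): one predictor, noisy binary response.
sim = random.Random(0)
data = []
for _ in range(60):
    x = sim.gauss(0, 1)
    y = 1 if x + sim.gauss(0, 0.5) > 0 else 0
    data.append((x, y))

results = bootstrap_cv(data, {"prevalence": fit_prevalence, "threshold": fit_threshold})
for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.3f}")  # lower Brier score is better
```

Because every strategy is fitted and scored on the same resampled splits, differences in the averaged score reflect the strategies rather than the luck of a particular split. Including a null strategy gives a benchmark that any useful model should beat.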

Original language	English
Journal	PLoS ONE
Volume	4
Issue number	8
Pages (from-to)	e6287
ISSN	1932-6203
DOIs	https://doi.org/10.1371/journal.pone.0006287
Publication status	Published - 2009

Cite this

@article{936358407e0211df928f000ea68e967b,

title = "The validation and assessment of machine learning: a game of prediction from high-dimensional data",

abstract = "In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.",

author = "Pers, {Tune H} and Albrechtsen, Anders and Holst, Claus and S{\o}rensen, {Thorkild I A} and Gerds, {Thomas A}",

note = "Keywords: Computers; Humans; Learning; Models, Theoretical",

year = "2009",

doi = "10.1371/journal.pone.0006287",

language = "English",

volume = "4",

pages = "e6287",

journal = "PLoS ONE",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "8",

}

TY - JOUR

T1 - The validation and assessment of machine learning: a game of prediction from high-dimensional data

AU - Pers, Tune H

AU - Albrechtsen, Anders

AU - Holst, Claus

AU - Sørensen, Thorkild I A

AU - Gerds, Thomas A

N1 - Keywords: Computers; Humans; Learning; Models, Theoretical

PY - 2009

Y1 - 2009

AB - In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.

U2 - 10.1371/journal.pone.0006287

DO - 10.1371/journal.pone.0006287

M3 - Journal article

C2 - 19652722

SN - 1932-6203

VL - 4

SP - e6287

JO - PLoS ONE

JF - PLoS ONE

IS - 8

ER -
