Nearest neighbor density ratio estimation for large-scale applications in astronomy

Jan Kremer; Fabian Gieseke; Kim Steenstrup Pedersen; Christian Igel

doi:10.1016/j.ascom.2015.06.005

Department of Computer Science

9 Citations (Scopus)

Abstract

In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art.
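The nearest neighbor weight estimate mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `knn_density_ratio` and its parameters are hypothetical, the search is brute force rather than an efficient spatial index, the query point is excluded from its own neighbor list (an assumed convention), and `k` is fixed by hand rather than chosen by the cross-validation procedure the abstract proposes. The sketch assumes the common form of the estimator: for each training point, the radius to its k-th nearest training neighbor defines a ball, and the weight is the number of test points in that ball divided by k, scaled by the ratio of sample sizes.

```python
import math


def knn_density_ratio(train, test, k):
    """Estimate w(x) ~ p_test(x) / p_train(x) at every training point.

    train, test: sequences of equal-length coordinate tuples.
    k: neighborhood size (assumed fixed here; 1 <= k < len(train)).
    """
    n_train, n_test = len(train), len(test)
    if not 1 <= k < n_train:
        raise ValueError("k must satisfy 1 <= k < len(train)")
    if n_test == 0:
        raise ValueError("test sample must not be empty")

    weights = []
    for x in train:
        # Sorted distances from x to all training points; index 0 is x
        # itself (distance 0), so index k is its k-th nearest other point.
        d_train = sorted(math.dist(x, t) for t in train)
        radius = d_train[k]
        # Count the test points that fall inside the same ball.
        in_ball = sum(1 for u in test if math.dist(x, u) <= radius)
        weights.append((n_train / n_test) * in_ball / k)
    return weights


if __name__ == "__main__":
    # Toy 1-D data (made up for illustration): test points cluster near 3.
    train = [(0.0,), (1.0,), (3.0,)]
    test = [(0.5,), (2.5,), (2.9,), (3.2,)]
    print(knn_density_ratio(train, test, k=1))  # [0.75, 0.75, 2.25]
```

In the toy data the training point at 3.0 sits where the test sample is dense, so it receives the largest weight. For the large samples the abstract targets, the two brute-force loops would normally be replaced by a spatial index such as a k-d tree.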

Original language	English
Journal	Astronomy and Computing
Volume	12
Pages (from-to)	67-72
Number of pages	6
ISSN	2213-1337
DOI	https://doi.org/10.1016/j.ascom.2015.06.005
Status	Published - 1 Sep 2015

Citation formats

@article{d3a6b0ed5f5c4e3a82920bb7e6a224e8,
  title = "Nearest neighbor density ratio estimation for large-scale applications in astronomy",
  abstract = "In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art.",
  keywords = "Large-scale learning",
  author = "Jan Kremer and Fabian Gieseke and Pedersen, {Kim Steenstrup} and Christian Igel",
  year = "2015",
  month = sep,
  day = "1",
  doi = "10.1016/j.ascom.2015.06.005",
  language = "English",
  volume = "12",
  pages = "67--72",
  journal = "Astronomy and Computing",
  issn = "2213-1337",
  publisher = "Elsevier",
}

TY  - JOUR
T1  - Nearest neighbor density ratio estimation for large-scale applications in astronomy
AU  - Kremer, Jan
AU  - Gieseke, Fabian
AU  - Pedersen, Kim Steenstrup
AU  - Igel, Christian
PY  - 2015/9/1
Y1  - 2015/9/1
N2  - In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art.
AB  - In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art.
KW  - Large-scale learning
U2  - 10.1016/j.ascom.2015.06.005
DO  - 10.1016/j.ascom.2015.06.005
M3  - Journal article
SN  - 2213-1337
VL  - 12
SP  - 67
EP  - 72
JO  - Astronomy and Computing
JF  - Astronomy and Computing
ER  - 
