Nearest neighbor density ratio estimation for large-scale applications in astronomy

9 Citations (Scopus)

Abstract

In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are already available in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state of the art.
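
As a rough illustration of the re-weighting idea, the following Python sketch implements one common form of a nearest neighbor density ratio estimate. For each training point it takes the distance to the K-th nearest unlabeled (test) point, counts the training points within that radius, and scales the ratio of the two counts by the sample sizes. The function name `knn_density_ratio`, the brute-force search, and the fixed neighborhood size are assumptions made for illustration; the estimator in the paper and its cross-validated choice of neighborhood size may differ in detail.

```python
import math


def knn_density_ratio(train, test, k_neighbors):
    """Estimate w(x) ~ p_test(x) / p_train(x) at every training point.

    train, test: sequences of equal-length coordinate tuples.
    k_neighbors: neighborhood size K (fixed here; hypothetical parameter name).
    """
    n_train, n_test = len(train), len(test)
    weights = []
    for x in train:
        # Radius of the ball around x that reaches its K-th nearest test point.
        test_dists = sorted(math.dist(x, z) for z in test)
        radius = test_dists[k_neighbors - 1]
        # Training points inside the same ball; x itself is at distance 0,
        # so the count is always at least 1.
        count = sum(1 for y in train if math.dist(x, y) <= radius)
        weights.append((n_train / n_test) * k_neighbors / count)
    return weights


# Toy 1-D example: the test points cluster around 2 and 3, so training
# points in that region should receive larger weights.
train_points = [(0.0,), (1.0,), (2.0,), (3.0,)]
test_points = [(2.0,), (2.1,), (3.0,), (3.1,)]
example_weights = knn_density_ratio(train_points, test_points, k_neighbors=2)
```

The brute-force loops are quadratic in the sample size and only suit small examples; for the large samples discussed in the abstract, a spatial index such as a k-d tree would be the natural replacement for both neighbor searches.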

Original language: English
Journal: Astronomy and Computing
Volume: 12
Pages (from-to): 67-72
Number of pages: 6
ISSN: 2213-1337
Publication status: Published - 1 Sept 2015

Keywords

  • Large-scale learning
