Training big random forests with little resources

Fabian Gieseke; Christian Igel

doi:10.1145/3219819.3220124

2 Citations (Scopus)

Abstract

Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that makes it possible to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
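
The multi-level construction scheme described in the abstract can be illustrated with a small, in-memory sketch. This is not the authors' implementation: the function names, the `top_depth` and `subset_size` parameters, and the plain Gini-based tree builder are all assumptions made for illustration, and the sketch omits the per-split feature subsampling and the out-of-core data handling that a real large-scale random forest would need.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, max_depth=None):
    """Grow a plain decision tree; max_depth=None means fully grown."""
    if len(set(y)) == 1 or max_depth == 0:
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Thresholds sit on observed values, so both sides are non-empty.
        for t in values[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # all rows identical, no split possible
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    _, f, t = best
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    depth = None if max_depth is None else max_depth - 1
    return {
        "leaf": False, "feature": f, "threshold": t,
        "left": build_tree([X[i] for i in li], [y[i] for i in li], depth),
        "right": build_tree([X[i] for i in ri], [y[i] for i in ri], depth),
    }


def find_leaf(tree, row):
    """Follow the splits until a leaf node is reached."""
    while not tree["leaf"]:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree


def build_two_level_tree(X, y, subset_size, top_depth, rng):
    """Hypothetical two-level construction: top tree, then bottom trees."""
    # Phase 1: build a shallow top tree on a small random subset.
    idx = rng.sample(range(len(X)), subset_size)
    top = build_tree([X[i] for i in idx], [y[i] for i in idx], top_depth)
    # Phase 2: distribute ALL training instances to the top tree's leaves.
    buckets = {}
    for i, row in enumerate(X):
        leaf = find_leaf(top, row)
        buckets.setdefault(id(leaf), (leaf, []))[1].append(i)
    # Phase 3: grow a full bottom tree per leaf and graft it in place.
    for leaf, members in buckets.values():
        bottom = build_tree([X[i] for i in members], [y[i] for i in members])
        leaf.clear()
        leaf.update(bottom)
    return top


def build_forest(X, y, n_trees, subset_size, top_depth, seed=0):
    """Ensemble of two-level trees, each with its own random subset."""
    rng = random.Random(seed)
    return [build_two_level_tree(X, y, subset_size, top_depth, rng)
            for _ in range(n_trees)]


def predict_tree(tree, row):
    return find_leaf(tree, row)["label"]


def predict_forest(forest, row):
    """Majority vote over the trees of the ensemble."""
    votes = [predict_tree(tree, row) for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```

In this sketch, only the top tree and one leaf bucket at a time need to be processed together, which is the property that would let a real implementation keep the working set small. How the buckets are stored and streamed is exactly the kind of implementation detail the abstract says the overall efficiency depends on, and it is not modelled here.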

Original language	English
Title of host publication	KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher	ACM Association for Computing Machinery
Publication date	2018
Pages	1445-1454
ISBN (Print)	9781450355520
DOIs	https://doi.org/10.1145/3219819.3220124
Publication status	Published - 2018
Event	24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018 - London, United Kingdom
Duration	19 Aug 2018 → 23 Aug 2018

Conference

Conference	24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018
Country/Territory	United Kingdom
City	London
Period	19/08/2018 → 23/08/2018
Sponsor	ACM SIGKDD, ACM SIGMOD

Keywords

Classification
Ensemble methods
Large-scale data analytics
Machine learning
Random forests
Regression trees

Access to Document

10.1145/3219819.3220124

Cite this

Training big random forests with little resources. / Gieseke, Fabian ; Igel, Christian.
KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Association for Computing Machinery, 2018. p. 1445-1454.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Gieseke, F & Igel, C 2018, Training big random forests with little resources. in KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Association for Computing Machinery, pp. 1445-1454, 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018, London, United Kingdom, 19/08/2018. https://doi.org/10.1145/3219819.3220124

@inproceedings{2ae9626251144a8e8868807110c844ad,
  title = "Training big random forests with little resources",
  abstract = "Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.",
  keywords = "Classification, Ensemble methods, Large-scale data analytics, Machine learning, Random forests, Regression trees",
  author = "Fabian Gieseke and Christian Igel",
  year = "2018",
  doi = "10.1145/3219819.3220124",
  language = "English",
  isbn = "9781450355520",
  pages = "1445--1454",
  booktitle = "KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",
  publisher = "ACM Association for Computing Machinery",
  note = "24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018 ; Conference date: 19-08-2018 Through 23-08-2018",
}

TY - GEN
T1 - Training big random forests with little resources
AU - Gieseke, Fabian
AU - Igel, Christian
PY - 2018
Y1 - 2018
N2 - Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
AB - Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
KW - Classification
KW - Ensemble methods
KW - Large-scale data analytics
KW - Machine learning
KW - Random forests
KW - Regression trees
UR - http://www.scopus.com/inward/record.url?scp=85051471641&partnerID=8YFLogxK
U2 - 10.1145/3219819.3220124
DO - 10.1145/3219819.3220124
M3 - Article in proceedings
AN - SCOPUS:85051471641
SN - 9781450355520
SP - 1445
EP - 1454
BT - KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - ACM Association for Computing Machinery
T2 - 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018
Y2 - 19 August 2018 through 23 August 2018
ER -
