Training big random forests with little resources

Fabian Gieseke; Christian Igel

doi:10.1145/3219819.3220124


2 Citations (Scopus)

Abstract

Without access to large compute clusters, building random forests on large datasets is still a challenging problem, in particular if fully-grown trees are desired. We propose a simple yet effective framework that allows the efficient construction of ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is a multi-level construction scheme that builds top trees for small random subsets of the available data and subsequently distributes all training instances to the top trees' leaves for further processing. While the scheme is conceptually simple, its overall efficiency depends crucially on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
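The two-level idea described in the abstract can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the dictionary-based tree, the exhaustive Gini split search, and all names (`build`, `build_two_level`, `subset_size`, `top_depth`, `n_feats`) are assumptions made for illustration, and the sketch ignores the out-of-core and implementation-level optimisations on which, according to the abstract, the efficiency depends.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rng, n_feats):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(y)
    for f in rng.sample(range(len(X[0])), n_feats):  # random feature subset
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(X, y, rng, n_feats, max_depth=None, depth=0):
    """Grow a tree; with max_depth=None it grows until no split helps."""
    split = None
    if len(set(y)) > 1 and (max_depth is None or depth < max_depth):
        split = best_split(X, y, rng, n_feats)
    if split is None:
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    f, t = split
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    return {
        "leaf": False, "f": f, "t": t,
        "left": build([X[i] for i in li], [y[i] for i in li],
                      rng, n_feats, max_depth, depth + 1),
        "right": build([X[i] for i in ri], [y[i] for i in ri],
                       rng, n_feats, max_depth, depth + 1),
    }

def find_leaf(node, row):
    """Follow the splits from the root down to the leaf that receives row."""
    while not node["leaf"]:
        node = node["left"] if row[node["f"]] <= node["t"] else node["right"]
    return node

def build_two_level(X, y, rng, subset_size, top_depth, n_feats):
    # Phase 1: build a shallow top tree on a small random subset.
    idx = rng.sample(range(len(X)), subset_size)
    top = build([X[i] for i in idx], [y[i] for i in idx],
                rng, n_feats, max_depth=top_depth)
    # Phase 2: distribute ALL training instances to the top tree's leaves.
    buckets = {}
    for i, row in enumerate(X):
        leaf = find_leaf(top, row)
        buckets.setdefault(id(leaf), (leaf, []))[1].append(i)
    # Phase 3: grow a bottom tree per leaf and graft it in place of the leaf.
    for leaf, members in buckets.values():
        sub = build([X[i] for i in members], [y[i] for i in members],
                    rng, n_feats)
        leaf.clear()
        leaf.update(sub)
    return top

def predict(tree, row):
    """Predict the class label of a single instance."""
    return find_leaf(tree, row)["label"]
```

Calling `build_two_level` several times with different random subsets and combining the `predict` outputs by majority vote would give a small ensemble. In this sketch each bottom tree only ever sees the instances routed to its leaf, which is what would allow the leaves to be processed one at a time when the full dataset does not fit in memory.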

Original language	English
Title of host publication	KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher	ACM Association for Computing Machinery
Publication date	2018
Pages	1445-1454
ISBN (Print)	9781450355520
DOI	https://doi.org/10.1145/3219819.3220124
Status	Published - 2018
Event	24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018 - London, United Kingdom Duration: 19 Aug 2018 → 23 Aug 2018

Conference

Conference	24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018
Country/Territory	United Kingdom
City	London
Period	19/08/2018 → 23/08/2018
Sponsor	ACM SIGKDD, ACM SIGMOD

Access to document

10.1145/3219819.3220124

Other files and links

Link to publication in Scopus

Citation formats

Gieseke, F & Igel, C 2018, Training big random forests with little resources. in KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Association for Computing Machinery, pp. 1445-1454, 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018, London, United Kingdom, 19/08/2018. https://doi.org/10.1145/3219819.3220124

@inproceedings{2ae9626251144a8e8868807110c844ad,

title = "Training big random forests with little resources",

abstract = "Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.",

keywords = "Classification, Ensemble methods, Large-scale data analytics, Machine learning, Random forests, Regression trees",

author = "Fabian Gieseke and Christian Igel",

year = "2018",

doi = "10.1145/3219819.3220124",

language = "English",

isbn = "9781450355520",

pages = "1445--1454",

booktitle = "KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

publisher = "ACM Association for Computing Machinery",

note = "24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018 ; Conference date: 19-08-2018 Through 23-08-2018",

}

TY - GEN

T1 - Training big random forests with little resources

AU - Gieseke, Fabian

AU - Igel, Christian

PY - 2018

Y1 - 2018

N2 - Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.


KW - Classification

KW - Ensemble methods

KW - Large-scale data analytics

KW - Machine learning

KW - Random forests

KW - Regression trees

UR - http://www.scopus.com/inward/record.url?scp=85051471641&partnerID=8YFLogxK

U2 - 10.1145/3219819.3220124

DO - 10.1145/3219819.3220124

M3 - Article in proceedings

AN - SCOPUS:85051471641

SN - 9781450355520

SP - 1445

EP - 1454

BT - KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

PB - ACM Association for Computing Machinery

T2 - 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018

Y2 - 19 August 2018 through 23 August 2018

ER -
