Hierarchical Clustering: Objective Functions and Algorithms

Vincent Cohen-addad; Varun Kanade; Frederik Mallmann-trenn; Claire Mathieu

doi:10.1137/1.9781611975031.26

Hierarchical Clustering: Objective Functions and Algorithms

Vincent Cohen-addad, Varun Kanade, Frederik Mallmann-trenn, Claire Mathieu

Department of Computer Science

29 Citations (Scopus)

Abstract

Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, [19] framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a 'good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties, such as in order to achieve optimal cost, disconnected components must be separated first and that in 'structureless' graphs, i.e., cliques, all clusterings achieve the same cost. We take an axiomatic approach to defining 'good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions (that includes the one introduced by Dasgupta) that have the property that when the input admits a 'natural' ground-truth hierarchical clustering, the groundtruth clustering has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better and faster algorithms for hierarchical clustering. For similarity-based hierarchical clustering, [19] showed that a simple recursive sparsest-cut based approach achieves an Oplog3-2 nqapproximation on worst-case inputs. We give a more refined analysis of the algorithm and show that it in fact achieves an Op ? log nq-approximation1. This improves upon the LP-based Oplog nq-approximation of [33]. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approximation, and provide a simple and better algorithm that gives a factor 3-2 approximation. This aims at explaining the success of these heuristics in practice. Finally, we consider a 'beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function also has desirable properties for these inputs and we provide a simple algorithm that for graphs generated according to this model yields a 1 + o(1) factor approximation.

Original language	English
Title of host publication	Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms
Editors	Artur Czumaj
Publisher	Society for Industrial and Applied Mathematics
Publication date	2 Jan 2018
Pages	378-397
ISBN (Electronic)	978-1-61197-503-1
DOIs	https://doi.org/10.1137/1.9781611975031.26
Publication status	Published - 2 Jan 2018
Event	29th Annual ACM-SIAM Symposium on Discrete Algorithms - New Orleans, United States Duration: 7 Jan 2018 → 10 Jan 2018 Conference number: 29

Conference

Conference	29th Annual ACM-SIAM Symposium on Discrete Algorithms
Number	29
Country/Territory	United States
City	New Orleans
Period	07/01/2018 → 10/01/2018

Access to Document

10.1137/1.9781611975031.26

1.9781611975031.26Final published version, 548 KB

Cite this

Hierarchical Clustering : Objective Functions and Algorithms. / Cohen-addad, Vincent; Kanade, Varun; Mallmann-trenn, Frederik et al.

Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. ed. / Artur Czumaj. Society for Industrial and Applied Mathematics, 2018. p. 378-397.

Research output: Chapter in Book/Report/Conference proceeding › Book chapter › Research › peer-review

Cohen-addad, V, Kanade, V, Mallmann-trenn, F & Mathieu, C 2018, Hierarchical Clustering: Objective Functions and Algorithms. in A Czumaj (ed.), Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, pp. 378-397, 29th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, United States, 07/01/2018. https://doi.org/10.1137/1.9781611975031.26

@inbook{b08b5aac0cb0475e8f1ac1af76283ac2,

title = "Hierarchical Clustering: Objective Functions and Algorithms",

abstract = "Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, [19] framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a 'good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties, such as in order to achieve optimal cost, disconnected components must be separated first and that in 'structureless' graphs, i.e., cliques, all clusterings achieve the same cost. We take an axiomatic approach to defining 'good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions (that includes the one introduced by Dasgupta) that have the property that when the input admits a 'natural' ground-truth hierarchical clustering, the groundtruth clustering has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better and faster algorithms for hierarchical clustering. For similarity-based hierarchical clustering, [19] showed that a simple recursive sparsest-cut based approach achieves an Oplog3-2 nqapproximation on worst-case inputs. We give a more refined analysis of the algorithm and show that it in fact achieves an Op ? log nq-approximation1. This improves upon the LP-based Oplog nq-approximation of [33]. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approximation, and provide a simple and better algorithm that gives a factor 3-2 approximation. This aims at explaining the success of these heuristics in practice. Finally, we consider a 'beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function also has desirable properties for these inputs and we provide a simple algorithm that for graphs generated according to this model yields a 1 + o(1) factor approximation.",

author = "Vincent Cohen-addad and Varun Kanade and Frederik Mallmann-trenn and Claire Mathieu",

year = "2018",

month = jan,

day = "2",

doi = "10.1137/1.9781611975031.26",

language = "English",

pages = "378--397",

editor = "Artur Czumaj",

booktitle = "Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms",

publisher = "Society for Industrial and Applied Mathematics",

address = "United States",

note = "29th Annual ACM-SIAM Symposium on Discrete Algorithms ; Conference date: 07-01-2018 Through 10-01-2018",

}

TY - CHAP

T1 - Hierarchical Clustering

T2 - 29th Annual ACM-SIAM Symposium on Discrete Algorithms

AU - Cohen-addad, Vincent

AU - Kanade, Varun

AU - Mallmann-trenn, Frederik

AU - Mathieu, Claire

N1 - Conference code: 29

PY - 2018/1/2

Y1 - 2018/1/2

N2 - Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, [19] framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a 'good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties, such as in order to achieve optimal cost, disconnected components must be separated first and that in 'structureless' graphs, i.e., cliques, all clusterings achieve the same cost. We take an axiomatic approach to defining 'good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions (that includes the one introduced by Dasgupta) that have the property that when the input admits a 'natural' ground-truth hierarchical clustering, the groundtruth clustering has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better and faster algorithms for hierarchical clustering. For similarity-based hierarchical clustering, [19] showed that a simple recursive sparsest-cut based approach achieves an Oplog3-2 nqapproximation on worst-case inputs. We give a more refined analysis of the algorithm and show that it in fact achieves an Op ? log nq-approximation1. This improves upon the LP-based Oplog nq-approximation of [33]. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approximation, and provide a simple and better algorithm that gives a factor 3-2 approximation. This aims at explaining the success of these heuristics in practice. Finally, we consider a 'beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function also has desirable properties for these inputs and we provide a simple algorithm that for graphs generated according to this model yields a 1 + o(1) factor approximation.

AB - Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, [19] framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a 'good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties, such as in order to achieve optimal cost, disconnected components must be separated first and that in 'structureless' graphs, i.e., cliques, all clusterings achieve the same cost. We take an axiomatic approach to defining 'good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions (that includes the one introduced by Dasgupta) that have the property that when the input admits a 'natural' ground-truth hierarchical clustering, the groundtruth clustering has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better and faster algorithms for hierarchical clustering. For similarity-based hierarchical clustering, [19] showed that a simple recursive sparsest-cut based approach achieves an Oplog3-2 nqapproximation on worst-case inputs. We give a more refined analysis of the algorithm and show that it in fact achieves an Op ? log nq-approximation1. This improves upon the LP-based Oplog nq-approximation of [33]. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approximation, and provide a simple and better algorithm that gives a factor 3-2 approximation. This aims at explaining the success of these heuristics in practice. Finally, we consider a 'beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function also has desirable properties for these inputs and we provide a simple algorithm that for graphs generated according to this model yields a 1 + o(1) factor approximation.

U2 - 10.1137/1.9781611975031.26

DO - 10.1137/1.9781611975031.26

M3 - Book chapter

SP - 378

EP - 397

BT - Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms

A2 - Czumaj, Artur

PB - Society for Industrial and Applied Mathematics

Y2 - 7 January 2018 through 10 January 2018

ER -

Hierarchical Clustering: Objective Functions and Algorithms

Abstract

Conference

Access to Document

Fingerprint

Cite this