BIRCH – Wikipedia

Clustering using tree-based data aggregation

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets.^[1] With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm.^[2] An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.

Its inventors claim BIRCH to be the “first clustering algorithm proposed in the database area to handle ‘noise’ (data points that are not part of the underlying pattern) effectively”,^[1] beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.^[3]

Problem with previous methods

Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case in which a data set was too large to fit in main memory. As a result, maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations incurred considerable overhead. Furthermore, most of BIRCH’s predecessors inspect all data points (or all currently existing clusters) equally for each ‘clustering decision’ and do not perform heuristic weighting based on the distance between these data points.

Advantages with BIRCH

It is local in that each clustering decision is made without scanning all data points and currently existing clusters.
It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important.
It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs.
It is also an incremental method that does not require the whole data set in advance.

Algorithm

The BIRCH algorithm takes as input a set of $N$ data points, represented as real-valued vectors, and a desired number of clusters $K$. It operates in four phases, the second and fourth of which are optional.

The first phase builds a clustering feature ($CF$) tree out of the data points, a height-balanced tree data structure, defined as follows:

Given a set of $N$ $d$-dimensional data points, the clustering feature $CF$ of the set is defined as the triple $CF = (N, \overrightarrow{LS}, SS)$, where
- $\overrightarrow{LS} = \sum_{i=1}^{N} \overrightarrow{X_i}$ is the linear sum.
- $SS = \sum_{i=1}^{N} (\overrightarrow{X_i})^2$ is the square sum of data points.
Clustering features are organized in a CF tree, a height-balanced tree with two parameters: branching factor $B$ and threshold $T$.
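The definition above can be illustrated with a minimal Python sketch. It assumes that $SS$ is stored as a single scalar (the sum of squared norms of the points), and the class name `ClusteringFeature` and its methods are hypothetical names chosen for this example. The `merge` method relies on the additivity of clustering features: the CF of the union of two disjoint point sets is the component-wise sum of their CFs.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """CF = (N, LS, SS) for a set of d-dimensional points (illustrative only)."""
    n: int            # number of points N
    ls: List[float]   # linear sum LS, one entry per dimension
    ss: float         # square sum SS, assumed here to be a scalar

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "ClusteringFeature":
        dim = len(points[0])
        ls = [sum(p[j] for p in points) for j in range(dim)]
        ss = sum(x * x for p in points for x in p)
        return cls(len(points), ls, ss)

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Additivity: summing the components gives the CF of the union.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )


a = ClusteringFeature.from_points([(1.0, 2.0), (3.0, 4.0)])
b = ClusteringFeature.from_points([(5.0, 6.0)])
merged = a.merge(b)
print(merged)
```

Because only three quantities are kept per subcluster, two subclusters can be combined without revisiting the original points.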

In the second step, the algorithm scans all the leaf entries in the initial $CF$ tree to rebuild a smaller $CF$ tree, while removing outliers and grouping crowded subclusters into larger ones. This step is marked optional in the original presentation of BIRCH.

In step three, an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their $CF$ vectors. It also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters. After this step, a set of clusters is obtained that captures the major distribution patterns in the data. However, there might exist minor and localized inaccuracies, which can be handled by an optional step 4. In step 4, the centroids of the clusters produced in step 3 are used as seeds, and the data points are redistributed to their closest seeds to obtain a new set of clusters. Step 4 also provides the option of discarding outliers: a point that is too far from its closest seed can be treated as an outlier.
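The redistribution in step 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original implementation: the function name `redistribute`, the `max_distance` cutoff, the use of Euclidean distance, and the label `-1` for outliers are all choices made for this example.

```python
import math
from typing import List, Optional, Sequence

Point = Sequence[float]


def redistribute(
    points: Sequence[Point],
    seeds: Sequence[Point],
    max_distance: Optional[float] = None,
) -> List[int]:
    """Assign each point to its closest seed; the label -1 marks an outlier."""
    labels = []
    for p in points:
        dists = [math.dist(p, s) for s in seeds]  # Euclidean distance (assumed)
        best = min(range(len(seeds)), key=dists.__getitem__)
        if max_distance is not None and dists[best] > max_distance:
            labels.append(-1)  # too far from its closest seed
        else:
            labels.append(best)
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
seeds = [(0.0, 0.5), (10.0, 10.5)]  # stand-ins for the step 3 centroids
labels = redistribute(points, seeds, max_distance=5.0)
print(labels)
```

Unlike the earlier phases, this pass needs the original data points again, since it assigns individual points rather than summarized subclusters.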

Calculations with the clustering features

Given only the clustering feature $CF = (N, \overrightarrow{LS}, SS)$, the same measures can be calculated without knowledge of the underlying actual values.

In multidimensional cases the square root should be replaced with a suitable norm.
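As a hedged illustration, the sketch below computes two such measures, the centroid and the radius, from a clustering feature alone. It assumes a scalar $SS$ and takes the radius to be the root of the mean squared distance to the centroid, i.e. $\sqrt{SS/N - \lVert \overrightarrow{LS}/N \rVert^2}$. The function names are hypothetical, and the clamp to zero is a defensive choice for this example rather than part of the definition.

```python
import math
from typing import List, Sequence


def centroid(n: int, ls: Sequence[float]) -> List[float]:
    """Centroid LS / N."""
    return [v / n for v in ls]


def radius(n: int, ls: Sequence[float], ss: float) -> float:
    """Root mean squared distance of the points to the centroid."""
    c = centroid(n, ls)
    variance = ss / n - sum(v * v for v in c)
    return math.sqrt(max(variance, 0.0))  # clamp guards against negative rounding error


# CF of the points (1, 2), (3, 4), (5, 6): N = 3, LS = (9, 12), SS = 91
n, ls, ss = 3, [9.0, 12.0], 91.0
c = centroid(n, ls)
r = radius(n, ls, ss)
print(c, r)
```

For these three points the squared distances to the centroid (3, 4) are 8, 0 and 8, so the radius equals $\sqrt{16/3}$, which the CF-based computation reproduces without access to the points.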

Numerical issues in BIRCH clustering features

Unfortunately, there are numerical issues associated with the use of the term $SS$ in BIRCH. When subtracting $\frac{SS}{N} - \big(\frac{\vec{LS}}{N}\big)^2$, or similar terms in the other distances such as $D_2$, catastrophic cancellation can occur and yield poor precision. In some cases it can even cause the result to be negative, so that the square root becomes undefined.^[2] This can be resolved by using BETULA cluster features $CF = (N, \mu, S)$ instead, which store the count $N$, the mean $\mu$, and the sum of squared deviations $S$, and which are based on numerically more reliable online algorithms to calculate variance. For these features, a similar additivity theorem holds. When storing a vector or a matrix, respectively, for the squared deviations, the resulting BIRCH CF-tree can also be used to accelerate Gaussian mixture modeling with the expectation–maximization algorithm, besides k-means clustering and hierarchical agglomerative clustering.
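The cancellation problem can be reproduced with a small one-dimensional Python sketch. It compares the variance obtained from BIRCH-style sums with one obtained from a feature of the form (N, mean, S). The stable version uses Welford's online update and the standard pairwise formula for combining means and sums of squared deviations; both are assumptions made for illustration, and the exact update rules in BETULA may differ in detail. All function names are hypothetical.

```python
from typing import Iterable, Tuple

Feature = Tuple[int, float, float]  # (N, mean, S)


def naive_variance(values: Iterable[float]) -> float:
    """Variance from BIRCH-style sums: SS/N - (LS/N)^2."""
    n, ls, ss = 0, 0.0, 0.0
    for x in values:
        n += 1
        ls += x
        ss += x * x
    return ss / n - (ls / n) ** 2


def stable_feature(values: Iterable[float]) -> Feature:
    """(N, mean, S) via Welford's online update; S is the sum of squared deviations."""
    n, mean, s = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        s += delta * (x - mean)
    return n, mean, s


def merge(a: Feature, b: Feature) -> Feature:
    """Combine two (N, mean, S) features without revisiting the data."""
    (n1, m1, s1), (n2, m2, s2) = a, b
    n = n1 + n2
    delta = m2 - m1
    return n, m1 + delta * n2 / n, s1 + s2 + delta * delta * n1 * n2 / n


# Small spread around a large offset: the true variance is about 0.02 / 3.
data = [1e9 + 0.1, 1e9 + 0.2, 1e9 + 0.3]
n, mean, s = stable_feature(data)
merged = merge(stable_feature(data[:2]), stable_feature(data[2:]))
print("naive :", naive_variance(data))
print("stable:", s / n)
print("merged:", merged[2] / merged[0])
```

With values near $10^9$, the squares are near $10^{18}$, where adjacent double-precision numbers are more than 100 apart, so the naive difference cannot resolve a variance of roughly 0.0067. The deviation-based feature works with small differences from the mean and keeps that precision, and the `merge` function shows that such features can still be combined additively.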

[1] Zhang, T.; Ramakrishnan, R.; Livny, M. (1996). “BIRCH: an efficient data clustering method for very large databases”. Proceedings of the 1996 ACM SIGMOD international conference on Management of data – SIGMOD ’96. pp. 103–114. doi:10.1145/233269.233324.
[2] Lang, Andreas; Schubert, Erich (2020). “BETULA: Numerically Stable CF-Trees for BIRCH Clustering”. Similarity Search and Applications. pp. 281–296. arXiv:2006.12881. doi:10.1007/978-3-030-60936-8_22. ISBN 978-3-030-60935-1. S2CID 219980434. Retrieved 2021-01-16.
[3] “2006 SIGMOD Test of Time Award”. Archived from the original on 2010-05-23.

