Review of Classic Clustering Algorithms
Abstract
With the rapid development of the Internet, user data has grown explosively. Clustering algorithms play a very important role in mining user preference features from this ever-growing mass of data. This paper surveys the recent research status of classical clustering algorithms and briefly introduces prototype-based, hierarchical, density-based, and model-based clustering methods. Several typical clustering algorithms and well-known data sets are selected, and the clustering behavior of the different algorithms on the same data sets is compared and analyzed.
Keywords: Data mining; Clustering; Algorithm
Introduction:
Cluster analysis has a long history. Its research results not only provide a solid theoretical basis for the development of big data science, but also play an important role in pattern recognition, machine learning, image processing, and other fields. It is worth mentioning that cluster analysis is also widely applied in biology, psychology, archaeology, geology, geography, and marketing.
Clustering Concept and Clustering Process
Clustering Concept
So far, there is no generally accepted definition of clustering in academia. The earliest definition was proposed by Everitt [1] in 1974: the entities within a cluster are similar, the points in a cluster are relatively compact in space, and the distance between any two objects in the same cluster is smaller than the distance between any two objects in different clusters. A cluster can be described intuitively as a connected region of space containing a relatively dense set of points. In practical applications, however, clustering quality cannot be captured precisely by this definition; the most suitable definition often depends on the nature of the objects being clustered and on the goal of the clustering task.
Clustering Process [2]
- Data preparation: Standardization and dimensionality reduction of data.
- Feature selection: Clustering features are selected and stored in vectors.
- Feature extraction: Clustering characteristics are processed to form clustering indicators.
- Clustering: The similarity between objects is measured according to the extracted features, and the data are then grouped into clusters.
- Results assessment: Evaluate the clustering results produced by the current clustering method. There are three main types of evaluation: external validity evaluation, internal validity evaluation, and relevance testing.
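As an illustration of the data-preparation step above, z-score standardization rescales each feature to zero mean and unit variance so that no single feature dominates the similarity measure. This is a minimal NumPy sketch; the function name is illustrative:

```python
import numpy as np

def zscore(X):
    """Standardize each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    return (X - mu) / sigma

# Features on very different scales before standardization
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = zscore(X)
```

After this step, Euclidean distances used by the clustering stage weight each feature equally.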
Classical Clustering Algorithms
Classification of Clustering Algorithms
According to the object-similarity measure used during clustering, clustering algorithms can be divided into prototype-based, hierarchical, density-based, and model-based clustering.
Prototype clustering algorithm
Prototype clustering is also called "prototype-based clustering". This kind of algorithm assumes that the clustering structure can be characterized by a set of prototypes, an assumption that holds in many real clustering tasks. Typically the prototypes are initialized first and then refined iteratively; different prototype representations and different update rules yield different algorithms. A well-known prototype-based algorithm is introduced below.
The core idea of the K-Means clustering algorithm [3] is to find K cluster centers such that the sum of squared distances between each data point and its nearest cluster center is minimized. Its advantages are that it is much faster than hierarchical clustering and can partition large data sets efficiently. Its disadvantages are that it is only suitable for numerical data, it works well only when the clusters are convex, and it usually terminates at a local optimum.
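The alternating structure described above can be sketched in a few lines of NumPy (a minimal Lloyd's-algorithm sketch, not an optimized implementation; the random initialization from data points is one common choice among several):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm): alternate nearest-center
    assignment and centroid update until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged, typically at a local optimum
        centers = new_centers
    return labels, centers
```

Because the algorithm only stops when the centers stop moving, the result depends on the initialization, which is exactly the local-optimum behavior noted above.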
Hierarchical Clustering Algorithms
Hierarchical clustering, also known as tree clustering [4][5], builds a cluster tree by merging or splitting clusters in a nested, tree-like hierarchy. In the cluster tree, the leaf nodes are individual sample points and the root node contains all samples. Here the agglomerative algorithm AGNES is introduced as an example. AGNES adopts a bottom-up merging strategy: each data point in the sample set initially forms its own cluster; in each iteration the two closest clusters are found and merged, and the process is repeated until the number of clusters reaches a preset value.
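The bottom-up merging strategy can be sketched directly (a minimal AGNES sketch with an O(n^3)-style pairwise search, fine for small data; `linkage` selecting single or average inter-cluster distance is an illustrative choice):

```python
import numpy as np

def agnes(X, k, linkage="single"):
    """Bottom-up agglomerative clustering: start with singleton clusters
    and repeatedly merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair = D[np.ix_(clusters[a], clusters[b])]
                d = pair.min() if linkage == "single" else pair.mean()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for lbl, members in enumerate(clusters):
        labels[members] = lbl
    return labels
```

Stopping at `k` clusters corresponds to cutting the cluster tree at the level with the preset number of clusters.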
Density Clustering Algorithms [6]
Density-based clustering differs from the traditional algorithms above: it discovers clusters of arbitrary shape from the density of the data samples in space and is highly robust to noise. A classical density-based clustering algorithm follows.
DBSCAN is a well-known density-based clustering algorithm that characterizes how tightly the sample data are distributed using a pair of "neighborhood" parameters. Formally, DBSCAN defines a cluster as the largest set of densely connected samples derived from the density-reachability relation. Its advantages are fast clustering, strong robustness to noise, and the ability to find non-convex clusters.
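The density-reachability idea can be sketched as follows (a minimal DBSCAN sketch using the standard `eps` radius and `min_pts` neighborhood parameters; it recomputes all pairwise distances up front, so it is only suitable for small data sets):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: grow clusters outward from core points through the
    density-reachability relation. Returns labels; -1 marks noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        queue = list(neighbors[i])  # expand outward from the core point i
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:  # j is also a core point
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

Points that are never reached from any core point keep the label -1, which is the anti-noise behavior described above.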
Model Clustering Algorithms
A model-based clustering algorithm constructs a distribution model for each cluster that may exist in the data, estimates the model parameters from the observed samples, and then uses the fitted model to cluster the data. A classical model-based clustering algorithm is introduced below.
Gaussian Mixture Model (GMM) clustering [7] uses the Gaussian distribution as its parametric model: the model parameters are estimated from the available sample points, and the points are then assigned according to those parameters. The process is repeated until the change in the parameters falls below a threshold or the maximum number of iterations is reached, at which point training is complete. The algorithm can account for the influence of hidden variables on the clustering results.
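The iterative estimation described above is the EM algorithm. A minimal one-dimensional sketch is shown below (it runs a fixed number of iterations rather than checking the parameter change, and the quantile-based initialization is an illustrative choice):

```python
import numpy as np

def gmm_em(x, k=2, n_iter=200):
    """EM for a 1-D Gaussian mixture: the E-step computes each component's
    responsibility for each point (the hidden variable), the M-step
    re-estimates weights, means and variances from those responsibilities."""
    w = np.full(k, 1.0 / k)                          # mixture weights
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))  # spread-out initial means
    var = np.full(k, x.var())                        # initial variances
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

Each point is finally assigned to the component with the highest responsibility, giving a soft clustering that hardens into cluster labels.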
Experiment
- Performance measurement methods
- Accuracy (ACC)
Accuracy is the proportion of correctly classified data points in the whole data set: ACC = (|C1| + |C2| + … + |CK|) / n. Here K is the number of clusters, n is the number of sample data points, and |Ci| is the number of sample points correctly assigned to cluster Ci.
- Recall (RE)
Recall is computed per category as the ratio of correctly classified points to the total number of points that truly belong to that category: REi = |Ci| / (|Ci| + |Ai|). Here |Ci| is the number of sample points correctly assigned to cluster Ci, and |Ai| is the number of points that belong to cluster Ci but were incorrectly assigned to other clusters.
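Both measures follow directly from their definitions (a minimal sketch; the per-cluster counts are assumed to have already been obtained by matching predicted clusters to true classes):

```python
def acc(correct_counts, n):
    """ACC: total correctly clustered points over all K clusters,
    divided by the data set size n."""
    return sum(correct_counts) / n

def recall(correct, misplaced):
    """RE for one cluster Ci: correctly assigned points |Ci| divided by
    all points that truly belong to Ci, i.e. |Ci| + |Ai|."""
    return correct / (correct + misplaced)
```

For example, with three clusters of 40, 35, and 20 correctly placed points in a data set of 100, the accuracy is 0.95.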
- The Effect of Density Clustering and Hierarchical Clustering
AGNES: (result figure omitted)
DBSCAN: eps = 0.4, minPts = 9 (result figure omitted)
Conclusion
This paper introduces four types of clustering methods through their representative algorithms: K-Means, AGNES, DBSCAN, and GMM. On the Iris data set, the DBSCAN and AGNES algorithms are used to cluster the Sepal and Petal attributes respectively. Although cluster analysis has a long history and many excellent clustering algorithms have appeared one after another, driving vigorous development in related application areas, clustering problems still pose enormous challenges.
References:
1. Jain AK, Dubes RC. Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series, 1988: 1-334.
2. 孙吉贵, 刘杰, 赵连宇. Research on clustering algorithms. Journal of Software, 2008(01): 48-61 (in Chinese).
3. Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning. Massachusetts: MIT Press, 2017: 94.
4. Marques JP, written; Wu YF, trans. Pattern Recognition: Concepts, Methods and Applications. 2nd ed. Beijing: Tsinghua University Press, 2002: 51-74 (in Chinese).
5. Fred ALN, Leitão JMN. Partitional vs hierarchical clustering using a minimum grammar complexity approach. In: Proc. of SSPR&SPR 2000. LNCS 1876, 2000: 193-202. http://www.sigmod.org/dblp/db/conf/sspr/sspr2000.html
6. 王玉晗, 罗邓三郎. A survey of clustering algorithms. Science & Technology Information, 2018, 16(24): 10-11 (in Chinese).
7. 周志华. Machine Learning. Beijing: Tsinghua University Press, 2016: 199-214 (in Chinese).