
Clustering Algorithms - Overview




Introduction to Clustering

Clustering methods are among the most useful unsupervised ML methods. They are used to find similarity and relationship patterns among data samples, and then to group those samples into clusters based on feature similarity.

Clustering is important because it determines the intrinsic grouping among the present unlabeled data. Clustering algorithms make some assumptions about data points to constitute their similarity, and each assumption constructs different but equally valid clusters.

For example, the diagram below shows a clustering system grouping similar kinds of data into different clusters −


Cluster Formation Methods

Clusters need not be spherical in shape. The following are some cluster formation methods −

Density-based

In these methods, the clusters are formed as dense regions. The advantage of these methods is that they have good accuracy as well as a good ability to merge two clusters. Examples include Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS), etc.
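As a sketch of why density-based methods are useful, the following uses scikit-learn's DBSCAN (not part of the original text) on a synthetic "two moons" dataset, a non-spherical shape that density-based methods separate cleanly. The `eps` and `min_samples` values are illustrative choices for this data, not recommended defaults.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons with a little noise.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Label -1 marks noise points; the remaining labels are cluster IDs.
n_clusters = len(set(labels) - {-1})
print("clusters found:", n_clusters)
```

A centroid-based method such as K-means would typically cut each moon in half here, because the moons are not spherical.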

Hierarchical-based

In these methods, the clusters are formed as a tree-like structure based on a hierarchy. They fall into two categories, namely Agglomerative (bottom-up approach) and Divisive (top-down approach). Examples include Clustering Using Representatives (CURE), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), etc.

Partitioning

In these methods, the clusters are formed by partitioning the objects into k clusters, so the number of clusters equals the number of partitions. Examples include K-means and Clustering Large Applications based upon Randomized Search (CLARANS).

Grid-based

In these methods, the clusters are formed as a grid-like structure. The advantage of these methods is that all the clustering operations done on these grids are fast and independent of the number of data objects. Examples include Statistical Information Grid (STING) and Clustering in Quest (CLIQUE).

Measuring Clustering Performance

One of the most important considerations regarding an ML model is assessing its performance, or in other words, the model's quality. In the case of supervised learning algorithms, assessing the quality of our model is easy because we already have labels for every example.

On the other hand, in the case of unsupervised learning algorithms we are not that fortunate, because we deal with unlabeled data. Still, we have some metrics that give the practitioner insight into how the clusters change depending on the algorithm.

Before we dive deep into such metrics, we must understand that they only evaluate the comparative performance of models against each other, rather than measuring the validity of a model's predictions. The following are some of the metrics that we can deploy on clustering algorithms to measure the quality of the model −

Silhouette Analysis

Silhouette analysis is used to check the quality of a clustering model by measuring the distance between the clusters. It basically provides a way to assess parameters such as the number of clusters with the help of the Silhouette score. This score measures how close each point in one cluster is to the points in the neighboring clusters.

Analysis of Silhouette Score

The range of the Silhouette score is [-1, 1]. Its analysis is as follows −

  • +1 Score − A Silhouette score near +1 indicates that the sample is far away from its neighboring cluster.

  • 0 Score − A Silhouette score of 0 indicates that the sample is on or very close to the decision boundary separating two neighboring clusters.

  • -1 Score − A Silhouette score near -1 indicates that the sample has been assigned to the wrong cluster.

The Silhouette score can be calculated using the following formula −

$$Silhouette\:score=\frac{(b-a)}{max(a,b)}$$

Here, b = mean distance to the points in the nearest cluster

And, a = mean intra-cluster distance to all the points.
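As a sketch of how the score is used in practice, the following computes the Silhouette score with scikit-learn on synthetic data (the blob centers are illustrative values of our own, not from the original text). Because the three blobs are well separated, the score should land near +1.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [5, 5], [9, 1]],
                  cluster_std=0.6, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean of (b - a) / max(a, b) over all samples.
score = silhouette_score(X, labels)
print(round(score, 3))
```

Running the same computation for several candidate values of K and picking the K with the highest score is a common way to choose the number of clusters.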

Davies-Bouldin Index

The Davies-Bouldin (DB) index is another good metric for analyzing clustering algorithms. With the help of the DB index, we can understand the following points about a clustering model −

  • Whether the clusters are well-spaced from each other or not.

  • How dense the clusters are.

We can calculate the DB index with the help of the following formula −

$$DB=\frac{1}{n}\displaystyle\sum\limits_{i=1}^n max_{j\neq{i}}\left(\frac{\sigma_{i}+\sigma_{j}}{d(c_{i},c_{j})}\right)$$


Here, n = number of clusters

σi = average distance of all points in cluster i from the cluster centroid ci

d(ci, cj) = distance between the centroids ci and cj.

The lower the DB index, the better the clustering model.
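As a sketch, scikit-learn provides this metric directly as `davies_bouldin_score` (the synthetic blob centers below are illustrative values of our own). Because the blobs are dense and well separated, the score should be low.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Dense, well-separated synthetic blobs: expect a low (good) DB index.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [5, 5], [9, 1]],
                  cluster_std=0.6, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
db = davies_bouldin_score(X, labels)
print(round(db, 3))
```

Note the opposite orientation to the Silhouette score: here lower is better.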

Dunn Index

It works in the same spirit as the DB index, but the two differ in the following points −

  • The Dunn index considers only the worst case, i.e. the clusters that are close together, while the DB index considers the dispersion and separation of all the clusters in the clustering model.

  • The Dunn index increases as performance improves, while the DB index gets better (lower) when the clusters are well-spaced and dense.

We can calculate the Dunn index with the help of the following formula −

$$D=\frac{min_{1\leq i <{j}\leq{n}}P(i,j)}{max_{1\leq i < k \leq n}q(k)}$$


Here, i, j, k = indices of the clusters

P = inter-cluster distance

q = intra-cluster distance
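scikit-learn has no built-in Dunn index, so the following is a minimal NumPy sketch of the formula above; the helper name `dunn_index` and the toy data are our own. It uses the minimum pairwise distance between points of different clusters as P and the maximum cluster diameter as q.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster distance divided by the
    maximum intra-cluster diameter. Higher is better."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Maximum intra-cluster distance (diameter) of each cluster.
    diameters = [np.max(np.linalg.norm(c[:, None] - c[None, :], axis=-1))
                 for c in clusters]
    # Minimum distance between points belonging to different clusters.
    min_inter = min(
        np.min(np.linalg.norm(a[:, None] - b[None, :], axis=-1))
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_inter / max(diameters)

# Two tight, well-separated toy clusters: the Dunn index should be large.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(dunn_index(X, labels), 2))
```

This brute-force version is O(n²) in the number of points, which is fine for small data but slow on large datasets.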

Types of ML Clustering Algorithms

The following are the most important and useful ML clustering algorithms −


K-means Clustering

This clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters the algorithm identifies from the data is represented by 'K' in K-means.
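As a minimal sketch with scikit-learn (the synthetic blob centers are illustrative values of our own), note that K must be supplied up front, which is exactly the assumption described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four synthetic blobs; K-means must be told K = 4 in advance.
X, _ = make_blobs(n_samples=400, centers=[[1, 1], [6, 1], [1, 6], [6, 6]],
                  cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (4, 2): one 2-D centroid per cluster
```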

Mean-Shift Algorithm

It is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not make any assumptions about the number of clusters; hence it is a non-parametric algorithm.
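As a sketch with scikit-learn (the blob centers and `quantile` value are illustrative choices of our own), note that unlike the K-means example, no cluster count is passed in; the bandwidth is estimated from the data and the algorithm discovers the number of clusters itself.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three synthetic blobs at illustrative positions.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [5, 5], [3, 10]],
                  cluster_std=0.6, random_state=0)

# The kernel bandwidth is estimated from the data itself, so the
# number of clusters is never specified in advance.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print("clusters found:", len(ms.cluster_centers_))
```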

Hierarchical Clustering

It is another unsupervised learning algorithm that is used to group together unlabeled data points having similar characteristics.
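As a sketch of the agglomerative (bottom-up) variant described earlier, using scikit-learn on synthetic data (the blob centers are illustrative values of our own): every point starts as its own cluster and the closest pairs are merged until the requested number of clusters remains.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Three synthetic blobs at illustrative positions.
X, _ = make_blobs(n_samples=150, centers=[[1, 1], [5, 5], [9, 1]],
                  cluster_std=0.6, random_state=0)

# Ward linkage merges the pair of clusters that least increases
# total within-cluster variance at each step.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(len(set(agg.labels_)))
```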

We will be discussing all these algorithms in detail in the upcoming chapters.


Applications of Clustering

We can find clustering useful in the following areas −


Data summarization and compression − Clustering is widely used in areas where we require data summarization, compression, and reduction. Examples are image processing and vector quantization.

Collaborative systems and customer segmentation − Since clustering can be used to find similar products or the same kind of users, it can be used in the areas of collaborative systems and customer segmentation.

Serve as a key intermediate step for other data mining tasks − Cluster analysis can generate a compact summary of data for classification, testing, and hypothesis generation; hence, it also serves as a key intermediate step for other data mining tasks.

Trend detection in dynamic data − Clustering can also be used for trend detection in dynamic data by building clusters of similar trends.

Social network analysis − Clustering can be used in social network analysis. Examples are generating sequences in images, videos, or audio.

Biological data analysis − Clustering can also be used to make clusters of images and videos; hence it can successfully be used in biological data analysis.


Translated from: https://www.tutorialspoint.com/machine_learning_with_python/clustering_algorithms_overview.htm
