scikit-learn - Clustering

scikit-learn - Machine Learning in Python
https://scikit-learn.org/stable/

scikit-learn - GitHub
https://github.com/scikit-learn/scikit-learn

Clustering of unlabeled data can be performed with the module sklearn.cluster.

Each clustering algorithm comes in two variants: a class that implements the fit method to learn the clusters on training data, and a function that, given training data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
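As a minimal sketch of the two variants, the example below clusters a small made-up array with both the KMeans class and the k_means function. The toy data and the settings n_clusters=2, n_init=10 and random_state=0 are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Two well-separated toy groups (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Class variant: fit() learns the clusters, labels are stored in labels_.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: takes the data and returns the result directly.
# For k_means the return value is a tuple (centroids, labels, inertia).
centers, func_labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)

print(class_labels)
print(func_labels)
```

Cluster ids are arbitrary integers, so the two variants should agree on which points belong together but not necessarily on which group is called 0 or 1.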

Input data

The algorithms implemented in this module accept different kinds of matrices as input. All the methods accept standard data matrices of shape (n_samples, n_features). These can be obtained from the classes in the sklearn.feature_extraction module. For AffinityPropagation, SpectralClustering and DBSCAN one can also input similarity matrices of shape (n_samples, n_samples). These can be obtained from the functions in the sklearn.metrics.pairwise module.
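The following sketch illustrates the second kind of input: a similarity matrix of shape (n_samples, n_samples) built with rbf_kernel from sklearn.metrics.pairwise and passed to SpectralClustering with affinity="precomputed". The synthetic blobs, the gamma value and the random seeds are assumptions made for this example.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic feature matrix of shape (n_samples, n_features) = (40, 2).
X, y = make_blobs(n_samples=40, centers=[[0, 0], [5, 5]],
                  cluster_std=0.3, random_state=0)

# Pairwise similarity matrix of shape (n_samples, n_samples) = (40, 40).
S = rbf_kernel(X, gamma=0.1)

# affinity="precomputed" tells the estimator that S already holds similarities.
model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(S)

print(X.shape, S.shape)
print(labels)
```

Note that the expected meaning of a precomputed matrix depends on the estimator, so it is worth checking each estimator's documentation before passing one in.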

Overview of clustering methods

Figure: A comparison of the clustering algorithms in scikit-learn

Non-flat geometry clustering is useful when the clusters have a specific shape, i.e. a non-flat manifold, and the standard Euclidean distance is not the right metric. This case arises in the two top rows of the figure above.
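To make the point about non-flat geometry concrete, here is a small sketch that compares KMeans with DBSCAN on two interleaved half-moons. The make_moons dataset, the eps and min_samples values, and the use of the adjusted Rand index as a score are all assumptions chosen for this illustration, not settings taken from the figure.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clusters that are not compact "balls".
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# KMeans relies on Euclidean distance to centroids and tends to cut each moon.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense neighbourhoods and can follow the curve.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI: {ari_dbscan:.2f}")
```

On data like this the density-based method would be expected to score noticeably higher, though the outcome is sensitive to eps and to the noise level.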

Gaussian mixture models, useful for clustering, are described in another chapter of the documentation dedicated to mixture models. KMeans can be seen as a special case of a Gaussian mixture model with equal covariance per component.
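The relationship between KMeans and Gaussian mixtures can be explored with a rough sketch like the one below. It fits both models to the same well-separated synthetic blobs and checks how closely the two partitions agree. The data, the seeds and the choice of covariance_type="spherical" are assumptions; the spherical option still lets each component learn its own variance, so it only approximates the equal-covariance case described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three compact, well-separated blobs (synthetic data for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.5, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Spherical covariance: one variance value per component.
gmm = GaussianMixture(n_components=3, covariance_type="spherical", random_state=0)
gmm_labels = gmm.fit_predict(X)

# Compare the two partitions; cluster ids may be permuted, so use ARI.
agreement = adjusted_rand_score(kmeans_labels, gmm_labels)
print(f"Agreement (ARI): {agreement:.2f}")
print("Per-component variances:", gmm.covariances_)
```

When the blobs overlap or differ strongly in spread, the two models can diverge, because the mixture model also estimates variances and mixing weights.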

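geometry [dʒɪ'ɒmɪtrɪ]: n. geometry (the mathematics of shape and space)
manifold ['mænɪfəʊld]: vt. to duplicate, copy, multiply, diversify; adj. many-sided, having many parts, of all kinds; n. a variety, a duplicate copy
euclidean: adj. Euclidean, of Euclidean geometry
metric ['metrɪk]: adj. metric, of the metric system, of the meter; n. standard of measurement
covariance [kəʊ'veərɪəns]: n. covariance, joint variation
mixture ['mɪkstʃə]: n. mixing, mixture, blend
dedicate ['dedɪkeɪt]: vt. to devote, commit oneself, inscribe (a work to someone)
component [kəm'pəʊnənt]: adj. constituent, composing; n. ingredient, part, element
affinity [ə'fɪnɪtɪ]: n. close relationship, attraction, relationship by marriage, similarity
propagation [,prɒpə'ɡeɪʃən]: n. spreading, reproduction, multiplication
spectral ['spektr(ə)l]: adj. of a spectrum, ghostly, spectral
hierarchical [haɪə'rɑːkɪk(ə)l]: adj. layered, of a ranked system
birch [bɜːtʃ]: n. birch wood, birch tree, birch rod; vt. to flog with a birch rod
agglomerative [ə'glɑmə,retɪv]: adj. tending to cluster together, sintered, coagulated
ward [wɔːd]: n. hospital ward, guarding, watch; vt. to ward off, guard, protect
linkage ['lɪŋkɪdʒ]: n. connection, combination, coupling, linkage mechanism
Mahalanobis distance: a distance measure that accounts for the covariance of the data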
