【Posted】: 2017-03-05 00:58:08
【Question】:
I am using scikit-learn to fit a Dirichlet process Gaussian mixture model:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html
That is, the model is sklearn.mixture.BayesianGaussianMixture(), whose weight_concentration_prior_type defaults to 'dirichlet_process'. In k-means the user fixes the number of clusters "k" a priori; a DPGMM, by contrast, is an infinite mixture model in which the Dirichlet process serves as the prior distribution over the number of clusters.
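My DPGMM model always reports exactly n_components clusters. As discussed in the question below, the suggested way to handle this is to "reduce redundant components" using predict(X):
Scikit-Learn's DPGMM fitting: number of components?
However, the example linked there does not actually remove the redundant components or report the "correct" number of clusters in the data. It simply plots the correct number of clusters.
http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html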
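As far as I understand it, the predict(X)-based idea amounts to something like the sketch below. The synthetic data and the names X_demo, used_components, and compact_labels are my own assumptions for illustration and do not come from the linked answer or example.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in data (assumption): three well-separated 2-D blobs.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.randn(200, 2) for c in centers])

# n_components is only an upper bound on the number of clusters.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
)
labels_demo = model.fit(X_demo).predict(X_demo)

# Components that were assigned at least one point, with their point counts.
used_components, counts = np.unique(labels_demo, return_counts=True)

# Relabel to 0..k-1 so that unused component indices drop out of the labels.
compact_labels = np.searchsorted(used_components, labels_demo)

print("components in use:", used_components)
print("points per component:", counts)
```

Here len(used_components) would be the number of clusters actually in use, which can be smaller than n_components.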
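How does a user actually remove the redundant components and output an array showing which components remain? Is this the "official" (or only) way to remove redundant clusters?
Here is my code: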
>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> from sklearn import mixture
>>> X = pd.read_csv(....) # my matrix
>>> X.shape
(20000, 48)
>>> dpgmm3 = mixture.BayesianGaussianMixture(n_components = 20, weight_concentration_prior_type='dirichlet_process', max_iter = 1000, verbose = 2)
>>> dpgmm3.fit(X) # Fitting the DPGMM model
>>> labels = dpgmm3.predict(X) # Generating labels after model is fitted
>>> max(labels)
>>> np.unique(labels) # Number of labels == n_components specified above
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
#Trying with a different n_components
>>> dpgmm3_1 = mixture.BayesianGaussianMixture( weight_concentration_prior_type='dirichlet_process', max_iter = 1000) #not specifying n_components
>>> dpgmm3_1.fit(X)
>>> labels_1 = dpgmm3_1.predict(X)
>>> labels_1
array([0, 0, 0, ..., 0, 0, 0]) #All were classified under the same label
#Trying with n_components = 7
>>> dpgmm3_2 = mixture.BayesianGaussianMixture(n_components = 7, weight_concentration_prior_type='dirichlet_process', max_iter = 1000)
>>> dpgmm3_2.fit(X)
>>> labels_2 = dpgmm3_2.predict(X)
>>> np.unique(labels_2)
array([0, 1, 2, 3, 4, 5, 6]) #number of labels == n_components
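Another thing that could be inspected is the fitted weights_ attribute, since a redundant component would presumably end up with a weight close to zero. The sketch below uses synthetic data, and the cutoff min_weight is a hypothetical value of my own choosing, not something scikit-learn prescribes.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in data (assumption): two well-separated 2-D blobs.
rng = np.random.RandomState(1)
X_demo = np.vstack([rng.randn(300, 2), rng.randn(300, 2) + 8.0])

model = BayesianGaussianMixture(
    n_components=7,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=1,
).fit(X_demo)

# Mixing weight of each of the 7 components after fitting.
weights = model.weights_

# Hypothetical cutoff: treat components below this weight as redundant.
min_weight = 0.01
kept = np.flatnonzero(weights > min_weight)

print("weights:", np.round(weights, 3))
print("components above cutoff:", kept)
```

Whether thresholding weights_ like this is preferable to counting the labels returned by predict(X) is part of what I am unsure about.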
【Comments】:
Tags: python python-3.x machine-learning statistics scikit-learn