Machine Learning (3): k-Means Clustering


Original article: http://blog.csdn.net/hjimce/article/details/45200985

Author: hjimce

Gaussian mixture models and k-means are two of the most commonly used and simplest clustering algorithms. This post introduces the k-means algorithm first.

I. A brief theoretical introduction to K-means

K-means is a hard clustering algorithm and a classic representative of prototype-based, objective-function clustering: it takes a distance from each data point to a prototype as the objective to be optimized, and derives its iterative update rules from minimizing that function. K-means uses Euclidean distance as the similarity measure and, given an initial set of cluster center vectors V, seeks the partition that minimizes the evaluation index J. The clustering criterion is the sum-of-squared-errors function; in short, the goal is to minimize the following quantity:

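The original post shows this criterion as an image; written out in standard notation (the usual sum-of-squared-errors form, reconstructed here rather than copied from the lost figure), it is

    J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the i-th cluster and \mu_i is its center (prototype). K-means searches for the assignment of the data points to the K clusters that makes J as small as possible.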

The algorithm proceeds as follows (a small sketch of this loop is given right after the list):
1) From the N samples, randomly pick K of them as the initial centroids.
2) For each remaining sample, measure its distance to every centroid and assign it to the class of the nearest centroid.
3) Recompute the centroid of each resulting class.
4) Repeat steps 2-3 until the new centroids equal the previous ones (or move by less than a given threshold); then the algorithm stops.
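To make these four steps concrete, here is a minimal C++ sketch of the loop on 2-D points. It is an illustration written for this post rather than a reference implementation: it assumes squared Euclidean distance, a crude random initialization (which may pick duplicate centers), and a fixed iteration cap in place of a real convergence test.

    #include <cstdlib>
    #include <vector>
    #include <limits>

    struct Point { double x, y; };

    // Squared Euclidean distance between two points.
    static double sqDist(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    // Plain k-means (Lloyd's algorithm): returns one cluster label per point.
    std::vector<int> kmeansSketch(const std::vector<Point>& data, int K, int maxIter = 30) {
        std::vector<Point> centers(K);
        std::vector<int> labels(data.size(), -1);

        // Step 1: pick K random samples as the initial centroids (crude; may repeat).
        for (int k = 0; k < K; ++k)
            centers[k] = data[std::rand() % data.size()];

        for (int it = 0; it < maxIter; ++it) {
            // Step 2: assign every sample to its nearest centroid.
            for (size_t i = 0; i < data.size(); ++i) {
                double best = std::numeric_limits<double>::max();
                for (int k = 0; k < K; ++k) {
                    double d = sqDist(data[i], centers[k]);
                    if (d < best) { best = d; labels[i] = k; }
                }
            }
            // Step 3: recompute each centroid as the mean of its members.
            std::vector<Point> sum(K, Point{0.0, 0.0});
            std::vector<int> count(K, 0);
            for (size_t i = 0; i < data.size(); ++i) {
                sum[labels[i]].x += data[i].x;
                sum[labels[i]].y += data[i].y;
                ++count[labels[i]];
            }
            for (int k = 0; k < K; ++k)
                if (count[k] > 0)
                    centers[k] = Point{sum[k].x / count[k], sum[k].y / count[k]};
            // Step 4: a fixed iteration cap stands in for the convergence test.
        }
        return labels;
    }

A production implementation would also guard against duplicate or empty clusters and stop early once the centers stop moving, which is what the MATLAB and OpenCV versions below approximate.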

II. Implementing K-means

Implementing k-means breaks down into a few simple steps. Suppose the data set is a two-dimensional matrix data; plotting it with MATLAB gives the following:

Plotting code:

[Figure: the MATLAB plotting code; the same plot call appears at the top of the full listing below]

Plotting result:

[Figure: scatter plot of the raw two-dimensional data]

Now suppose we want to partition this data set into 4 clusters. The steps are:

1) Initialize the cluster centers.

Since the number of clusters is 4, we use MATLAB to draw 4 distinct random integers, each no larger than the number of data points, and take the corresponding points as the initial cluster centers A, B, C, D.

2) Compute the distance from every point to each of the 4 cluster centers and find the smallest one. Let p be a point in the data set; find which of A, B, C, D is closest to p. If B is closest, then p is assigned to class B.

3) Update the cluster centers. Step 2 yields a clustering of the data set; from it, compute the centroid (mean position) of each class and use it as the new cluster center. Then return to step 2 and re-cluster. Steps 2 and 3 are repeated until the iteration converges.
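Written as formulas (the standard k-means update rules, stated here for clarity rather than taken from the original post):

    label(p) = \arg\min_{j \in \{1,\dots,K\}} \lVert p - \mu_j \rVert        (step 2: assign p to the nearest center)

    \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{p \in C_j} p                   (step 3: move each center to the mean of its class C_j)

Neither update can increase the objective J defined above, so J is non-increasing and bounded below, which is why the loop converges.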

Finally, the full code:

close all; clear; clc;
% % Generate Gaussian random samples, cluster 1
% mu = [2 3];
% SIGMA = [1 0; 0 2];
% r1 = mvnrnd(mu,SIGMA,100);
% plot(r1(:,1),r1(:,2),'r+');
% hold on;
% % Generate Gaussian random samples, cluster 2
% mu = [7 8];
% SIGMA = [1 0; 0 2];
% r2 = mvnrnd(mu,SIGMA,100);
% plot(r2(:,1),r2(:,2),'*')
% data=[r1;r2];

data=importdata('data.txt');
% Algorithm flow
figure(1);
plot(data(:,1),data(:,2),'*');
figure(2);
[m n]=size(data);
% Number of clusters is 4, so draw 4 distinct random integers
p=randperm(m);
kn=4;
d=p(1:kn);
center(1:kn,:)=data(d(1:kn),:);
flag=zeros(1,m);
% Assign each point to the nearest of the kn centroids
it=1;
while(it<30)
    for ii=1:m
        flag(ii)=-1;
        mindist=inf;
        for jj=1:kn
            dst=norm(data(ii,:)-center(jj,:));
            if mindist>dst
                mindist=dst;
                flag(ii)=jj;
            end
        end
    end
    % Update the cluster centers
    center=zeros(size(center));
    countflag=zeros(1,kn);
    for ii=1:m
        for jj=1:kn
            if(flag(ii)==jj)
                center(jj,:)=center(jj,:)+data(ii,:);
                countflag(jj)=countflag(jj)+1;
            end
        end
    end
    for jj=1:kn
        center(jj,:)=center(jj,:)./countflag(jj);
    end
    it=it+1;
end
% Plot each cluster in a different color
hold on;
for i=1:m
    if flag(i)==1
        plot(data(i,1),data(i,2),'.y');
    elseif flag(i)==2
        plot(data(i,1),data(i,2),'.b');
    elseif flag(i)==3
        plot(data(i,1),data(i,2),'.k');
    elseif flag(i)==4
        plot(data(i,1),data(i,2),'.r');
    end
end

The clustering result:

[Figure: clustering result, with the four clusters shown in different colors]


The OpenCV version of kmeans:
Declaration:
enum
{
    KMEANS_RANDOM_CENTERS=0, // Chooses random centers for k-Means initialization
    KMEANS_PP_CENTERS=2, // Uses k-Means++ algorithm for initialization
    KMEANS_USE_INITIAL_LABELS=1 // Uses the user-provided labels for K-Means initialization
};

//! clusters the input data using k-Means algorithm
CV_EXPORTS_W double kmeans( InputArray data, int K, CV_OUT InputOutputArray bestLabels,
                            TermCriteria criteria, int attempts,
                            int flags, OutputArray centers=noArray() );
Implementation:
double kmeans( const Mat& data, int K, Mat& best_labels,
               TermCriteria criteria, int attempts,
               int flags, Mat* _centers )
{
    const int SPP_TRIALS = 3;
    int N = data.rows > 1 ? data.rows : data.cols;
    int dims = (data.rows > 1 ? data.cols : 1)*data.channels();
    int type = data.depth();
    bool simd = checkHardwareSupport(CV_CPU_SSE);
    attempts = std::max(attempts, 1);
    CV_Assert( type == CV_32F && K > 0 );

    Mat _labels;
    if( flags & CV_KMEANS_USE_INITIAL_LABELS )
    {
        CV_Assert( (best_labels.cols == 1 || best_labels.rows == 1) &&
                   best_labels.cols*best_labels.rows == N &&
                   best_labels.type() == CV_32S &&
                   best_labels.isContinuous());
        best_labels.copyTo(_labels);
    }
    else
    {
        if( !((best_labels.cols == 1 || best_labels.rows == 1) &&
              best_labels.cols*best_labels.rows == N &&
              best_labels.type() == CV_32S &&
              best_labels.isContinuous()))
            best_labels.create(N, 1, CV_32S);
        _labels.create(best_labels.size(), best_labels.type());
    }
    int* labels = _labels.ptr<int>();

    Mat centers(K, dims, type), old_centers(K, dims, type);
    vector<int> counters(K);
    vector<Vec2f> _box(dims);
    Vec2f* box = &_box[0];

    double best_compactness = DBL_MAX, compactness = 0;
    RNG& rng = theRNG();
    int a, iter, i, j, k;

    if( criteria.type & TermCriteria::EPS )
        criteria.epsilon = std::max(criteria.epsilon, 0.);
    else
        criteria.epsilon = FLT_EPSILON;
    criteria.epsilon *= criteria.epsilon;

    if( criteria.type & TermCriteria::COUNT )
        criteria.maxCount = std::min(std::max(criteria.maxCount, 2), 100);
    else
        criteria.maxCount = 100;

    if( K == 1 )
    {
        attempts = 1;
        criteria.maxCount = 2;
    }

    const float* sample = data.ptr<float>(0);
    for( j = 0; j < dims; j++ )
        box[j] = Vec2f(sample[j], sample[j]);

    for( i = 1; i < N; i++ )
    {
        sample = data.ptr<float>(i);
        for( j = 0; j < dims; j++ )
        {
            float v = sample[j];
            box[j][0] = std::min(box[j][0], v);
            box[j][1] = std::max(box[j][1], v);
        }
    }

    for( a = 0; a < attempts; a++ )
    {
        double max_center_shift = DBL_MAX;
        for( iter = 0; iter < criteria.maxCount && max_center_shift > criteria.epsilon; iter++ )
        {
            swap(centers, old_centers);

            if( iter == 0 && (a > 0 || !(flags & KMEANS_USE_INITIAL_LABELS)) )
            {
                if( flags & KMEANS_PP_CENTERS )
                    generateCentersPP(data, centers, K, rng, SPP_TRIALS);
                else
                {
                    for( k = 0; k < K; k++ )
                        generateRandomCenter(_box, centers.ptr<float>(k), rng);
                }
            }
            else
            {
                if( iter == 0 && a == 0 && (flags & KMEANS_USE_INITIAL_LABELS) )
                {
                    for( i = 0; i < N; i++ )
                        CV_Assert( (unsigned)labels[i] < (unsigned)K );
                }

                // compute centers
                centers = Scalar(0);
                for( k = 0; k < K; k++ )
                    counters[k] = 0;

                for( i = 0; i < N; i++ )
                {
                    sample = data.ptr<float>(i);
                    k = labels[i];
                    float* center = centers.ptr<float>(k);
                    for( j = 0; j <= dims - 4; j += 4 )
                    {
                        float t0 = center[j] + sample[j];
                        float t1 = center[j+1] + sample[j+1];
                        center[j] = t0;
                        center[j+1] = t1;
                        t0 = center[j+2] + sample[j+2];
                        t1 = center[j+3] + sample[j+3];
                        center[j+2] = t0;
                        center[j+3] = t1;
                    }
                    for( ; j < dims; j++ )
                        center[j] += sample[j];
                    counters[k]++;
                }

                if( iter > 0 )
                    max_center_shift = 0;

                for( k = 0; k < K; k++ )
                {
                    float* center = centers.ptr<float>(k);
                    if( counters[k] != 0 )
                    {
                        float scale = 1.f/counters[k];
                        for( j = 0; j < dims; j++ )
                            center[j] *= scale;
                    }
                    else
                        generateRandomCenter(_box, center, rng);

                    if( iter > 0 )
                    {
                        double dist = 0;
                        const float* old_center = old_centers.ptr<float>(k);
                        for( j = 0; j < dims; j++ )
                        {
                            double t = center[j] - old_center[j];
                            dist += t*t;
                        }
                        max_center_shift = std::max(max_center_shift, dist);
                    }
                }
            }

            // assign labels
            compactness = 0;
            for( i = 0; i < N; i++ )
            {
                sample = data.ptr<float>(i);
                int k_best = 0;
                double min_dist = DBL_MAX;

                for( k = 0; k < K; k++ )
                {
                    const float* center = centers.ptr<float>(k);
                    double dist = distance(sample, center, dims, simd);

                    if( min_dist > dist )
                    {
                        min_dist = dist;
                        k_best = k;
                    }
                }

                compactness += min_dist;
                labels[i] = k_best;
            }
        }

        if( compactness < best_compactness )
        {
            best_compactness = compactness;
            if( _centers )
                centers.copyTo(*_centers);
            _labels.copyTo(best_labels);
        }
    }

    return best_compactness;
}

How to call it (this snippet uses k-means to initialize GMM components for foreground/background pixel samples):

const int kMeansItCount = 10;  // number of iterations
const int kMeansType = cv::KMEANS_PP_CENTERS;  // Use kmeans++ center initialization by Arthur and Vassilvitskii
cv::Mat bgdLabels, fgdLabels;  // for each background/foreground sample pixel, record which GMM Gaussian component it belongs to (kn in the paper)
// In kmeans, the parameter _bgdSamples holds one sample per row
// kmeans outputs bgdLabels, which stores the cluster label of every input sample (after clustering into componentsCount classes)
kmeans( _fgdSamples, GMM::componentsCount, fgdLabels,
        cv::TermCriteria( CV_TERMCRIT_ITER, kMeansItCount, 0.0), 0, kMeansType );
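For a self-contained illustration (my own minimal example, not from the original post; it assumes OpenCV 2.x headers and the flags shown in the declaration above), cv::kmeans can be called on synthetic 2-D points like this:

    #include <opencv2/core/core.hpp>
    #include <iostream>

    int main()
    {
        // 100 random 2-D points, one sample per row, CV_32F as kmeans requires.
        cv::Mat points(100, 2, CV_32F);
        cv::randu(points, cv::Scalar::all(0), cv::Scalar::all(10));

        cv::Mat labels, centers;
        int K = 4;
        // Stop after 10 iterations or when the centers move less than 1.0;
        // run 3 attempts with k-means++ initialization and keep the best
        // (lowest-compactness) result.
        double compactness = cv::kmeans(points, K, labels,
            cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 10, 1.0),
            3, cv::KMEANS_PP_CENTERS, centers);

        std::cout << "compactness = " << compactness << std::endl;
        std::cout << "centers = " << std::endl << centers << std::endl;
        return 0;
    }

Using KMEANS_PP_CENTERS rather than purely random centers tends to give a better starting point and fewer empty clusters, at a small extra initialization cost.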

************ Author: hjimce    Contact QQ: 1393852684    For more resources, follow my blog: http://blog.csdn.net/hjimce    Original article, all rights reserved; please keep these two lines when reposting ************



