cluster by fast search and find of density peaks

说实在的这篇文章一发出来就遭到很多同行的鄙视，具体可以见Science上的评论，作为小菜，我还是保持沉默，好好学习为好。今天终于挤出时间来写这篇文章，闲话少说，直接上干货吧。

聚类是根据距离来进行的，关于距离也可以用一篇很长的文章来叙述，但是这篇文章不是关注这个方面，我们只关注有了距离d_ij以后，怎么更好的进行聚类，首先说一下这篇文章的优点吧：

a) 简单，里边不包含任何高深的迭代、梯度、隐函数等高深的词汇，基本上是个工科学生就能看懂；

b) 效率高，因为没有上述的过程，所以处理起来较为简单；

c) 能用对多种复杂的数据;

d) 对类别比较多时不敏感，这一点是很多经典的方法比不了的（kmean）。

这是我发现的优点，既然优点这么多，我们来具体看一下怎么实现的吧：

1. 计算局部密度

用d_ij代表数据点i和j之间的距离，i点的密度计算公式为

cluster by fast search and find of density peaks

上式中，X为一个符号函数，定义为：X(x)=1 如果x<0，否则X(x)=0。

关于d_c的确定，文章指出，d_c可以选择为平均每个点的邻居局部密度为数据总数的1-2%（具体做实验时还得调整，这可能不是特别让人满意）

代码如下：

 1 function y=get_P(dis,dc,label)
 2 [m,n]=size(dis);
 3 y=zeros(m,1);
 4 for i=1:m
 5     y(i)=0;
 6    for j=1:m
 7       if(i~=j)
 8           if label==0
 9              if(dis(i,j)-dc<0)
10                  y(i)=y(i)+1;
11              end    
12           else
13               y(i)=y(i)+exp(-(dis(i,j)/dc)^2);
14           end
15       end      
16    end    
17 end

确定d_c

 1 function y=get_dc(dis,percentage,kernel)
 2 max_dis=max(max(dis))/2;
 3 [m,n]=size(dis);
 4 i=0;
 5 for i=0:max_dis/1000:max_dis
 6     p=get_P(dis,i,kernel);
 7     p_ave=sum(p)/m;
 8     if(p_ave>m*percentage)
 9         break;
10     end
11 end
12 y=i;
13     
14

2. 计算局部密度比本身大且距离本身最小的点

这个主要用于聚类，往后就会看到，聚类算法采用的是一种依附的策略，采用递归方法得到自身的类别，现在先做个铺垫。

cluster by fast search and find of density peaks

对于每一个点来说，只要有密度比他大的点，那么这个phi一定存在；那么对于局部密度最大的点如何计算呢：

cluster by fast search and find of density peaks

想想作者还是蛮聪明的，这样计算的话直接把密度最大的点给识别出来了。

上matlab代码

 1 function y=get_phi(dis,p)
 2 [m,n]=size(dis);
 3 y=zeros(m,2);
 4 [p1,index]=sort(p);
 5 for i=1:m-1
 6     tmp_dis=dis(index(i),:);
 7     tmp_dis(index(1:i))=max(max(dis));
 8     [a,b]=min(tmp_dis);   
 9     y(index(i),:)=[a,b];
10 end
11 y(index(m),1)=max(dis(index(m),:));
12 y(index(m),2)=index(m);

具体编写程序的时候还是有一些技巧的。

ok 有了p_i和 cluster by fast search and find of density peaks ，就可以画出一个decision graph的东西，文章中给出的是A：

cluster by fast search and find of density peaks

越靠近右上角的点越容易被认为是聚类中心点（广义上的），或者我们也可以计算

cluster by fast search and find of density peaks

画出来的图是B

实际上，在做实验的过程中，我发现通过B这个图更容易看出聚类中心有几个。

上面找到了聚类中心，剩下的问题就是如何聚类了。

3. 聚类过程

上面我们得到了聚类中心，每个点的比其密度大的最近的点，我们利用这两个信息进行聚类：

举个例子：

cluster by fast search and find of density peaks

这个例子中的聚类中心是1和10，对18这个点进行聚类。

a)首先初始化1和10的label，分别为label_1和label_2,其他的不是聚类中心的点都没有聚类的label

b) 比18密度大的最近的点是4，这时如果4是有label的，那么将4的label赋给18，聚类完成；显然此时18没有聚类，那么进入 c)

c）比4密度大的最近的点是1，而1是有label的，那么将18的label标识为label_1，完成。

实际操作中可能b)和c)要进行很多次循环才行，但是这个肯定是个线性的，并且递归的深度也不会太深，这样就完成了聚类过程。

代码：

 1 function new_label=cluster_density(phi_index,r,cluster_num)
 2 
 3 m=length(phi_index);
 4 new_label=zeros(m,1);
 5 [r1,index_r1]=sort(r,'descend');
 6 
 7 for i=1:cluster_num
 8     new_label(index_r1(i))=i;
 9     tt=index_r1(i);
10 end
11 
12 for i=1:m
13     if (new_label(i)==0)
14         j=i;
15         while(new_label(j)==0)
16             j=phi_index(j);
17         end
18         new_label(i)=new_label(j);
19     end
20     
21 end

算法到这里大体完成了，下面给出几个聚类结果的例子：

原始和决策图

cluster by fast search and find of density peaks

结果

cluster by fast search and find of density peaks

文章链接：

Alex Rodriguez and Alessandro Laio, Clustering by fast search and find of density peaks, Science 344, 1492(2014)

http://www.sciencemag.org/content/344/6191/1492.full

原创文章，转载请注明出处，谢谢！