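Experiment description:
The algorithm is tested on two datasets: the 20-newsgroups dataset and the Yahoo news dataset. The 20-newsgroups dataset contains 20000 text records, each stored as a 26099-dimensional vector; the Yahoo news dataset contains 2340 records, each stored as a 21839-dimensional vector.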
![论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记 论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记](/default/index/img?u=L2RlZmF1bHQvaW5kZXgvaW1nP3U9YUhSMGNITTZMeTl3YVdGdWMyaGxiaTVqYjIwdmFXMWhaMlZ6THpRM05pODNZVEkzWldZMU16Sm1Nakl3TldWak1tTTFObVl3WldVNE9XTTRZVFJrTkM1d2JtYz0=)
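In the figure above, (a) and (b) show how the objective function value, with error bars, changes with K on the two datasets. For K below 15, the new fsk-means method reaches the same objective value as traditional K-Means; for K above 15, fsk-means outperforms traditional K-Means on the objective.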
![论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记 论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记](/default/index/img?u=L2RlZmF1bHQvaW5kZXgvaW1nP3U9YUhSMGNITTZMeTl3YVdGdWMyaGxiaTVqYjIwdmFXMWhaMlZ6THpNeE9DOWpORGd4TkdFNU5XTTNObVJrTnpkbE1UQmxOR05tT0RneE9ESTNZemxsTmk1d2JtYz0=)
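In the figure above, (c) and (d) show how the variance of the cluster sizes changes with K on the two datasets. For K above 15, the new fsk-means method yields a smaller cluster-size variance.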
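To make the balance measure concrete, here is a minimal sketch of how a cluster-size variance could be computed from a list of cluster labels. The function names `cluster_sizes` and `size_variance` are hypothetical, and using the population variance around the ideal size n/K is an assumption; the paper may normalize this quantity differently.

```python
from collections import Counter


def cluster_sizes(labels, k):
    """Return the number of points assigned to each of the k clusters."""
    counts = Counter(labels)
    # Clusters that received no points get size 0.
    return [counts.get(c, 0) for c in range(k)]


def size_variance(labels, k):
    """Population variance of cluster sizes around the ideal size n / k.

    Assumption: plain (unnormalized) population variance; this is an
    illustrative choice, not necessarily the paper's exact definition.
    """
    sizes = cluster_sizes(labels, k)
    mean = len(labels) / k
    return sum((s - mean) ** 2 for s in sizes) / k


# Toy labelings of six points into three clusters.
balanced = [0, 0, 1, 1, 2, 2]
skewed = [0, 0, 0, 0, 1, 2]

print(size_variance(balanced, 3))
print(size_variance(skewed, 3))
```

A perfectly balanced clustering gives a variance of zero, and the value grows as points concentrate in a few clusters.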
![论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记 论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记](/default/index/img?u=L2RlZmF1bHQvaW5kZXgvaW1nP3U9YUhSMGNITTZMeTl3YVdGdWMyaGxiaTVqYjIwdmFXMWhaMlZ6THpNek9DODVOVFF5WXpkbU16QXlOall6TTJaaE1UYzJORFU0WkdJek1UaG1aR0ptWVM1d2JtYz0=)
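In the figure above, (e) and (f) show the ratio of the smallest cluster size to the expected cluster size on the two datasets. For K above 15, traditional K-Means produces very small clusters and even empty clusters, whereas fsk-means keeps the cluster sizes fairly balanced and produces no empty clusters.
In addition, fsk-means has three variants: greedy fsk-means, normal fsk-means, and rippling fsk-means. Of these, greedy fsk-means still achieves good objective function values across the tested range of the relevant parameters.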
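The ratio plotted in (e) and (f) can be sketched as follows. This is an illustrative sketch only: `min_size_ratio` is a hypothetical name, and taking the expected cluster size to be n/K (a perfectly even split) is an assumption.

```python
def min_size_ratio(labels, k):
    """Ratio of the smallest cluster size to the expected size n / k.

    Assumption: the expected size is n / k. A value of 1.0 means
    perfectly balanced; 0.0 means at least one cluster is empty.
    """
    counts = [0] * k
    for c in labels:
        counts[c] += 1
    expected = len(labels) / k
    return min(counts) / expected


# Toy labelings of six points into three clusters.
print(min_size_ratio([0, 0, 1, 1, 2, 2], 3))  # balanced
print(min_size_ratio([0, 0, 0, 0, 1, 2], 3))  # one large cluster
print(min_size_ratio([0, 0, 0, 1, 1, 1], 3))  # cluster 2 is empty
```

A ratio near zero corresponds to the tiny or empty clusters reported for traditional K-Means at larger K.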
![论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记 论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记](/default/index/img?u=L2RlZmF1bHQvaW5kZXgvaW1nP3U9YUhSMGNITTZMeTl3YVdGdWMyaGxiaTVqYjIwdmFXMWhaMlZ6THpjME5pODFaR00xTVdSa01UTXlNR000T0RNeU1UazJNVFJqWVRsbU5UZzNPV0prWVM1d2JtYz0=)
![论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记 论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记](/default/index/img?u=L2RlZmF1bHQvaW5kZXgvaW1nP3U9YUhSMGNITTZMeTl3YVdGdWMyaGxiaTVqYjIwdmFXMWhaMlZ6THpJM0x6QmhaVE0xWkdZME5XUmxaakZpTlRZNFltRmtNREUyWkRNNU1EVmhOVFl6TG5CdVp3PT0=)
(a), (b), (c), and (d) show the objective function results of the three clustering variants on 1000, 2000, 5000, and 10000 records from the 20-newsgroups dataset, respectively.
![论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记 论文:Banerjee A, Ghosh J. On Scaling Up Balanced Clustering Algorithms.[C]笔记](/default/index/img?u=L2RlZmF1bHQvaW5kZXgvaW1nP3U9YUhSMGNITTZMeTl3YVdGdWMyaGxiaTVqYjIwdmFXMWhaMlZ6THpFNE5pOWxOekExTm1aaU5XUTROMll3TURNellqY3pOV1U0WldGak5UUXhaRGc1WVM1d2JtYz0=)
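(e) and (f) show the variance of the cluster sizes for the three variants with k = 20 and n = 2000 and n = 5000, respectively.
Strengths and weaknesses of the algorithm:
Strengths: the algorithm achieves balanced clustering over a wide range of parameters; the size of the sample drawn in the first step has no effect on the clustering quality; and on the Yahoo! dataset, whose natural clusters are highly imbalanced, fsk-means still achieves good balanced clustering.
Weaknesses: the clustering process is computationally expensive, and the algorithm applies to only a limited range of data types.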