Implementing the k-means algorithm in Python

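Preface

I am currently taking the course "Big data Technology and Algorithm". The instructor introduced several common data mining algorithms, and I plan to try implementing each of them. I am new to programming and the code is rough, so I would appreciate it if more experienced readers pointed out its shortcomings.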

Code

Dataset: 100 randomly generated coordinate points

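Randomly generate k (here 3) initial centers, compute the Euclidean distance from every point to every center, and assign each point to its nearest center. Each center is then moved to the mean of the points assigned to it. Iteration stops when the centers no longer move or when the iteration count reaches 20. The program then returns the center positions, and it saves a scatter plot for every iteration.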
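To make the two steps concrete, here is a minimal sketch of a single assignment-and-update pass on four hand-picked points. The points, the starting centers, and the names `points`, `centers`, `labels` and `new_centers` are illustrative assumptions and are not part of the program below.

```python
import numpy as np

# Toy data (hypothetical): two points near the origin, two near (1, 1)
points = np.array([[0.0, 0.0], [0.0, 0.2], [1.0, 1.0], [0.8, 1.0]])
# Hypothetical starting centers
centers = np.array([[0.1, 0.0], [0.9, 0.9]])

# Assignment step: squared Euclidean distance from every point to every center,
# then pick the index of the nearest center for each point
sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
labels = sq_dist.argmin(axis=1)

# Update step: each center moves to the mean of the points assigned to it
new_centers = np.array([points[labels == j].mean(axis=0) for j in range(len(centers))])

print(labels)
print(new_centers)
```

Repeating these two steps until `new_centers` stops changing is the loop that the full program runs.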

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def setData():
	# Randomly generate 100 two-dimensional points
	x = np.random.rand(100)
	y = np.random.rand(100)
	loc = list(zip(x,y))
	dots = pd.DataFrame(loc,columns=['x','y'])
	# Index of the center each point belongs to; initialized to -1 (unassigned)
	dots['tag'] = -1
	return dots

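def show(dataset,centroid,k):
	# Draw and save the result image; k is the iteration number used in the file name
	fig = plt.figure(k)
	cmark = ['b','g','r']
	# Points still tagged -1 (unassigned) pick cmark[-1], i.e. red
	for i in range(len(dataset)):
		plt.scatter(dataset['x'][i],dataset['y'][i],color=cmark[dataset['tag'][i]])
	# Centers are drawn in black
	plt.scatter(centroid['x'],centroid['y'],color='k')
	# Save the result
	fig.savefig('result'+str(k)+'.png')
	# Close the figure so that open figures do not accumulate across iterations
	plt.close(fig)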

def length(x,y,cx,cy):
	# Squared Euclidean distance; the square root is skipped because it does not change which center is nearest
	return pow(x-cx,2) + pow(y-cy,2)

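def calcenter(dataset):
	# Compute each center as the mean of the points assigned to it (3 clusters assumed)
	x = [0,0,0]
	y = [0,0,0]
	num = [0,0,0]
	for index, row in dataset.iterrows():
		i = int(row['tag'])
		x[i] += row['x']
		y[i] += row['y']
		num[i] += 1
	for i in range(3):
		if num[i] == 0:
			# Empty cluster: re-seed at a random data point instead of dividing by zero
			j = np.random.randint(len(dataset))
			x[i] = dataset['x'].iloc[j]
			y[i] = dataset['y'].iloc[j]
		else:
			x[i] = x[i] / num[i]
			y[i] = y[i] / num[i]
	centroid = pd.DataFrame({'x':x,'y':y})
	return centroid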

def judge(old,new):
	# Check whether the centers moved: returns 0 if any coordinate differs, 1 if old and new are identical
	for index, row in old.iterrows():
		for i in ['x','y']:
			if row[i] != new[i][index]:
				return 0
	return 1


def kmeans(dataset,k):
	# k-means: k is the number of centers; returns the center coordinates and saves a plot for every iteration
	# Randomly initialize the centers
	centroid = pd.DataFrame(list(zip(np.random.rand(k),np.random.rand(k))),columns=['x','y'])
	print(centroid)
	# Placeholder for the previous centers, so that the first convergence check fails
	old_centroid = pd.DataFrame({'x':[0]*k,'y':[0]*k})
	# Squared distances from the current point to each center (named dist to avoid shadowing the built-in len)
	dist = [0]*k

	time = 0
	# Save the initial state
	show(dataset,centroid,time)
	while(time<20 and judge(old_centroid,centroid)==0):
		time += 1
		old_centroid = centroid
		for index, row in dataset.iterrows():
			for i in range(k):
				dist[i] = length(row['x'],row['y'],centroid['x'][i],centroid['y'][i])
			tmp = dist.index(min(dist))
			if(dataset['tag'][index] != tmp):
				# Assign through .loc; the chained form dataset['tag'][index] = tmp triggers
				# "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame"
				dataset.loc[index,'tag'] = tmp
		centroid = calcenter(dataset)
		# Save the result of each iteration
		show(dataset,centroid,time)
	print(time)
	return centroid

def main():
	# Main program: run k-means and print the final centers
	dataset = setData()
	print(kmeans(dataset,3))

if __name__=='__main__':
	main()
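The row-by-row loops above are easy to follow but slow on larger data. As a rough sketch only, the same idea can be written with NumPy broadcasting. The function name `kmeans_np`, its parameters, the choice to start from k data points, and the stop-when-labels-repeat criterion are my own assumptions and differ from the listing above.

```python
import numpy as np

def kmeans_np(points, k, max_iter=20, seed=0):
    """Hypothetical vectorized k-means; returns (centers, labels, iterations)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points rather than random coordinates
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for it in range(1, max_iter + 1):
        # Squared distances between all points and all centers, shape (n_points, k)
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        # If no point changed cluster, the centers would not move either
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of its points; keep the old center if the cluster is empty
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers, labels, it

points = np.random.default_rng(1).random((100, 2))
centers, labels, n_iter = kmeans_np(points, 3)
print(centers)
print(n_iter)
```

Seeding the generator makes a run repeatable, which is convenient when comparing plots between runs.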

Results

Dataset

           x         y  tag
0   0.931254  0.050561   -1
1   0.273341  0.999935   -1
2   0.771235  0.542644   -1
3   0.418120  0.365542   -1
4   0.019799  0.273949   -1
5   0.339480  0.447113   -1
6   0.820338  0.144518   -1
7   0.700898  0.116995   -1
8   0.391173  0.742299   -1
9   0.072874  0.368073   -1
10  0.483073  0.769322   -1
11  0.644735  0.832955   -1
12  0.267050  0.853286   -1
13  0.204377  0.346626   -1
14  0.928878  0.589187   -1
15  0.886251  0.630300   -1
16  0.371003  0.183797   -1
17  0.482584  0.006381   -1
18  0.090043  0.377324   -1
19  0.106654  0.714467   -1
20  0.345479  0.806465   -1
21  0.712029  0.853716   -1
22  0.738493  0.149477   -1
23  0.528958  0.513719   -1
24  0.614079  0.632119   -1
25  0.248881  0.546304   -1
26  0.209416  0.909776   -1
27  0.692365  0.177673   -1
28  0.163461  0.536138   -1
29  0.839105  0.316571   -1
..       ...       ...  ...
70  0.608191  0.394031   -1
71  0.241051  0.613188   -1
72  0.358178  0.919382   -1
73  0.861288  0.131707   -1
74  0.112741  0.970181   -1
75  0.171044  0.891548   -1
76  0.034619  0.777317   -1
77  0.398640  0.928208   -1
78  0.004283  0.942054   -1
79  0.520483  0.973023   -1
80  0.981602  0.077871   -1
81  0.420147  0.204210   -1
82  0.912305  0.475094   -1
83  0.325896  0.467796   -1
84  0.445504  0.530625   -1
85  0.966360  0.595105   -1
86  0.609595  0.277007   -1
87  0.218131  0.597103   -1
88  0.428447  0.411958   -1
89  0.753665  0.663576   -1
90  0.409305  0.579387   -1
91  0.623487  0.624547   -1
92  0.898097  0.716681   -1
93  0.719279  0.419645   -1
94  0.399278  0.944386   -1
95  0.807850  0.507538   -1
96  0.454854  0.627525   -1
97  0.844272  0.803904   -1
98  0.213164  0.555063   -1
99  0.396978  0.807812   -1

Initial centers

          x         y
0  0.148365  0.726790
1  0.308363  0.211619
2  0.278807  0.835321

Number of iterations

Final centers

          x         y
0  0.226376  0.549994
1  0.671234  0.167405
2  0.689283  0.743575
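Once the centers are known, a new point can be assigned to a cluster by finding its nearest center. The sketch below reuses the final centers printed above; the helper `nearest_center` and the query point are made up for illustration.

```python
import numpy as np

# Final centers copied from the output above
centers = np.array([[0.226376, 0.549994],
                    [0.671234, 0.167405],
                    [0.689283, 0.743575]])

def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return int(((centers - np.asarray(point)) ** 2).sum(axis=1).argmin())

# Hypothetical query point in the lower-right corner of the unit square
print(nearest_center((0.9, 0.1), centers))
```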

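Problem

Running the code with the chained assignment dataset['tag'][index] = tmp produces the warning below, which I do not fully understand yet. Any help from more experienced readers would be appreciated.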

E:\CityU\Courses\CS5488 Big Data Technology and Algorithm\python\kmeans.py:77: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame
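A likely explanation, offered as a sketch rather than a confirmed diagnosis: `dataset['tag'][index] = tmp` indexes twice, first selecting the column and then assigning into the result, and pandas cannot guarantee that the intermediate object is a view on the original frame. Passing the row label and the column name to a single `.loc` call avoids the ambiguity. The small frame `df` below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [0.1, 0.9], 'y': [0.2, 0.8]})
df['tag'] = -1

# Chained form (two indexing steps). It may emit the warning, and under
# pandas' copy-on-write behaviour it does not update df at all:
# df['tag'][0] = 2

# Single-step form: row label and column name go into one .loc call
df.loc[0, 'tag'] = 2
print(df)
```

With `.loc` the assignment always targets the original DataFrame, so the tag of row 0 becomes 2 and the other rows are left untouched.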