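Preface
I have recently been taking the course "Big Data Technology and Algorithm". The instructor introduced several common data mining algorithms, and I plan to try implementing each of them. I am new to coding, so what I wrote is crude; I hope experienced readers can point out its shortcomings.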
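Code
Dataset: 100 randomly generated 2-D coordinate points.
The program randomly generates k (here 3) initial centers, computes the Euclidean distance from every point to every center, and assigns each point to its nearest center. Each center then moves to the mean of the points assigned to it. Iteration stops when the centers no longer change or after 20 iterations. The program returns the final center positions and saves a scatter plot for every iteration.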
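The steps described above (assign, update, stop) can also be sketched compactly with NumPy arrays. This is only an illustration, not the code used in this post: the name `kmeans_sketch` and the parameters `max_iter` and `seed` are hypothetical, and letting an empty cluster keep its previous center is an assumption of the sketch.

```python
import numpy as np

def kmeans_sketch(points, k=3, max_iter=20, seed=0):
    """Return (centers, labels, iterations) for an (n, 2) array of points."""
    rng = np.random.default_rng(seed)
    # Random initial centers in the unit square, as in the description above.
    centers = rng.random((k, points.shape[1]))
    labels = np.full(len(points), -1)
    for iteration in range(1, max_iter + 1):
        # Squared Euclidean distance from every point to every center: shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its nearest center.
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            # Assumption of this sketch: an empty cluster keeps its previous center.
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # Stop as soon as no center moved.
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return centers, labels, iteration

points = np.random.default_rng(1).random((100, 2))
centers, labels, iterations = kmeans_sketch(points)
print(centers)
print(iterations)
```

Broadcasting computes all point-to-center distances at once, which replaces the row-by-row loop; the stopping rule is the same "centers unchanged or 20 iterations" rule.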
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def setData():
    # Randomly generate 100 two-dimensional points
    x = np.random.rand(100)
    y = np.random.rand(100)
    loc = list(zip(x,y))
    dots = pd.DataFrame(loc,columns=['x','y'])
    # Index of the center each point belongs to; -1 means not yet assigned
    dots['tag'] = -1
    return dots
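def show(dataset,centroid,k):
    # Draw the points and the centers, then save the figure; k is the iteration number
    fig = plt.figure(k)
    cmark = ['b','g','r']
    # Points that are not yet assigned (tag -1) are drawn in gray
    colors = [cmark[t] if t >= 0 else 'gray' for t in dataset['tag']]
    plt.scatter(dataset['x'],dataset['y'],c=colors)
    plt.scatter(centroid['x'],centroid['y'],color='k')
    # Save the result
    fig.savefig('result'+str(k)+'.png')
    plt.close(fig)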
def length(x,y,cx,cy):
    # Squared Euclidean distance; the square root is skipped because it does not change which center is nearest
    return pow(x-cx,2) + pow(y-cy,2)
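def calcenter(dataset):
    # Recompute each of the 3 centers as the mean of the points assigned to it
    x = [0,0,0]
    y = [0,0,0]
    num = [0,0,0]
    for index, row in dataset.iterrows():
        i = int(row['tag'])
        x[i] += row['x']
        y[i] += row['y']
        num[i] += 1
    for i in range(3):
        # Skip empty clusters to avoid dividing by zero; such a center stays at (0, 0)
        if num[i] > 0:
            x[i] = x[i] / num[i]
            y[i] = y[i] / num[i]
    centroid = pd.DataFrame({'x':x,'y':y})
    return centroid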
def judge(old,new):
    # Compare the old and new centers: return 1 if nothing changed, 0 otherwise
    for index, row in old.iterrows():
        for i in ['x','y']:
            if row[i] != new[i][index]:
                return 0
    return 1
def kmeans(dataset,k):
    # k-means: k is the number of centers (the code below assumes k = 3).
    # Returns the center coordinates and saves a plot for every iteration.
    # Randomly initialize the centers
    centroid = pd.DataFrame(list(zip(np.random.rand(k),np.random.rand(k))),columns=['x','y'])
    print(centroid)
    # Placeholder for the previous centers
    old_centroid = pd.DataFrame({'x':[0,0,0],'y':[0,0,0]})
    # Squared distances from one point to each center
    dist = [0,0,0]
    time = 0
    # Save the initial state
    show(dataset,centroid,time)
    while(time<20 and judge(old_centroid,centroid)==0):
        time += 1
        old_centroid = centroid
        for index, row in dataset.iterrows():
            dist[0] = length(row['x'],row['y'],centroid['x'][0],centroid['y'][0])
            dist[1] = length(row['x'],row['y'],centroid['x'][1],centroid['y'][1])
            dist[2] = length(row['x'],row['y'],centroid['x'][2],centroid['y'][2])
            tmp = dist.index(min(dist))
            if(dataset['tag'][index] != tmp):
                # Use .loc here; the chained form dataset['tag'][index] = tmp
                # raises SettingWithCopyWarning
                dataset.loc[index,'tag'] = tmp
        centroid = calcenter(dataset)
        # Save the result of each iteration
        show(dataset,centroid,time)
    print(time)
    return centroid
def main():
    # Main program: run k-means and print the centers
    dataset = setData()
    print(kmeans(dataset,3))
if __name__=='__main__':
    main()
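As a side note, the center update done by `calcenter` can also be written with a pandas `groupby`, which avoids the explicit row loop. The sketch below is an alternative under stated assumptions, not part of the listing above: the function name `update_centers` is hypothetical, and it assumes that a cluster with no points should simply keep its old center.

```python
import pandas as pd

def update_centers(dots, old_centers):
    """Mean x/y per tag; clusters with no points keep their old center."""
    # One row per tag value that actually occurs in the data.
    means = dots.groupby('tag')[['x', 'y']].mean()
    # reindex adds NaN rows for centers that received no points
    # (and drops any rows for unassigned points tagged -1).
    means = means.reindex(old_centers.index)
    # fillna with a DataFrame fills the NaN cells from the old centers,
    # aligned on index and columns.
    return means.fillna(old_centers)

dots = pd.DataFrame({'x': [0.0, 2.0, 10.0],
                     'y': [0.0, 4.0, 10.0],
                     'tag': [0, 0, 2]})
old_centers = pd.DataFrame({'x': [9.0, 5.0, 9.0], 'y': [9.0, 6.0, 9.0]})
print(update_centers(dots, old_centers))
```

In this small example center 1 has no points, so it stays at its old position instead of causing a division by zero.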
Results
Dataset
x y tag
0 0.931254 0.050561 -1
1 0.273341 0.999935 -1
2 0.771235 0.542644 -1
3 0.418120 0.365542 -1
4 0.019799 0.273949 -1
5 0.339480 0.447113 -1
6 0.820338 0.144518 -1
7 0.700898 0.116995 -1
8 0.391173 0.742299 -1
9 0.072874 0.368073 -1
10 0.483073 0.769322 -1
11 0.644735 0.832955 -1
12 0.267050 0.853286 -1
13 0.204377 0.346626 -1
14 0.928878 0.589187 -1
15 0.886251 0.630300 -1
16 0.371003 0.183797 -1
17 0.482584 0.006381 -1
18 0.090043 0.377324 -1
19 0.106654 0.714467 -1
20 0.345479 0.806465 -1
21 0.712029 0.853716 -1
22 0.738493 0.149477 -1
23 0.528958 0.513719 -1
24 0.614079 0.632119 -1
25 0.248881 0.546304 -1
26 0.209416 0.909776 -1
27 0.692365 0.177673 -1
28 0.163461 0.536138 -1
29 0.839105 0.316571 -1
.. ... ... ...
70 0.608191 0.394031 -1
71 0.241051 0.613188 -1
72 0.358178 0.919382 -1
73 0.861288 0.131707 -1
74 0.112741 0.970181 -1
75 0.171044 0.891548 -1
76 0.034619 0.777317 -1
77 0.398640 0.928208 -1
78 0.004283 0.942054 -1
79 0.520483 0.973023 -1
80 0.981602 0.077871 -1
81 0.420147 0.204210 -1
82 0.912305 0.475094 -1
83 0.325896 0.467796 -1
84 0.445504 0.530625 -1
85 0.966360 0.595105 -1
86 0.609595 0.277007 -1
87 0.218131 0.597103 -1
88 0.428447 0.411958 -1
89 0.753665 0.663576 -1
90 0.409305 0.579387 -1
91 0.623487 0.624547 -1
92 0.898097 0.716681 -1
93 0.719279 0.419645 -1
94 0.399278 0.944386 -1
95 0.807850 0.507538 -1
96 0.454854 0.627525 -1
97 0.844272 0.803904 -1
98 0.213164 0.555063 -1
99 0.396978 0.807812 -1
Initial centers
x y
0 0.148365 0.726790
1 0.308363 0.211619
2 0.278807 0.835321
Number of iterations
7
Final centers
x y
0 0.226376 0.549994
1 0.671234 0.167405
2 0.689283 0.743575
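Problem
Assigning a tag with the chained form dataset['tag'][index] = tmp produces the warning below, and I have not fully figured out why yet. Any help from experienced readers is appreciated.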
E:\CityU\Courses\CS5488 Big Data Technology and Algorithm\python\kmeans.py:77: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
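A likely cause is the chained indexing in `dataset['tag'][index] = tmp`: pandas first evaluates `dataset['tag']` and then assigns into that intermediate object, and it cannot guarantee that the assignment reaches the original DataFrame. Doing the assignment in a single `.loc` call avoids the intermediate step. A minimal sketch with a made-up two-row frame (the data here is illustrative only):

```python
import pandas as pd

dots = pd.DataFrame({'x': [0.1, 0.9], 'y': [0.2, 0.8]})
dots['tag'] = -1

# Chained indexing, two separate steps, may trigger SettingWithCopyWarning:
#     dots['tag'][0] = 2
# Single-step, label-based assignment on the DataFrame itself:
dots.loc[0, 'tag'] = 2
print(dots)
```

Whether the chained form warns, or works at all, depends on the pandas version, so the `.loc` form is the safer one to rely on.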