Python: k-means clustering on multiple variables from a predetermined csv

【Question title】: Python: k-means clustering on multiple variables from a predetermined csv
【Posted】: 2018-11-28 23:54:13
【Question description】:

I'm working on a project for my thesis, and I'm stuck because I can't get k-means clustering to work on my dataset from the Spotify API.

artist_name track_popularity explicit artist_genres album_genres acousticness danceability energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence mapped_at

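My dataset has these variables, and I have to cluster on the variables from acousticness to valence (so 12 variables). How can I do that? I can do it with 2 or 3 variables, but not with four or more.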

from copy import deepcopy
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
from sklearn.cluster import KMeans
#importing Dataset
dataset = pd.read_csv('csvProva2.csv')
X = dataset.iloc[:, [10,11]].values #columns of interest

#Find the number of clusters
wcss = []

for i in range (1,16): #15 cluster
    kmeans = KMeans(n_clusters = i, init='k-means++', random_state=0) 
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plot.plot(range(1,16),wcss)
plot.title('Elbow Method')
plot.xlabel('Number of clusters')
plot.ylabel('wcss')
plot.show()

#KMeans clustering
kmeans= KMeans(n_clusters=4,init='k-means++', random_state=0)
y=kmeans.fit_predict(X)

plot.scatter(X[y == 0,0], X[y==0,1], s=25, c='red', label='Cluster 1')
plot.scatter(X[y == 1,0], X[y==1,1], s=25, c='blue', label='Cluster 2')
plot.scatter(X[y == 2,0], X[y==2,1], s=25, c='magenta', label='Cluster 3')
plot.scatter(X[y == 3,0], X[y==3,1], s=25, c='cyan', label='Cluster 4')

plot.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s=25, c='yellow', label='Centroid')
plot.title('KMeans Clustering')
plot.xlabel('Acousticness')
plot.ylabel('Danceability')
plot.legend()
plot.show()

This is my code for clustering with 2 variables.

【Comments】:

I solved it here: github.com/joaocarvalhoopen/…

Tags: python scikit-learn cluster-analysis k-means

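【Solution 1】:

K-means works fine on more than 3 variables.

But they need to be continuous variables: you cannot compute the mean of a categorical variable. Mixing variables with different scales (units) is also problematic, because features with a small numeric range are then largely ignored. Statistically, the result becomes meaningless: if you scale the data differently, you get a different result.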
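As a rough sketch of the points above (not the asker's actual data): the snippet below builds a small synthetic stand-in for the CSV, selects several continuous columns by name, standardizes them so that no single unit dominates the distance, and then fits `KMeans`. The column names follow the question; the synthetic values, the choice of six columns, and the use of `StandardScaler` are assumptions made for illustration. Columns such as `key`, `mode` and `time_signature` look categorical and are probably better left out.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV (assumption: the real file has these columns).
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "acousticness": rng.random(n),
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "loudness": rng.uniform(-60, 0, n),  # roughly dB, much wider range
    "tempo": rng.uniform(60, 200, n),    # roughly BPM, much wider range
    "valence": rng.random(n),
})

# Select as many continuous columns as needed; k-means is not limited to 2 or 3.
features = ["acousticness", "danceability", "energy",
            "loudness", "tempo", "valence"]
X = dataset[features].to_numpy()

# Put every feature on a comparable scale (zero mean, unit variance).
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(kmeans.cluster_centers_.shape)  # (4, 6): one centroid per cluster, one coordinate per feature
```

With more than two features the scatter plot from the question no longer shows the full picture, but the elbow loop over `inertia_` works unchanged on `X_scaled`.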

【Comments】:

What if some columns of the dataset have NaN values?
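The thread does not answer this, so the following is only one possible approach. scikit-learn's `KMeans` rejects input containing NaN, so the missing values have to be dealt with first, either by dropping incomplete rows or by imputing them. The tiny frame and the median strategy below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame with missing values (hypothetical data).
df = pd.DataFrame({
    "energy": [0.9, 0.8, np.nan, 0.2, 0.1, 0.15],
    "tempo": [180.0, 170.0, 175.0, np.nan, 70.0, 80.0],
})

# Option 1: drop every row that has at least one NaN.
complete = df.dropna()

# Option 2: keep all rows and fill each NaN with its column median.
imputed = SimpleImputer(strategy="median").fit_transform(df)

# Scale, then cluster the imputed data as usual.
X_scaled = StandardScaler().fit_transform(imputed)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

Dropping rows is the simpler choice when only a few tracks are affected; imputation keeps every track but introduces made-up values, which can pull clusters toward the column median.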