Clustering data - Poor results, feature extraction: answers

【Question title】: Clustering data - Poor results, feature extraction
【Posted】: 2020-07-18 13:45:35
【Question description】:

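I have measured data (vibration) from a wind turbine running under different operating conditions. My dataset contains the operating conditions as well as the features I extracted from the measured data.

Dataset shape: (423, 15). Each of the 423 data points represents one day of measurements, in chronological order over 423 days.

I now want to cluster the data to see whether there is any change in the measurements. Specifically, I want to check whether the vibration changes over time (which could indicate a fault in the turbine gearbox).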

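What I have done so far:

Scaled the data between 0 and 1 ->
Performed PCA (reduced from 15 to 5 dimensions)
Clustered using DBSCAN, because I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) for DBSCAN: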

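# k-distance plot for choosing the DBSCAN epsilon (eps)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X_pca = principalDf.values  # principalDf: the PCA output from the previous step
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
# column 0 is each point's distance to itself (0); column 1 is its nearest neighbor
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances, color="#0F215A")
plt.grid(True)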
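
For reference, the three steps listed above (scaling, PCA, DBSCAN) can be sketched end to end roughly as follows. This is only an illustration under assumptions: the random array `X` is a placeholder for the real (423, 15) feature table, and the `eps` and `min_samples` values are arbitrary, not the ones actually used; `eps` would normally be read off the elbow of the k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for the real (423, 15) feature table
rng = np.random.default_rng(0)
X = rng.normal(size=(423, 15))

# 1) scale every feature to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

# 2) reduce the 15 features to 5 principal components
X_pca = PCA(n_components=5).fit_transform(X_scaled)

# 3) cluster with DBSCAN; eps and min_samples are placeholder values
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_pca)

# DBSCAN marks noise points with the label -1, so exclude it from the count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```

With DBSCAN the number of clusters is an outcome of `eps` and `min_samples`, so it is worth checking how many points end up labeled as noise before interpreting the clusters.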

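The results so far do not clearly indicate that the data changes over time.

Of course, it may simply be that the data does not change over these data points. But what other approaches could I try? It is a somewhat open-ended question, but I am running out of ideas.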
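
Since the rows are in chronological order, one simple way to read any clustering result against time is to look at which days each cluster covers. The sketch below is illustrative only: the helper `summarize_clusters_over_time` is a hypothetical name, and the `labels` array is made-up placeholder data, not a result from the turbine measurements.

```python
import numpy as np
import pandas as pd

def summarize_clusters_over_time(labels):
    """Per-cluster point count plus first and last day index.

    Assumes row i of the data corresponds to day i (chronological order).
    """
    df = pd.DataFrame({"day": np.arange(len(labels)), "cluster": labels})
    return df.groupby("cluster")["day"].agg(["count", "min", "max"])

# Placeholder labels: a switch to a second cluster after day 300,
# plus one noise point (-1, as DBSCAN would label it)
labels = np.array([0] * 300 + [1] * 122 + [-1])
summary = summarize_clusters_over_time(labels)
print(summary)
```

If the clusters occupy separate, contiguous day ranges, that would suggest a change over time; if every cluster is spread across the whole period, the clustering is more likely picking up operating conditions than drift.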

【Question discussion】:

Tags: python cluster-analysis signal-processing unsupervised-learning

【Solution 1】:

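First of all, with KMeans, if the dataset has no natural partitions you can end up with some very strange results! Because KMeans is unsupervised, you basically feed it all of your numeric feature variables and let the machine do the work; a target variable, if you have one, is only used afterwards to check the clusters. Here is a simple example using the canonical Iris dataset. You can easily modify it to fit your specific dataset: just change the 'X' variable (all variables except the target) and the 'y' variable (the single target variable). Give it a try and report back.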

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")


from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4]  # take all four features (sepal/petal length and width)
y = iris.target



from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)

X_scaled.sample(5)


# try clustering on the 4d data and see if we can reproduce the actual clusters.

# i.e. imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. we could set an arbitrary number of clusters
# and try dividing them up into similar clusters.

# we happen to know there are 3 species, so let's find 3 clusters and see
# if the prediction for each point matches the label in y.

from sklearn.cluster import KMeans

nclusters = 3 # this is the k in kmeans
seed = 0

km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)

# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans


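# use seaborn to make scatter plot showing species for each sample
# (`data` is built here from iris; `height` replaces the old `size` argument)
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
data = pd.DataFrame(iris.data, columns=cols).assign(species=iris.target_names[iris.target])
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend();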
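
The comments in the example mention checking whether the predicted clusters match the labels in y, but the code stops before doing so. A minimal, self-contained sketch of that check could look like the following. The contingency table and the adjusted Rand index are my choice of comparison, not something the original answer specifies, and `n_init=10` is set explicitly only to keep the behavior stable across scikit-learn versions.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

km = KMeans(n_clusters=3, n_init=10, random_state=0)
y_cluster_kmeans = km.fit_predict(X_scaled)

# Cluster ids are arbitrary, so compare with a contingency table
# instead of testing y == y_cluster_kmeans directly
table = pd.crosstab(
    pd.Series(iris.target_names[y], name="species"),
    pd.Series(y_cluster_kmeans, name="cluster"),
)
print(table)

# Adjusted Rand index: 1.0 for identical partitions, close to 0 for random labels
ari = adjusted_rand_score(y, y_cluster_kmeans)
print("adjusted Rand index:", round(ari, 3))
```

This kind of check only works when true labels exist. For the turbine data in the question there is no such target, so the clusters would have to be judged by other means, for example how they are distributed over time.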

【Discussion】: