【Posted】: 2020-08-22 00:07:16
【Question】:
I built a nearest-neighbors model with scikit-learn. When I call the kneighbors method right after fitting the model, the clusters look fine.
import numpy as np
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(n_jobs=-1, n_neighbors=5).fit(np.array(df))
distance, indices = model.kneighbors(np.array(df)) ## one of the distances is always 0, as expected. And clusters are acceptable.
But when I save the model, load it back, and query it with the training data, the output is unacceptable.
model = pickle.load(f)
distance, indices = model.kneighbors(np.array(df)) ## same dataset, average/bad results. None of distances are 0.
Worse, the biggest problem is that the indices and distances change depending on the size of df.
model = pickle.load(f)
df_1 = df[df["id"] == "1"] # Trying for just one user
distance, indices = model.kneighbors(np.array(df_1)) ## one row, same output for every user.
df_2 = df[df["id"] == "2"]
distance, indices = model.kneighbors(np.array(df_2)) ## same output
df = df[(df["id"] == "2") | (df["id"] == "1")]  # parentheses needed: | binds tighter than ==
distance, indices = model.kneighbors(np.array(df)) ## different output for both
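A minimal sketch of what should happen here, assuming synthetic random data in place of the asker's df (the array `X`, the seed, and the row positions 10 and 20 are made up for illustration): `kneighbors` looks up each query row independently against the fitted data, so a row should get the same neighbors whether it is queried alone or as part of a larger batch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the training data: 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.random((200, 3))

model = NearestNeighbors(n_neighbors=5).fit(X)

# Query the whole training set, then only two rows of it.
dist_all, idx_all = model.kneighbors(X)
dist_sub, idx_sub = model.kneighbors(X[[10, 20]])

# The answers for rows 10 and 20 should not depend on the batch they were sent in.
same_indices = np.array_equal(idx_all[[10, 20]], idx_sub)
same_distances = np.allclose(dist_all[[10, 20]], dist_sub)
print(same_indices, same_distances)
```

If the output does change with the size of the query frame, one thing worth checking is that the array passed to kneighbors has the same columns, in the same order, as the array passed to fit.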
The training/test dataset looks like this:
feature1 | feature2 | feature3
0 | 1 | 1
1 | 1 | 1
0 | 0 | 0
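As a small worked example on just the three sample rows shown above (a sketch only; `n_neighbors=2` is an assumption made because there are only three rows, not the value used in the question), each row should come back as its own nearest neighbor at distance 0, followed by its closest other row:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# The three sample rows from the table above.
X = np.array([[0, 1, 1],
              [1, 1, 1],
              [0, 0, 0]])

# n_neighbors=2 is chosen here only because there are just three rows.
model = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = model.kneighbors(X)

print(indices)    # each row lists itself first, then its closest other row
print(distances)  # first column is the zero distance from each row to itself
```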
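If a saved model cannot be used once it is given a different dataset, why train and save the model at all? Is this the expected behavior of the model, or am I missing something?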
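【Comments】:
- NearestNeighbors only evaluates the distances between the samples in df, so you should get the same results. Could you add a sample df and your save/load code so that we have a minimal reproducible example?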
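A sketch of the kind of minimal reproducible example asked for here, under stated assumptions: the frame is random synthetic data, the file name `nn_model.pkl` is hypothetical, and the model is saved with plain `pickle` to a temporary directory. It is not the asker's actual save/load code, which is not shown in the question.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical sample frame. Continuous values are used so that no two rows
# are identical, which keeps the neighbor order free of ties.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((100, 3)),
                  columns=["feature1", "feature2", "feature3"])

model = NearestNeighbors(n_neighbors=5).fit(df.to_numpy())
dist_before, idx_before = model.kneighbors(df.to_numpy())

# Save to a temporary file and load it back, opening the file in binary mode.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nn_model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

dist_after, idx_after = loaded.kneighbors(df.to_numpy())

# Compare the results from before and after the round trip.
same_idx = np.array_equal(idx_before, idx_after)
same_dist = np.allclose(dist_before, dist_after)
self_is_nearest = np.allclose(dist_after[:, 0], 0.0)
print(same_idx, same_dist, self_is_nearest)
```

If a script like this gives matching results while the real pipeline does not, the difference is more likely to be in the data passed to kneighbors after loading than in the pickling step itself.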
Tags: python machine-learning scikit-learn data-science nearest-neighbor