Clustering with custom distance metric in sklearn: answers

【Question title】: Clustering with custom distance metric in sklearn
【Posted】: 2020-01-21 03:38:44
【Question】:

I am trying to implement a custom distance metric for clustering. The code snippet looks like this:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)


vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)

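vectorized_text is a one-hot encoded feature matrix of size n_sample x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other is still a one-hot vector. I would expect both x and y to be one-hot vectors. As a result, custom_metric returns wrong results at run time, and the clustering is incorrect.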

Example of x and y inside the distance(x, y) method:

x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]

Both should be one-hot vectors.

Does anyone have an idea how to solve this?

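【Comments】:

I think you need to include the code of custom_metric..
@PV8: Added. Please check.
As you can see in my answer, it is working. Could you print x, y before running the function...
x and y are inputs to the function; how can they "turn into" anything?
@desertnaut Please try running the code; you should be able to reproduce the error.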

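Tags: python python-3.x scikit-learn cluster-analysis

【Answer 1】:

First of all, your distance is wrong.

A distance must return small values for similar vectors. You defined a similarity, not a distance.

Secondly, naive Python code such as a zip loop will perform extremely poorly. Python just does not optimize such code well; it does all the work in the slow interpreter. Python is only fast when you vectorize everything. In fact, this code can be trivially vectorized, and then it likely does not even matter whether your input is binary or float data. What you are computing in a very complicated way is simply the dot product of the two vectors, isn't it?

So your distance should probably look something like this: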

def distance(x, y):
  return x.shape[0] - np.dot(x,y)

or whatever transformation to a distance you intend to use.
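To make the dot-product claim concrete, here is a minimal sketch that compares the question's loop with np.dot on 0/1 vectors. The names match_count_loop and match_count_dot are made up for this illustration; the vectors are the first two row patterns from the question.

```python
import numpy as np


def match_count_loop(x, y):
    """The question's loop: count positions where both entries equal 1."""
    match_count = 0.0
    for xi, yi in zip(x, y):
        if float(xi) == 1.0 and xi == yi:
            match_count += 1
    return match_count


def match_count_dot(x, y):
    """Vectorized equivalent for 0/1 vectors: the dot product."""
    return float(np.dot(x, y))


x = np.array([1, 0, 0, 1] * 100)
y = np.array([1, 1, 1, 0] * 100)

print(match_count_loop(x, y))  # 100.0
print(match_count_dot(x, y))   # 100.0
```

The two agree only on 0/1 input. On real-valued input the loop ignores every entry that is not exactly 1, while the dot product does not, so the results can differ there.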

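Now for your actual problem: my guess is that sklearn tries to accelerate your distance using a ball tree. That will not help much because of the poor performance of Python interpreter callbacks (in fact, you should probably precompute the entire distance matrix in one vectorized operation, something like dist = dim - X.dot(X.transpose()) for an n_samples x n_features X? Do the math yourself to figure out the equation). Other languages such as Java (e.g., the ELKI tool) extend much better this way, because the HotSpot JIT compiler can optimize and inline such calls everywhere.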
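A minimal sketch of the precomputed-matrix route, using DBSCAN's metric="precomputed" option. The choice of Jaccard distance and eps=0.1 are assumptions made for this illustration, standing in for whatever transformation you actually intend; Jaccard is used because it is 0 for identical rows and 1 for rows with no shared ones.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same toy data as in the question: 400 samples x 400 binary features.
X = np.stack([[1, 0, 0, 1] * 100,
              [1, 1, 1, 0] * 100,
              [0, 1, 1, 0] * 100,
              [0, 0, 0, 1] * 100] * 100)

# Shared-ones counts for all pairs at once (rows of X are samples).
shared = X @ X.T
ones_per_row = X.sum(axis=1)
# Number of positions set in either row of each pair.
# This is > 0 here because no row is all zeros.
union = ones_per_row[:, None] + ones_per_row[None, :] - shared

# Jaccard distance (an assumed choice, not the asker's metric).
dist = 1.0 - shared / union

# eps=0.1 is an arbitrary illustrative threshold.
labels = DBSCAN(eps=0.1, min_samples=2, metric="precomputed").fit_predict(dist)
print(sorted(set(labels.tolist())))  # [0, 1, 2, 3]
```

Because the metric is never called back from sklearn, the question of what x and y look like inside the callback does not arise. The price is the O(n²) memory for dist.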

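To test the hypothesis that the sklearn ball tree is the cause of the odd values you observe, try setting algorithm="brute" (see the documentation) to disable the ball tree. In the end, though, you will want to either precompute the entire distance matrix (if you can afford the O(n²) cost) or switch to a different programming language (implementing the distance in Cython, for example, helps, but you will likely still see the data suddenly being numpy float arrays).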
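One way to run that test is to wrap the metric so it records whether it was ever handed a vector that is not purely 0/1. This is a sketch under assumptions: count_non_binary_inputs and recording_metric are hypothetical helper names, the data is a shrunken copy of the question's matrix, and the resulting cluster labels are not the point here.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def count_non_binary_inputs(X, algorithm):
    """Run DBSCAN with a recording metric.

    Returns how many metric calls received a vector that was not purely 0/1.
    """
    non_binary_calls = 0

    def recording_metric(x, y):
        nonlocal non_binary_calls
        if not (np.isin(x, (0, 1)).all() and np.isin(y, (0, 1)).all()):
            non_binary_calls += 1
        # Distance from the answer above: n_features minus shared ones.
        return float(x.shape[0] - np.dot(x, y))

    DBSCAN(min_samples=2, metric=recording_metric, eps=3,
           algorithm=algorithm).fit(X)
    return non_binary_calls


# Smaller version of the question's data to keep the Python callbacks cheap.
X = np.stack([[1, 0, 0, 1] * 5,
              [1, 1, 1, 0] * 5,
              [0, 1, 1, 0] * 5,
              [0, 0, 0, 1] * 5] * 5)

print(count_non_binary_inputs(X, algorithm="brute"))
```

With algorithm="brute" the metric should only ever see rows of X. Calling the same helper with algorithm="ball_tree" gives a point of comparison; what it reports may depend on the sklearn version.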

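【Comments】:

Thanks for the comment, @Anony-Mousse. One question: 1. How do I know what counts as a small value and what doesn't? Is there a particular definition in the standard?
Identical vectors must return the smallest value, 0, by the definition of a distance... and maximally different vectors should have a large distance, much greater than 0. For example 1, or 1000. The point is that your function is completely off. For vectors that have nothing in common, you return a distance of 0. So it is not a distance.
You could use GeneralizedDBSCAN with such a similarity function, but that is not in sklearn. The only DBSCAN implementation I know of that supports similarities is in ELKI: SimilarityNeighborPredicate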
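The point in the comments about what a distance must return can be checked on two tiny vectors. In this sketch, shared_ones is a vectorized stand-in for the question's metric and answer_distance is the transformation from the answer; both names are made up for the illustration.

```python
import numpy as np


def shared_ones(x, y):
    """The question's metric, vectorized: positions where both entries are 1."""
    return float(np.dot(x, y))


def answer_distance(x, y):
    """The transformation sketched in the answer: n_features minus shared ones."""
    return float(x.shape[0] - np.dot(x, y))


a = np.array([1, 0, 0, 1])
disjoint = np.array([0, 1, 1, 0])  # shares no 1s with a

print(shared_ones(a, a), shared_ones(a, disjoint))          # 2.0 0.0
print(answer_distance(a, a), answer_distance(a, disjoint))  # 2.0 4.0
```

shared_ones puts the identical pair farther apart than the disjoint pair, which is backwards for a distance. answer_distance gets the ordering right, though it still does not return 0 for identical vectors unless they are all ones, so a further transformation may be wanted.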

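【Answer 2】:

I copied your code and did get your error. Let me explain it better here:

He has a vectorized_text variable (np.stack) that simulates a one-hot encoded feature set (containing only 0s and 1s). In the DBSCAN model he uses the custom_metric function to compute distances. One would expect that, when the model runs, the custom metric function receives pairs of observations as they are, i.e. one-hot encoded values. But when printing those values inside the distance function, only one of them is passed as is; the other appears to be a list of real values, as he describes in the question:

x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]

Anyway, when I pass a list to fit, the function gets the values as they are: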

from sklearn.cluster import KMeans, DBSCAN, MeanShift

x = [1, 0, 1]
y = [0, 0, 1]
feature_set = [x*5]*5
def distance(x, y):
    # Printing here the values. Should be 0s and 1s
    print(x, y)
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(feature_set)

Result:

[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]

I suggest you use a pandas DataFrame or some other type of values and see whether that works.

【Comments】:

【Answer 3】:

I don't understand your problem. If I have:

x = [1, 0, 1]
y = [0, 0, 1]

and I use:

def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

print(distance(x, y))
 1.0

If you now print x, y:

x
[1, 0, 1]
y
[0, 0, 1]

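So it works?

【Comments】:

That is exactly the problem: when this custom metric is used and distance is called within the clustering pipeline, the vectors x and y are not what is expected. A similar question: stackoverflow.com/questions/41863635/…
@user3480922 What do you mean by "the vectors x and y are not what is expected"? x and y are inputs to the function, and they are not defined by it; your question is completely unclear.
@desertnaut I understand your confusion, but if you try to reproduce the error, you may see the difference.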