Finding closest point in array - inverse of KDTree

【Question title】: Finding closest point in array - inverse of KDTree
【Posted】: 2017-10-11 16:44:31
【Question】:

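I have a very large ndarray A, and a sorted list of points k (a small list, about 30 points).

For each element of A, I want to determine the closest element in the list of points k, together with its index. So something like: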

>>> A = np.asarray([3, 4, 5, 6])
>>> k = np.asarray([4.1, 3])
>>> values, indices
[3, 4.1, 4.1, 4.1], [1, 0, 0, 0]

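Now, the problem is that A is very, very large. So I can't do something inefficient like adding a dimension to A, taking the absolute difference with k, and then taking the minimum of each column.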
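For reference, the broadcasting approach described above might look like the following minimal sketch. It builds a temporary of shape (len(A), len(k)), which is exactly what makes it impractical for millions of elements; the variable name `dist` is only illustrative.

```python
import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3])

# Pairwise distances, shape (len(A), len(k)). This temporary is the
# memory cost that rules the approach out for very large A.
dist = np.abs(A[:, None] - k[None, :])

indices = dist.argmin(axis=1)  # index into k of the nearest point
values = k[indices]

print(values, indices)
```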

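At the moment I have been using np.searchsorted, as shown in the second answer here: Find nearest value in numpy array, but even that is too slow. This is the code I used (modified to work with multiple values):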

def find_nearest(A,k):

    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest==k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1
    flagToReduce = np.logical_or(flagToReduce,
                     np.abs(A-k[indicesClosest-1]) <
                     np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1
    valuesClosest = k[indicesClosest]
    return valuesClosest, indicesClosest

Then I thought of using scipy.spatial.KDTree:

>>> d = scipy.spatial.KDTree(k)
>>> d.query(A)

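This turned out to be much slower than the searchsorted solution.

On the other hand, the array A is always the same; only k changes. So it would be beneficial to build some auxiliary structure on A (like an "inverse KDTree") and then query it with the small array k.

Is there something like that?

Edit

At the moment I am using a variant of np.searchsorted that requires the array A to be sorted. We can do that in advance as a preprocessing step, but we still have to restore the original order after computing the indices. This variant is about twice as fast as the one above.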

A = np.random.random(3000000)
k = np.random.random(30)

indices_sort = np.argsort(A)
sortedA = A[indices_sort]

inv_indices_sort = np.argsort(indices_sort)
k.sort()


def find_nearest(sortedA, k):
    midpoints = k[:-1] + np.diff(k)/2
    idx_aux = np.searchsorted(sortedA, midpoints)
    idx = []
    count = 0
    final_indices = np.zeros(sortedA.shape, dtype=int)
    old_obj = None
    for obj in idx_aux:
        if obj != old_obj:
            idx.append((obj, count))
            old_obj = obj
        count += 1
    old_idx = 0
    for idx_A, idx_k in idx:
        final_indices[old_idx:idx_A] = idx_k
        old_idx = idx_A
    final_indices[old_idx:] = len(k)-1

    indicesClosest = final_indices[inv_indices_sort] #<- this takes 90% of the time
    return k[indicesClosest], indicesClosest

The line that takes so much time is the one that brings the indices back to the original order.
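One thing that may be worth timing here: restoring the original order does not strictly need the inverse permutation. The sketch below (small random data; the names `gathered`, `scattered` and `inv_linear` are illustrative) shows that scattering through indices_sort gives the same result as gathering through inv_indices_sort, and that the inverse itself can be built without a second argsort. Whether either is faster on a given machine is an assumption to verify, not a measured result.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random(1000)
k = np.sort(rng.random(30))

indices_sort = np.argsort(A)
sortedA = A[indices_sort]

# Nearest-k index for every element of sortedA (any method works here).
midpoints = (k[:-1] + k[1:]) / 2
final_indices = np.searchsorted(midpoints, sortedA, side="right")

# Gather through the inverse permutation, as in the code above.
inv_indices_sort = np.argsort(indices_sort)
gathered = final_indices[inv_indices_sort]

# Scatter through the forward permutation: no inverse needed.
scattered = np.empty_like(final_indices)
scattered[indices_sort] = final_indices

# The inverse can also be built in O(n) instead of with a second argsort.
inv_linear = np.empty_like(indices_sort)
inv_linear[indices_sort] = np.arange(indices_sort.size)

print(np.array_equal(gathered, scattered))
```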

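【Comments】:

You have multiple values. So, are you looping when using searchsorted? Can you show your searchsorted attempt? Or did you use this code - stackoverflow.com/a/26026189?
Please be more specific than "very very large". Give typical sizes for A.
@Divakar Yes, I used that code :) I will edit it in
@WarrenWeckesser I have about 20 arrays A, each with 5 million elements on average. Some are bigger, some smaller. I need to do this for each array A.
Don't think that's your attempt, as it won't work with multiple values in k.

Tags: python arrays algorithm numpy scipy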

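【Answer 1】:

Update:

The built-in function numpy.digitize can actually do exactly what you need. Only a small trick is required: digitize assigns values to bins. We can convert k into bins by sorting the array and setting the bin borders exactly in the middle between neighboring elements.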

import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3, 1])  # added another value to show that sorting/binning works

ki = np.argsort(k)
ks = k[ki]

i = np.digitize(A, (ks[:-1] + ks[1:]) / 2)

indices = ki[i]
values = ks[i]

print(values, indices)
# [ 3.   4.1  4.1  4.1] [1 0 0 0]
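As a side note on the binning step above: for increasing bin edges, np.digitize with its default right=False returns the same indices as np.searchsorted with side="right". A small check on the same data (the name `edges` is only illustrative):

```python
import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3, 1])

ki = np.argsort(k)
ks = k[ki]
edges = (ks[:-1] + ks[1:]) / 2  # bin borders halfway between neighbours

i_digitize = np.digitize(A, edges)
i_searchsorted = np.searchsorted(edges, A, side="right")

print(np.array_equal(i_digitize, i_searchsorted))
print(ks[i_digitize], ki[i_digitize])
```

So the two calls are interchangeable for this purpose, and any speed difference between them would have to be measured.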

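Old answer:

I would take a brute-force approach: perform one vectorized pass over A for each element in k, and update those locations where the current element improves the approximation.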

import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3])

err = np.zeros_like(A) + np.inf  # keep track of error over passes

values = np.empty_like(A, dtype=k.dtype)
indices = np.empty_like(A, dtype=int)

for i, v in enumerate(k):
    d = np.abs(A - v)
    mask = d < err  # only update where v is closer to A
    values[mask] = v
    indices[mask] = i
    err[mask] = d[mask]

print(values, indices)
# [ 3.   4.1  4.1  4.1] [1 0 0 0]

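This approach requires three temporary variables of the same size as A, so it will fail if not enough memory is available.

【Comments】:

Thanks for your answer! Unfortunately the brute-force solution is too slow. np.digitize is a nice idea, but I think it's no different from np.searchsorted, right? The interface is slightly different but the running speed is similar. Most likely the only way to improve on this is to use the fact that the array A never changes, only k does; so preprocess A somehow and convert it to a format in which the necessary computation is easier to perform
@Ant I think you are right. I was not familiar with searchsorted, so I did not notice the similarity. However, it may be worth trying digitize anyway. Sometimes very similar numpy functions show surprising differences in performance.

【Answer 2】:

So, after some work and an idea from the scipy mailing list, I think that in my case (with a constant A and a slowly varying k), the best way to do this is to use the following implementation.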

class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):

        '''
        use_k_optimization requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made with similar k.
        When this happens, we can cut the running time significantly by storing additional variables. If it won't be
        called with successive k, set the flag to False, as otherwise would just consume more memory for no
        good reason
        '''

        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization

        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0

        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices

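The idea is to sort the array A beforehand, then use searchsorted of A on the midpoints of k. This gives the same information as before, in that it tells us exactly which points of A are closer to which points of k. The method _create_indices_matrix creates the full array of indices from this information, and then we unsort it to recover the original order of A. To take advantage of a slowly varying k, we save the last indices and determine which indices we have to change; we then change only those. For slowly varying k, this gives much better performance (at a higher memory cost, however).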
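The midpoint idea can also be written without Python loops. The following is only a sketch, not the implementation used above: `nearest_indices_sorted` is a hypothetical helper that assumes both inputs are 1-D and sorted in ascending order, and it turns the searchsorted boundaries into runs of equal indices with np.repeat.

```python
import numpy as np

def nearest_indices_sorted(sorted_a, k_sorted):
    """Index into k_sorted of the nearest point, for each element of sorted_a.

    Hypothetical helper: assumes both inputs are 1-D and sorted ascending.
    """
    midpoints = (k_sorted[:-1] + k_sorted[1:]) / 2
    # bounds[j] is the first position in sorted_a that is >= midpoints[j]
    bounds = np.searchsorted(sorted_a, midpoints)
    # Length of each run of equal indices, one run per element of k_sorted.
    run_lengths = np.diff(np.concatenate(([0], bounds, [sorted_a.size])))
    return np.repeat(np.arange(k_sorted.size), run_lengths)

sorted_a = np.array([0.05, 0.2, 0.4, 0.45, 0.8, 0.95])
k_sorted = np.array([0.1, 0.5, 0.9])
idx = nearest_indices_sorted(sorted_a, k_sorted)
print(idx)  # [0 0 1 1 2 2]
```

The result is still in sorted-A order, so it has to be unsorted afterwards exactly as in the class above.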

For a random array A of 5 million elements and a k of about 30 elements, repeating the experiment 60 times, we get

Function search_sorted1; 15.72285795211792s
Function search_sorted2; 13.030786037445068s
Function query; 2.3306031227111816s <- the one with use_k_optimization = True
Function query; 4.81286096572876s   <- with use_k_optimization = False

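scipy.spatial.KDTree.query is so slow that I did not time it (it takes more than 1 minute, though). This is the code used for the timing; it also contains the implementations of search_sorted1 and 2.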

import numpy as np
import scipy
import scipy.spatial
import time


A = np.random.rand(10000*500) #5 million elements
k = np.random.rand(32)
k.sort()

#first attempt, detailed in the answer, too
def search_sorted1(A, k):
    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest == k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1

    flagToReduce = np.logical_or(flagToReduce,
                        np.abs(A-k[indicesClosest-1]) <
                        np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1

    return indicesClosest

#taken from @Divakar answer linked in the comments under the question
def search_sorted2(A, k):
    indicesClosest = np.searchsorted(k, A, side="left").clip(max=k.size - 1)
    mask = (indicesClosest > 0) & \
           ((indicesClosest == len(k)) | (np.fabs(A - k[indicesClosest - 1]) < np.fabs(A - k[indicesClosest])))
    indicesClosest = indicesClosest - mask

    return indicesClosest
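
# cKDTree expects 2-D input of shape (n_points, n_dims), so the 1-D arrays are reshaped
def kdquery1(A, k):
    d = scipy.spatial.cKDTree(k[:, None], compact_nodes=False, balanced_tree=False)
    _, indices = d.query(A[:, None])
    return indices

#After an idea on scipy mailing list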
class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):

        '''
        Using this requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made with similar k.
        When this happens, we can cut the running time significantly by storing additional variables. If it won't be
        called with successive k, set the flag to False, as otherwise would just consume more memory for no
        good reason
        '''

        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization

        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0

        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices

mySearchSorted = SearchSorted(A, use_k_optimization=True)
mySearchSorted2 = SearchSorted(A, use_k_optimization=False)
allFunctions = [search_sorted1, search_sorted2,
                mySearchSorted.query,
                mySearchSorted2.query]

# sanity checks: every function already returns the index array, so compare them directly
print(np.array_equal(mySearchSorted.query(k), kdquery1(A, k)))
print(np.array_equal(mySearchSorted.query(k), search_sorted2(A, k)))
print(np.array_equal(mySearchSorted2.query(k), search_sorted2(A, k)))

if __name__== '__main__':
    num_to_average = 3
    for func in allFunctions:
        if func.__name__ == 'search_sorted3':
            indices_sort = np.argsort(A)
            sA = A[indices_sort].copy()
            inv_indices_sort = np.argsort(indices_sort)
        else:
            sA = A.copy()
        if func.__name__ != 'query':
            func_to_use = lambda x: func(sA, x)
        else:
            func_to_use = func
        k_to_use = k
        start_time = time.time()
        for idx_average in range(num_to_average):
            for idx_repeat in range(10):
                k_to_use += (2*np.random.rand(*k.shape)-1)/100 #uniform between (-1/100, 1/100)
                k_to_use.sort()
                indices = func_to_use(k_to_use)
                if func.__name__ == 'search_sorted3':
                    indices = indices[inv_indices_sort]
                val = k[indices]

        end_time = time.time()
        total_time = end_time-start_time

        print('Function {}; {}s'.format(func.__name__, total_time))

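I'm sure that this can still be improved (I use a lot of space for the SearchSorted class, so we could probably save something). If you have any ideas for an improvement, please let me know!

【Comments】:

The commented code looks good, including the doc strings (yes, I noticed the one on the constructor). Revamped/retargeted some, this would make a good question for Code Review. Tag it performance, numpy, and maybe scipy. Use the "python" tag at your own risk. Are there any more observations from profiling? How does the effect of use_k_optimization vary "with delta"?
(You can alleviate some memory pressure by specifying int8 for previous_indices_results. To my eyes, using self.previous_indices_results[mask] = new_indices_unsorted[self.inv_indices_sort[mask]] would improve readability, and indices_sort would not need to be an instance attribute any more.) (You hit a sweet spot with (2*np.random.rand(*k.shape)-1)/100 - …/20 doesn't give much of a speed-up, and …/500 is not much of an improvement either.)
@greybeard Thanks for the suggestions! But will int8 be enough to store the whole vector, given that it has more than 256 elements? I could probably use int32 instead of 64, though. As for the mask, I originally wrote it like that, but then I thought it was wrong, because previous_indices is in the original order of A, while the mask is computed on the sorted array A. So if I apply the mask to previous_indices, I will get the wrong elements, right? Interesting that ../100 seems to be optimal; thankfully that is my typical behaviour :)
"will int8 be enough to [index k], given that [the whole vector] has more than 256 elements?" I hope so: the example's k (about 30 points) could be indexed by a "uint5". "I thought that it was wrong" - interestingly, I had to convince myself that using "the forward permutation" on the left side was right, and equivalent to using "the inverse permutation" on the right.
("I had to convince myself that using "the forward permutation" on the left side was right" - I am still equally uneasy about both.)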
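The two points raised in these comments can be checked on small data. The sketch below uses made-up index arrays (`old_sorted` and `new_sorted` are hypothetical stand-ins for two successive query results in sorted-A order) and assumes that, for the "inverse permutation on the right" form, the mask is first moved to the original order. It also shows that indices into a k of about 30 points survive a cast to int8.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_k = 500, 30
A = rng.random(n)
indices_sort = np.argsort(A)
inv_indices_sort = np.argsort(indices_sort)

# Hypothetical nearest-k indices for two successive queries, in sorted-A order.
old_sorted = np.sort(rng.integers(0, n_k, size=n))
new_sorted = np.sort(rng.integers(0, n_k, size=n))
mask_sorted = new_sorted != old_sorted

expected = new_sorted[inv_indices_sort]  # full unsort of the new result

# Forward permutation on the left (the form used in the query method).
forward = old_sorted[inv_indices_sort].copy()
forward[indices_sort[mask_sorted]] = new_sorted[mask_sorted]

# Inverse permutation on the right, with the mask moved to original order first.
mask_original = mask_sorted[inv_indices_sort]
inverse = old_sorted[inv_indices_sort].copy()
inverse[mask_original] = new_sorted[inv_indices_sort[mask_original]]

# Indices into about 30 points fit in int8: one byte per element of A.
compact = expected.astype(np.int8)

print(np.array_equal(forward, expected), np.array_equal(inverse, expected))
```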