在 Python 中跨两个列表/数组查找最近的项目答案

【问题标题】：finding nearest items across two lists/arrays in Python在 Python 中跨两个列表/数组查找最近的项目
【发布时间】：2013-02-28 02:17:25
【问题描述】：

我有两个包含浮点值的 numpy 数组 x 和 y。对于x 中的每个值，我想在y 中找到最接近的元素，而不重复使用y 中的元素。输出应该是从 x 到 y 的元素索引的 1-1 映射。这是一种依赖排序的坏方法。它从列表中删除每个配对的元素。不排序会很糟糕，因为配对将取决于原始输入数组的顺序。

def min_i(values):
    min_index, min_value = min(enumerate(values),
                               key=operator.itemgetter(1))
    return min_index, min_value

# unsorted elements
unsorted_x = randn(10)*10
unsorted_y = randn(10)*10

# sort lists
x = sort(unsorted_x)
y = sort(unsorted_y)

pairs = []
indx_to_search = range(len(y))

for x_indx, x_item in enumerate(x):
    if len(indx_to_search) == 0:
        print "ran out of items to match..."
        break
    # until match is found look for closest item
    possible_values = y[indx_to_search]
    nearest_indx, nearest_item = min_i(possible_values)
    orig_indx = indx_to_search[nearest_indx]
    # remove it
    indx_to_search.remove(orig_indx)
    pairs.append((x_indx, orig_indx))
print "paired items: "
for k,v in pairs:
    print x[k], " paired with ", y[v]

我更喜欢不先对元素进行排序，但如果它们已排序，那么我想获取原始未排序列表unsorted_x、unsorted_y 中的索引。在 numpy/scipy/Python 或使用 pandas 中执行此操作的最佳方法是什么？谢谢。

编辑：澄清一下，我并不是要找到所有元素的最佳拟合（例如，不是最小化距离总和），而是要找到每个元素的最佳拟合，如果是这样就可以了有时以牺牲其他元素为代价。我假设y 通常比x 大得多，与上面的示例相反，因此y 中x 的每个值通常都有很多非常好的拟合，我只想有效地找到那个。

有人可以为此展示 scipy kdtrees 的示例吗？文档很少

kdtree = scipy.spatial.cKDTree([x,y])
kdtree.query([-3]*10) # ?? unsure about what query takes as arg

【问题讨论】：

我认为用二分法查找索引的排序可能是你最好的选择。
@mgilton: scipy/numpy 中有内置的二分搜索算法吗？
是的：numpy.searchsorted
在我看来，您需要 np.sort、np.argsort 和 np.searchsorted 的组合来完成它。
@Jaime，不知道你的意思，你可以得到你用它查询的集合之外的点的 k 最近邻。 tree = KDTree(x[:,None]); tree.query(y[:,None], k=1) 为所有 y 找到最近的 x（基于二次范数，您可以更改它）。

标签： python numpy scipy pandas

【解决方案1】：

编辑 2 如果您可以选择一定数量的邻居来保证数组中的每个项目都有一个唯一的邻居，那么使用 KDTree 的解决方案可以执行得非常好。使用以下代码：

def nearest_neighbors_kd_tree(x, y, k) :
    x, y = map(np.asarray, (x, y))
    tree =scipy.spatial.cKDTree(y[:, None])    
    ordered_neighbors = tree.query(x[:, None], k)[1]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    nearest_neighbor.fill(-1)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors) :
        for k in neigh_j :
            if k not in used_y :
                nearest_neighbor[j] = k
                used_y.add(k)
                break
    return nearest_neighbor

和n=1000 点的样本，我得到：

In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True

In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False

所以最优是k=13，那么时机是：

In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop

但在最坏的情况下，您可能需要k=1000，然后：

In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop

这比其他选项慢：

In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop

In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop

编辑在搜索之前对数组进行排序对于超过 1000 个项目的数组是有益的：

def nearest_neighbors_sorted(x, y) :
    x, y = map(np.asarray, (x, y))
    y_idx = np.argsort(y)
    y = y[y_idx]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x) :
        idx = np.searchsorted(y, xj)
        if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx-1] :
            idx -= 1
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

使用 10000 个元素的长数组：

In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop

In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop

对于较小的数组，它的性能稍差。

如果只是为了丢弃重复项，您将不得不遍历所有项目以实现您的 greedy 最近邻算法。考虑到这一点，这是我能想到的最快的：

def nearest_neighbors(x, y) :
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x) :
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)

    return nearest_neighbor

现在有了：

n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)

我明白了：

In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop

【讨论】：

谢谢。有没有使用cKDTree 不重复的方法？即使性能受到轻微影响？
另一个问题：有没有办法确保p.argmin(np.abs(y - xj)) 会忽略像 NaN 这样的缺失值？有没有会选择这些的情况？
np.nanargmin 是你想要的。
这种方法也可以用于多维点吗？，因为我总是得到错误：缓冲区在“tree =scipy.spatial.cKDTree( y[:, 无])"