Most efficient way to find neighbors in a list: answers

【Question title】: Most efficient way to find neighbors in list
【Posted】: 2015-12-14 12:19:40
【Question】:

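I have a list of length 2016, but only 242 entries contain data; the rest are set to None. My goal is to interpolate between the values to fill all gaps, using a simple form of IDW (inverse distance weighting). So the tasks of my script are: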

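Iterate over all items of myList
If myList contains a value (i.e. not None), just copy it
If a None is found in myList, get the positions/values of the left and right neighbors by computing the distance to all items in myList
Compute the interpolated value for the gap from the two neighbors (the farther away they are, the less weight they get)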
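The weighting step in the last item can be sketched as a small helper. This is only an illustration of inverse-squared-distance weighting between two neighbors; the function name `two_neighbor_idw` and its parameters are hypothetical and not part of the original script.

```python
def two_neighbor_idw(left_value, right_value, left_dist, right_dist):
    """Blend two neighbor values using inverse-squared-distance weights.

    Hypothetical helper: distances are positive step counts to each neighbor.
    """
    w_left = 1.0 / left_dist ** 2
    w_right = 1.0 / right_dist ** 2
    # Normalize by the weight sum so the two weights add up to 1.
    return (left_value * w_left + right_value * w_right) / (w_left + w_right)


# A gap one step right of a neighbor with value 26 and three steps left of
# a neighbor with value 31 lands much closer to 26 (about 26.5).
print(two_neighbor_idw(26, 31, 1, 3))
```

With equal distances the result is simply the mean of the two neighbors, which is a quick sanity check for the weights.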

Suppose we have a smaller list with only 14 items (5 of them valid):

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]
resultList = [None] * len(myList)

for i in range(len(myList)):
    if myList[i] is not None:
        resultList[i] = myList[i]
    else:
        # signed distance from i to every valid entry (positive: left of i, negative: right of i)
        dist = [i - j for j in range(len(myList)) if myList[j] is not None]
        # nearest left neighbor (smallest positive), nearest right neighbor (largest negative)
        neighbors = min([n for n in dist if n>0]), max([n for n in dist if n<0])
        # rest of the interpolation (not important for my question):
        neighbors_c = [(1/float(n))**2 for n in neighbors]
        c_sum = sum(neighbors_c)
        neighbors_c = [n/c_sum for n in neighbors_c]
        resultList[i] = myList[i-neighbors[0]]*neighbors_c[0] + myList[i-neighbors[1]]*neighbors_c[1]

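I am doing this for many data sets. I found that this approach takes about 0.59 seconds per data set. What bothers me is that my list is already sorted, yet I only need 2 values from it, so 99% of the distances are computed for nothing. This led me to a second attempt: stop the iteration once i-j becomes negative, because at that point it has obviously reached the closest values.

So instead of the list comprehension:

dist = [i - j for j in range(len(myList)) if myList[j] is not None]

I wrote a proper for loop that exits once the distance passes zero and therefore starts growing again: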

dist = []
for j in range(len(myList)):
    if not myList[j] is None:
        dist.append(i-j)
        if i-j < 0: break

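With this approach I got it down to 0.38 seconds per data set. When iterating over all items of myList, the second approach is fast at the beginning (it hits an item after the 2nd, 3rd, 4th, ... iteration and exits immediately), but it brings no improvement for the last items, because the iteration always starts at j=0.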

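I wonder whether you can think of any faster way to find the two neighbors of a specific position in the data set, without having to check all distances and then pick only the largest negative and the smallest positive one.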
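One possible way to avoid scanning from j=0 every time is a binary search over the sorted positions of the valid entries, using the standard-library `bisect` module. The sketch below is an assumption about how that could look, not something proposed in the thread; `find_neighbors` and `valid_indices` are hypothetical names.

```python
from bisect import bisect_left

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]

# Positions of all valid entries; enumerate yields them already sorted.
valid_indices = [j for j, v in enumerate(myList) if v is not None]


def find_neighbors(valid_indices, i):
    """Return the positions of the nearest valid entries left and right of i.

    Assumes i is a gap position lying strictly between the first and the
    last valid position.
    """
    pos = bisect_left(valid_indices, i)  # index of the first valid position > i
    return valid_indices[pos - 1], valid_indices[pos]


print(find_neighbors(valid_indices, 5))  # (4, 7)
```

Each lookup then costs O(log n) instead of a scan over the whole list, and the index list is built only once per data set.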

Also, I am quite new to Python, so please let me know if you spot other un-pythonic expressions in my script. Thanks a lot, guys!

【Comments】:

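NumPy offers some nearest-neighbor algorithms; you could take a look at them.
There is also the pandas.Series.interpolate function, which does all of this.
What about the answers to the question "Inverse Distance Weighted (IDW) Interpolation with Python"?
Why not find the indices of all non-None values first, and then fill everything in with a second pass?
albert: Nearest-neighbor interpolation is something different. But I did stumble upon the KDTree function, which I am not able to call yet. So far I could not find an example that I understand, but if the other solutions don't do the job, I might keep looking.
pacholik: For some reason I cannot install the pandas toolkit. It seems to interfere with numpy.
ojdo: The classical IDW interpolation takes all other data points into account. I want to restrict it to the two neighbors. I call it IDW, but I'm afraid it is actually something different.
Mad: That's what I am doing, isn't it?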
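The two-pass idea from the comments (collect the non-None indices first, then fill the gaps) could be sketched as follows. It uses the inverse-squared-distance weights of the two nearest neighbors described in the question; the function name `fill_gaps` is hypothetical, and leading or trailing None entries are simply left untouched.

```python
def fill_gaps(values):
    """Fill None entries from their two nearest valid neighbors.

    Hypothetical sketch: weights are 1/distance**2, normalized to sum to 1.
    None entries before the first or after the last valid value stay None.
    """
    result = list(values)
    # First pass: positions of all valid entries.
    valid = [j for j, v in enumerate(values) if v is not None]
    # Second pass: walk consecutive pairs of valid positions and fill between them.
    for left, right in zip(valid, valid[1:]):
        for i in range(left + 1, right):
            w_left = 1.0 / (i - left) ** 2
            w_right = 1.0 / (right - i) ** 2
            result[i] = (values[left] * w_left + values[right] * w_right) / (w_left + w_right)
    return result


myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]
print(fill_gaps(myList))
```

Every list position is visited a constant number of times, so the total work grows linearly with the list length rather than with length times number of valid entries.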

Tags: python list python-2.7

【Solution 1】:

Update: here is how to do it with numpy's interp:

import numpy as np

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]

values = [(i, val) for i, val in enumerate(myList) if val is not None]

xp, fp = zip(*values)

print(xp) # (0, 4, 7, 9, 13)
print(fp) # (26, 31, 58, 42, 79)

result = np.interp(np.arange(len(myList)), xp, fp)
print(result) # [ 26.    27.25  28.5   29.75  31.    40.    49.    58.    50.    42.    51.25  60.5   69.75  79.  ]

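Original post:

As others have already suggested, you are better off using an interpolation that is already implemented in numpy or pandas.

But for the sake of completeness, here is a quick solution I came up with: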

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]

resultList = []

# first, split the list into sublists so that consecutive numbers
# and consecutive Nones each form their own group
for i, item in enumerate(myList):
    if i == 0:
        resultList.append([item])
    else:
        # compare "is None" status rather than type, so mixed int/float values still group together
        if (resultList[-1][-1] is None) == (item is None):
            resultList[-1].append(item)
        else:
            resultList.append([item])

print(resultList) # [[26], [None, None, None], [31], [None, None], [58], [None], [42], [None, None, None], [79]]

# now let's interpolate the sublists that contain Nones
for i, item in enumerate(resultList):
    if item[0] is not None:
        continue

    # this is a bit problematic, what do we do if we have a None at the beginning or at the end?
    if i == 0 or i + 1 == len(resultList):
        continue

    prev_item = resultList[i - 1][-1]
    next_item = resultList[i + 1][0]

    difference = next_item - prev_item
    item_length = len(item) + 1

    for j, none_item in enumerate(item):
        item[j] = prev_item + float(j + 1) / item_length * difference

# flatten the list back
resultList = [item for sublist in resultList for item in sublist]

print(resultList) # [26, 27.25, 28.5, 29.75, 31, 40.0, 49.0, 58, 50.0, 42, 51.25, 60.5, 69.75, 79]

I suggest you use this only for learning or for simple cases, because it does not handle lists that start or end with None.

【Comments】:

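Thanks for providing two answers! The interp tool seems to be an easy way to interpolate my data set, but it is only linear. I need an approach that takes the distance into account and uses squared weights. The classical IDW approach takes too long, which is why I wanted to implement my own idea. The other solution I need to study more closely. At first glance it does not look like it would be faster, but maybe I am missing something important. Don't worry about the first or last item being None; I make sure that never happens.
Right, with the second part you can implement the interpolation yourself, just edit the inner for loop :)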
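Since np.interp is linear only, a vectorized variant with squared weights might look like the sketch below. It relies on `np.searchsorted` to locate both neighbors of every gap at once. The function name `fill_gaps_np` is hypothetical, and the sketch assumes the list contains at least one valid value.

```python
import numpy as np


def fill_gaps_np(values):
    """Vectorized two-neighbor fill with inverse-squared-distance weights.

    Hypothetical sketch: returns a float array; gaps outside the range of
    valid positions are left as NaN.
    """
    arr = np.array([np.nan if v is None else v for v in values], dtype=float)
    valid = np.flatnonzero(~np.isnan(arr))
    gaps = np.flatnonzero(np.isnan(arr))
    # Only gaps strictly between the first and last valid position have two neighbors.
    gaps = gaps[(gaps > valid[0]) & (gaps < valid[-1])]
    pos = np.searchsorted(valid, gaps)  # index of the first valid position right of each gap
    left, right = valid[pos - 1], valid[pos]
    w_left = 1.0 / (gaps - left) ** 2
    w_right = 1.0 / (right - gaps) ** 2
    arr[gaps] = (arr[left] * w_left + arr[right] * w_right) / (w_left + w_right)
    return arr


myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]
print(fill_gaps_np(myList))
```

The neighbor search and the weighting both run inside numpy here, so no Python-level loop over the list positions is needed beyond building the initial array.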