Python - Optimizing combinations of 2 among N with N very large

【Question title】: Python - Optimizing combination of 2 among N with N very large
【Posted】: 2019-06-14 07:11:39
【Question description】:

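I am trying to find pairs of elements that satisfy a specific condition. More precisely, I want to form (unordered) combinations of 2 elements among 50,000 elements such that a certain condition holds.

My dataset contains 50,000 elements, each with a unique identifier and a few observables (a location and a cutoff). I want to form unordered pairs of 2 elements such that the distance between the two paired elements is below a given cutoff.

So far, my script is as follows.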

# Load the dataset (I have a custom function for it called loadFile)
df = loadFile(path_input,filename_input)

# Reset the index because I want to use the column "index" from 0 to 49,999
df = df.reset_index(drop=False)

# Initiate the list of pairs & get the number of elements
pairs = list()
nb_rows = df.shape[0]

# Loop over all the rows of my dataframe
for ind_x, x in df.iterrows():
    # Just print to know where we stand from 1 to 50,000
    print("{} out of {}".format(ind_x+1,nb_rows))
    # Loop over all the rows of my dataframe
    for ind_y, y in df.iterrows():
        # We only consider the y-row if it was not covered yet by the previous pairs
        # I also don't want to cover the case where both elements are equal
        if ind_x<ind_y:
            # Here is a custom condition (a simple function) returning a boolean
            if distance(x["location"],y["location"])<x["cutoff"]:
                pairs.append((x["id"],y["id"]))

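In practice, my script has to go through all 50,000 * 49,999 / 2 ≈ 1.25 billion possible pairs, and it would keep every one of them if my custom condition were always satisfied.

For a single "ind_x" element, the current loop takes about 5 seconds to run, which puts the whole script at 50,000 * 5 / (60²) ≈ 69 hours (which is a lot).

Is there any way to speed up my script, either by improving the loop itself or by changing my approach to save time?

Thank you in advance,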

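【Question comments】:

iterrows() is generally not recommended.
Hello Joseph, thank you for your answer. Do you suggest I use .itertuples instead?
medium.com/@rtjeannier/pandas-101-cont-9d061cb73bfc Take a look at this article.
If you can do a full merge of the two frames, then checking whether the condition is met is simple and much faster. However, the logic to include only the rows that match your loop (excluding pairs that were already matched) may be hard to implement. At the very least, that reduces it to a single loop over at most 50,000 groups, and I am guessing it could be done without even that.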
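
The "single loop" idea from the last comment could be sketched roughly as follows. This is only an illustration, not the commenter's code: the function name pairs_single_loop is hypothetical, and it assumes the locations are numeric coordinates and that distance() is the Euclidean distance. The inner Python loop is replaced by one NumPy computation per row.

```python
import numpy as np

def pairs_single_loop(locations, cutoffs, ids):
    """Hypothetical helper: one Python loop, vectorized inner comparison."""
    X = np.asarray(locations, dtype=float)
    if X.ndim == 1:
        # Treat a flat list of numbers as 1-D points
        X = X[:, np.newaxis]
    cutoffs = np.asarray(cutoffs, dtype=float)

    pairs = []
    for i in range(len(X) - 1):
        # Distances from row i to every later row (so each pair is seen once)
        d = np.linalg.norm(X[i + 1:] - X[i], axis=1)
        # Same rule as the original script: compare against row i's cutoff
        hits = np.nonzero(d < cutoffs[i])[0] + i + 1
        pairs.extend((ids[i], ids[j]) for j in hits)
    return pairs
```

This still performs on the order of N²/2 distance evaluations, so it only removes the Python-level overhead of the inner loop; it does not change the overall complexity.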

Tags: python-3.x pandas performance combinations large-data

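【Solution 1】:

This is just the classic problem of finding neighborhood sets. As long as your distance is Euclidean, there are plenty of specialized packages with fast methods for solving it, but a good option is to use scipy's cKDTree: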

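import numpy as np
from scipy.spatial import cKDTree

def close_point_pairs(X, max_d):
    # create the tree from the data points (X has shape (n_points, n_dims))
    tree = cKDTree(X)

    # for each point, list the indices of all points within max_d of it
    pairs = tree.query_ball_tree(tree, max_d)

    # expand to index pairs, keeping each unordered pair once (x < pt)
    return np.array([(x, pt) for x, ptlist in enumerate(pairs) for pt in ptlist if pt > x])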

This creates a numpy array containing all the index pairs. It is very fast; most of the run time is spent in the final pair expansion:

import pandas as pd

# 10,000 random 1-D locations in [0, 500)
df = pd.DataFrame(500*np.random.random(size=10**4), columns=['location'])
%timeit close_point_pairs(df['location'].values[:,np.newaxis], max_d=1.0)
530 ms ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
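
Since the pair expansion dominates the run time, one possible variation (not part of the original answer) is to let the tree produce the unique pairs itself with cKDTree.query_pairs, which returns each pair (i, j) with i < j. The function name close_point_pairs_direct is only illustrative, and no timing is claimed here; measure it on your own data.

```python
import numpy as np
from scipy.spatial import cKDTree

def close_point_pairs_direct(X, max_d):
    # X is expected to have shape (n_points, n_dims)
    tree = cKDTree(X)
    # Each unordered pair is returned once as a row (i, j) with i < j,
    # so no Python-level expansion or deduplication is needed
    return tree.query_pairs(r=max_d, output_type="ndarray")
```

The rows of the returned array come in no particular order, so sort them if you need to compare the result with the output of close_point_pairs.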

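Note that I had to add np.newaxis because these points are only 1-D. It is not clear what your location points are, but if they have more dimensions you should drop it.

If you need the unique ids from the original DataFrame, you can either index into it directly or build a translation dictionary.

【Comments】:

Thank you very much! I will look into it carefully!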
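
Mapping the index pairs back to ids, and applying the per-row cutoff from the question, might look like the sketch below. Everything here is an assumption made for illustration: the coordinates are taken to live in numeric columns passed as coord_cols, the column names "id" and "cutoff" follow the question's script, the distance is Euclidean, and pairs_with_ids is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def pairs_with_ids(df, coord_cols, id_col="id", cutoff_col="cutoff"):
    X = df[coord_cols].to_numpy(dtype=float)
    cutoff = df[cutoff_col].to_numpy(dtype=float)
    ids = df[id_col].to_numpy()

    tree = cKDTree(X)
    # Candidate pairs (i < j) within the largest cutoff in the dataset
    cand = tree.query_pairs(r=cutoff.max(), output_type="ndarray")
    i, j = cand[:, 0], cand[:, 1]

    # Keep only pairs closer than the cutoff of the first element,
    # which mirrors distance(x, y) < x["cutoff"] in the question
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    keep = dist < cutoff[i]

    # Translate positional indices back to the original ids
    return list(zip(ids[i][keep].tolist(), ids[j][keep].tolist()))
```

For example, pairs_with_ids(df, ["x", "y"]) would return a list of (id, id) tuples, where "x" and "y" stand in for whatever columns hold your coordinates. If the cutoffs vary a lot, the candidate set built from the largest cutoff can be much bigger than the final result.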