Faster double iteration over a single array in Python: answers

【Question title】: Faster double iteration over a single array in Python
【Posted】: 2019-09-15 03:11:55
【Question description】:

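I want to find a faster way to compute pairwise accuracy, that is, to compare the elements of a single array with each other (in this case a pandas DataFrame column), compute their differences, and then compare the two results. I have a DataFrame df with 3 columns (id, the document ID; Judgement, the human assessment, an int; and PR_Score, the PageRank of the document, a float), and I want to check whether the two scores agree on ranking one document better or worse than another.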

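For example:

id: id1, id2, id3

Judgement: 1, 0, 0

PR_Score: 0.18, 0.5, 0.12

In this case the two scores agree that id1 is better than id3, they disagree on id1 versus id2, and id2 and id3 are tied in the human judgement, so my pairwise accuracy is:

agree = 1

disagree = 1

pairwise accuracy = agree / (agree + disagree) = 1/2 = 0.5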
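The arithmetic of this small example can be checked with a short brute-force sketch. The dictionaries `judgement` and `pr_score` are illustrative containers for the three sample documents, not part of the original code; tied pairs are skipped exactly as described above.

```python
from itertools import combinations

# Sample data from the example above (hypothetical containers for illustration).
judgement = {"id1": 1, "id2": 0, "id3": 0}
pr_score = {"id1": 0.18, "id2": 0.5, "id3": 0.12}

agree = 0
disagree = 0
for a, b in combinations(judgement, 2):
    human = judgement[a] - judgement[b]
    pr = pr_score[a] - pr_score[b]
    if human == 0 or pr == 0:
        continue  # tied pair: counted as neither agreement nor disagreement
    if (human > 0) == (pr > 0):
        agree += 1  # both scores prefer the same document
    else:
        disagree += 1  # the scores prefer different documents

pairwise_accuracy = agree / (agree + disagree)
print(agree, disagree, pairwise_accuracy)  # 1 1 0.5
```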

Here is the code of my first solution, where I use the columns of df as arrays (which helps reduce the computation time):

import numpy as np

def pairwise(agree, disagree):
    return agree / (agree + disagree)

def pairwise_computing_array(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(df['PR_Score']) 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):  
        for j in range(i+1, len(df)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue  # tie in PageRank score: skip the pair
            else:
                continue  # tie in human judgement: skip the pair

    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)

I tried a list comprehension to get a faster computation, but it is actually slower than the first solution:

def pairwise_computing_list_comprehension(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores = np.array(df['PR_Score'])

    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j] ) 
            for i in range(len(df)) for j in range(i+1, len(df)) 
                if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0 
                    and np.sign(humanScores[i] - humanScores[j])!=0)]

    agreement = sum(sign)
    disagreement = len(sign) -  agreement                             
    pairwise_accuracy = pairwise(agreement, disagreement)

    return(agreement, disagreement, pairwise_accuracy)

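I cannot run it on the whole dataset because it takes too long; ideally I would like the computation to finish in under 1 minute.

On my computer, a small subset of 1000 rows gives these timings:

Code 1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Code 2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)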

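【Question discussion】:

Your list comprehension approach is slower because it builds an unnecessary list that you then have to sum, and it also recomputes pagerankScores[i] - pagerankScores[j] several times. In any case, your first approach, applied to raw numeric numpy.ndarray objects with a tool like numba, will probably give an improvement, although you will still have quadratic time complexity because you are doing pairwise comparisons. Could you provide some sample data?
Fundamentally, iterating over numpy.ndarray objects with a Python for loop, especially with for i in range(len(df)-1), is going to be slow. Try your first approach with list objects, df['Judgement'].values.tolist(), and you will probably see a significant improvement, but you can do better with numba.
Thank you, I followed your suggestion and it worked. I used my first approach with lists instead of numpy arrays, and put the jit decorator from the numba package, @jit(nopython = True), before the function. The final solution takes only a few seconds on my whole dataset (58k rows).
If you use numba, use numpy arrays!
@kmario23 here you can find a sample of 100 rows. I don't know whether there is a better way to share a file; if there is, please let me know. By the way, I don't understand how you could do this comparison with just one loop.

Tags: python python-3.x pandas performance numpy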

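【Solution 1】:

You have numpy arrays, so why not use them directly? You can offload the work from Python to compiled C code (this is often faster, but not always):

First, reshape the vectors into 1xN matrices: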

humanScores = np.array(df['Judgement']).reshape((1, -1))
pagerankScores = np.array(df['PR_Score']).reshape((1, -1))

Then compute the differences; we are only interested in the sign:

humanDiff = (humanScores - humanScores.T).clip(-1,1)
pagerankDiff = (pagerankScores - pagerankScores.T).clip(-1,1)

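Here I assume the data are integers, so clip can only produce -1, 0, or 1 (this does not hold for float data, where clip leaves fractional differences unchanged). Then you can count: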

agree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff == pagerankDiff)).sum()
disagree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff != pagerankDiff)).sum()

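But the counts above are double counts, because entries (i,j) and (j,i) have exactly opposite signs in both humanDiff and pagerankDiff. You can restrict the sum to the upper triangular part of the square matrix: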

humanUpper = np.triu(humanDiff)        # zero out the lower triangle
pagerankUpper = np.triu(pagerankDiff)
agree = ((humanUpper != 0) &
         (pagerankUpper != 0) &
         (humanUpper == pagerankUpper)
        ).sum()
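Putting these steps together, a minimal sketch of the fully vectorized count might look as follows. It swaps clip for np.sign so that float scores also work, and uses a strict upper-triangle mask so each pair is counted once. The function name `pairwise_counts_vectorized` is illustrative. Note that this builds N x N matrices, so it is only practical for small N: at 58k rows a single float64 matrix would need tens of gigabytes.

```python
import numpy as np

def pairwise_counts_vectorized(human_scores, pagerank_scores):
    """Count agreeing and disagreeing pairs with broadcasting (O(N^2) memory)."""
    human = np.asarray(human_scores).reshape(1, -1)
    pagerank = np.asarray(pagerank_scores).reshape(1, -1)

    # Entry [i, j] holds the sign of score[j] - score[i].
    human_sign = np.sign(human - human.T)
    pagerank_sign = np.sign(pagerank - pagerank.T)

    # Strict upper triangle: every unordered pair appears exactly once.
    upper = np.triu(np.ones(human_sign.shape, dtype=bool), k=1)
    untied = upper & (human_sign != 0) & (pagerank_sign != 0)

    agree = int(np.count_nonzero(untied & (human_sign == pagerank_sign)))
    disagree = int(np.count_nonzero(untied & (human_sign != pagerank_sign)))
    return agree, disagree

# Sample data from the question.
counts = pairwise_counts_vectorized([1, 0, 0], [0.18, 0.5, 0.12])
print(counts)  # (1, 1)
```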

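【Discussion】:

Unfortunately, as you can see in the example, the PageRank score is a float, not an int. I tried your solution with a few changes, but it is still too slow (though faster than mine).
In that case you can try np.sign(humanScores - humanScores.T) to get -1, 0, or 1. But in the end you are checking N^2 pairs, and you cannot be faster than O(N^2) that way. One way to improve is to sort by score and then count how many items are out of order between humanScore and pagerankScore. That would be a change of algorithm.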
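The sorting idea in the last comment can be sketched as follows. This is not code from the thread: it is one possible reading of the suggestion, using merge-sort inversion counting (the same idea used to compute Kendall's tau in O(N log N)). The pairs are sorted by human score, the strict inversions in the PageRank sequence give the disagreeing pairs, and the tied pairs are subtracted by counting duplicates. The names `count_inversions`, `tied_pairs`, and `pairwise_counts_sorted` are hypothetical.

```python
from collections import Counter

def count_inversions(seq):
    """Return (sorted copy, number of pairs i < j with seq[i] > seq[j])."""
    if len(seq) < 2:
        return list(seq), 0
    mid = len(seq) // 2
    left, inv_left = count_inversions(seq[:mid])
    right, inv_right = count_inversions(seq[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # equal values are ties, not inversions
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inversions += len(left) - i  # all remaining left items are larger
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

def tied_pairs(values):
    """Number of unordered pairs that share the same value."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())

def pairwise_counts_sorted(human, pagerank):
    """Return (agree, disagree) over pairs untied in both scores."""
    n = len(human)
    # Sort by human score, breaking ties by PageRank, so that pairs tied in
    # the human score never appear as inversions.
    ordered = sorted(zip(human, pagerank))
    _, disagree = count_inversions([p for _, p in ordered])
    total = n * (n - 1) // 2
    # Inclusion-exclusion: pairs tied in both scores were subtracted twice.
    untied = (total - tied_pairs(human) - tied_pairs(pagerank)
              + tied_pairs(zip(human, pagerank)))
    return untied - disagree, disagree

print(pairwise_counts_sorted([1, 0, 0], [0.18, 0.5, 0.12]))  # (1, 1)
```

The inputs here are plain Python lists; with pandas columns, something like df['Judgement'].tolist() would be passed in instead.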

【Solution 2】:

Here is the code that runs in a reasonable time, thanks to @juanpa.arrivillaga's suggestion:

import numpy as np
from numba import jit

@jit(nopython = True)
def pairwise_computing(humanScores, pagerankScores):

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(humanScores)-1):  
        for j in range(i+1, len(humanScores)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue   
            else:
                continue
    pairwise_accuracy = agree/(agree+disagree)
    return(agree, disagree, total,  pairwise_accuracy)

This is the timing on my whole dataset (58k rows):

7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
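For reference, a compact variant of the same double loop, with a call on the sample data from the question, might look like this. It is a sketch under a few assumptions: the function name `pairwise_counts` is illustrative, the sign comparison is written with plain comparisons instead of np.sign, a guard is added for the case where every pair is tied, and a fallback decorator lets the code run (slowly) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):  # fallback: run as plain Python if numba is missing
        return func

@njit
def pairwise_counts(human, pagerank):
    n = len(human)
    agree = 0
    disagree = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            dh = human[i] - human[j]
            dp = pagerank[i] - pagerank[j]
            if dh != 0 and dp != 0:  # skip pairs tied in either score
                if (dh > 0) == (dp > 0):
                    agree += 1
                else:
                    disagree += 1
    untied = agree + disagree
    # Guard against division by zero when every pair is tied.
    accuracy = agree / untied if untied > 0 else np.nan
    return agree, disagree, n * (n - 1) // 2, accuracy

# Sample data from the question, passed as numpy arrays as the comments advise.
human = np.array([1, 0, 0], dtype=np.int64)
pagerank = np.array([0.18, 0.5, 0.12], dtype=np.float64)
result = pairwise_counts(human, pagerank)
print(result)  # (1, 1, 3, 0.5)
```

With a DataFrame, the arrays would come from something like df['Judgement'].to_numpy() and df['PR_Score'].to_numpy().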

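【Discussion】:

【Solution 3】:

We can get rid of the inner for loop by leveraging broadcasting, since index j always starts one position ahead of index i (i.e. we never look back). However, there is a small problem with computing agree/disagree in the following line: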

if np.sign(human) == np.sign(pr):

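which I don't know how to resolve. So I am only providing skeleton code here for you to tweak further and get working, since you understand the problem better. Here it is: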

def pairwise_computing_array(df):

    humanScores = df['Judgement'].values
    pagerankScores = df['PR_Score'].values 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):
        j = i+1
        human = humanScores[i] -  humanScores[j:]   #difference human judg
        human_mask = human != 0
        if np.sum(human_mask) > 0:  # check for at least one positive case
            pr = pagerankScores[i] -  pagerankScores[j:][human_mask]  #difference pagerank score
            pr_mask = pr !=0
            if np.sum(pr_mask) > 0:  # check for at least one positive case
                # TODO: issue arises here; how to resolve when (human.shape != pr.shape) ?
                # once this `if ... else` block is fixed, it's done
                if np.sign(human) == np.sign(pr):
                    agree += 1   #they agree in which of the two is better
                else:
                    disagree +=1   #they do not agree in which of the two is better
            else:
                continue
        else:
            continue
    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)
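One way the skeleton above could be completed is to keep both difference vectors at full length and combine boolean masks, so the shapes always match. The sketch below is an assumption about how to resolve the TODO, not the answer author's code, and the function name `pairwise_computing_rows` is illustrative. It still does O(N^2) work, but only needs O(N) extra memory per row.

```python
import numpy as np

def pairwise_computing_rows(human_scores, pagerank_scores):
    """One Python loop over i; the loop over j is replaced by array operations."""
    human_scores = np.asarray(human_scores)
    pagerank_scores = np.asarray(pagerank_scores)
    n = len(human_scores)
    agree = 0
    disagree = 0
    for i in range(n - 1):
        # Signs of the differences between item i and every later item.
        human = np.sign(human_scores[i] - human_scores[i + 1:])
        pr = np.sign(pagerank_scores[i] - pagerank_scores[i + 1:])
        untied = (human != 0) & (pr != 0)  # same shape as human and pr
        same = human == pr
        agree += int(np.count_nonzero(untied & same))
        disagree += int(np.count_nonzero(untied & ~same))
    total = n * (n - 1) // 2
    untied_total = agree + disagree
    accuracy = agree / untied_total if untied_total else float("nan")
    return agree, disagree, total, accuracy

# Sample data from the question.
print(pairwise_computing_rows([1, 0, 0], [0.18, 0.5, 0.12]))  # (1, 1, 3, 0.5)
```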

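【Discussion】:

I tried your solution: I put compare = lambda x, y: np.sign(x)==np.sign(y) outside the for loop, and inside it I added unTied = np.intersect1d(np.where(human!=0), np.where(pr!=0)) followed by results = compare(human[unTied], pr[unTied]). But it is slower than the other solutions, because it cannot use numba (numba does not support many numpy functions).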