Efficient row-wise comparisons between two dataframes: answers

【Question title】: Efficient row-wise comparisons between two dataframes
【Posted】: 2019-01-02 09:46:20
【Question description】:

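I am comparing two dataframes row by row.

For each row in data, I want to check whether there is a matching row in reference.

For a match to count as genuine, several conditions must be met:

1. I want the number of non-null values to be the same in both rows (so that I don't get false positives from a row in data that matches only part of a row in reference)
2. I want to avoid comparing NaN values, so only the part of the row that holds actual values is compared (which is why the first condition must hold)
3. I want to allow some tolerance when making the comparison (I am using np.isclose for this)
4. I want the code to be fast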
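
The first three conditions can be sketched for a single pair of rows as follows. This is only an illustration: the function name rows_match is hypothetical, and it assumes that the NaNs are always trailing (as described further down), so that "same number of values" is equivalent to "same NaN positions". Note that np.isclose also applies its default relative tolerance on top of atol.

```python
import numpy as np

def rows_match(row, ref, atol=5):
    """Return True if two 1-D rows satisfy the three matching conditions.

    Hypothetical helper for illustration; assumes NaNs are trailing.
    """
    row = np.asarray(row, dtype=float)
    ref = np.asarray(ref, dtype=float)
    row_valid = ~np.isnan(row)
    ref_valid = ~np.isnan(ref)
    # Conditions 1 and 2: both rows hold values in exactly the same positions
    if not np.array_equal(row_valid, ref_valid):
        return False
    # Condition 3: compare only the actual values, with a tolerance
    return bool(np.isclose(row[row_valid], ref[ref_valid], atol=atol).all())

nan = np.nan
same_length = rows_match([100, 198, nan, nan], [100, 200, nan, nan])
partial_only = rows_match([100, 198, nan, nan], [100, 200, 300, 400])
```

Here same_length is a match, while partial_only is rejected because the reference row holds more values than the data row.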

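Once a match is found, I append the names of both rows to a list. If there is no match, I append the name of the data row together with "not found" to that same list. Finally, I build a summary table to see which row corresponds (or does not correspond) to what.

To give you an idea of how my dataframes are structured: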

    name    col1    col2    col3    col4    col5    col6    col7    col8        
0     X       10     20      30      40      50      60      70      80 
1     X       20     30      NaN     NaN     NaN     NaN     NaN     NaN
2     X       10     25      30      50      NaN     NaN     NaN     NaN
3     X       20     25      30      50      NaN     NaN     NaN     NaN

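My dataframes have ~130 columns.
Most of the time a row has 2 to 20 numeric values, and the rest are NaN.
Within each row, the numeric values come first, sorted in ascending order, and the NaNs are at the end.

I have working code that uses two for loops, but it is rather slow on large dataframes (here I am testing the code on some "example" dataframes):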

import numpy as np
import pandas as pd

columns = ['name', 'start 1', 'end 1', 'start 2', 'end 2', 'start 3', 'end 3']

data = pd.DataFrame({'name':['read 1','read 2','read 3','read 4'],
                  'start 1':[100,102,100,103],
                  'end 1':[198,504,500,200],
                  'start 2':[np.nan,600,650,601],
                  'end 2':[np.nan,699, 700,702],
                  'start 3':[np.nan,800,800,np.nan],
                  'end 3':[np.nan,901, 900,np.nan]},
                   columns=columns)
# Cast only the numeric columns to float; 'name' holds strings
data = data.astype({c: 'float64' for c in columns[1:]})


reference = pd.DataFrame({'name':['a-1','a-2','b-1','c-1'],
                  'start 1':[100,100,100,300],
                  'end 1':[200,200,500,400],
                  'start 2':[300,np.nan,600,600],
                  'end 2':[400,np.nan, 700,700],
                  'start 3':[np.nan,np.nan,800,np.nan],
                  'end 3':[np.nan,np.nan, 900,np.nan]},
                   columns=columns)
reference = reference.astype({c: 'float64' for c in columns[1:]})



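match = []
checklist = set()

for read in data.itertuples():
    # read[0] is the index, read[1] the name, read[2:] the values

    # number of non-NaN values in this data row
    ndata = np.count_nonzero(~np.isnan(read[2:]), axis=0)

    # upper bound of the tuple slice that gets compared
    end = ndata+1 if ndata>2 and read[1] not in checklist else 4

    for ref in reference.itertuples():

        # number of non-NaN values in this reference row
        nref = np.count_nonzero(~np.isnan(ref[2:]), axis=0)

        # values must be close (tolerance of 5) and the counts must agree
        if np.isclose(read[2:end], ref[2:end], atol=5).all() and ndata == nref:
            match.append([read[1], ref[1]])
            checklist.add(read[1])
            break

    # no reference row matched this data row
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])

match_table = pd.DataFrame(match, columns=['read name', 'reference'])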


match_table:

    read name     reference
0     read 1         a-2
1     read 2         b-1
2     read 3      not found
3     read 4      not found

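So I decided to try to optimize it with vectorization. I am now down to a single for loop and was able to vectorize the third condition with np.isclose, but I did not manage to do so for the other conditions.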

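I can work around this by allowing equal_nan=True, but since most of my rows are full of NaN values, I think I would save some time if I didn't have to make those comparisons.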
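
As a small illustration of what equal_nan changes (the two arrays are made up for this example): by default np.isclose treats a NaN as unequal to everything, including another NaN, whereas equal_nan=True counts two NaNs in the same position as equal.

```python
import numpy as np

a = np.array([100.0, 198.0, np.nan])
b = np.array([100.0, 200.0, np.nan])

# Default behaviour: the NaN position compares as False
strict = np.isclose(a, b, atol=5)

# With equal_nan=True the NaN position compares as True
lenient = np.isclose(a, b, atol=5, equal_nan=True)
```

A NaN compared against an actual value is False in both cases, which is why requiring all positions to be True also rejects rows with a different number of values.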

Here is what I have so far:

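count = []

for read in data.itertuples(index=False):

    # compare this row against every reference row at once
    close = np.isclose(read[1:], reference.iloc[:, 1:].to_numpy(), atol=5, equal_nan=True)
    # positions of the reference rows where every column matched
    idx = np.flatnonzero(close.all(axis=1))

    if idx.size == 0:
        count.append([read[0], "not found"])
    else:
        idx = idx.item()  # assumes exactly one matching reference row
        count.append([read[0], reference['name'][idx]])

match = pd.DataFrame(count)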

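I tested it with a 400×130 data dataframe against a 25×130 reference dataframe. It runs about 6 times faster than the first version but still takes 1 second to finish. Maybe there isn't much room left for improvement, though.

Questions:

How can I vectorize the operations that handle conditions 1 and 2, so that NaN comparisons are not performed?
Is it possible to get rid of the inner for loop? If so, would that improve speed?

Bonus question:

Why did I have to change the index from read[1] to read[0] between the first and second versions of the code to select the ['name'] column? It seems to be 0-based in one version and not in the other, or something like that. As a self-taught Python beginner, I don't really understand what is going on here.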
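
The two versions differ in how they call itertuples: the first uses the default index=True, the second passes index=False. A minimal sketch of the effect, using a one-row dataframe made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'name': ['read 1'], 'start 1': [100.0]})

# Default (index=True): the row index is the first element of each tuple
with_index = next(df.itertuples())

# index=False: the tuple starts directly with the first column
without_index = next(df.itertuples(index=False))

name_with_index = with_index[1]        # position 0 holds the index label
name_without_index = without_index[0]  # position 0 holds 'name'
```

Both positions are 0-based; the name simply moves one slot to the left when the index is left out of the tuple.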

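【Comments on the question】:

Are the start and end values all either integers or null?
@HaleemurAli Yes. Apart from the first column, which holds the row-name strings, all other columns are filled with either float values or nulls (as in the example of my dataframe structure)

Tags: arrays python-3.x numpy dataframe vectorization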

【Solution 1】:

You can avoid the explicit loop by using df.apply. itertuples is slow and should only be used when absolutely necessary.

# index-setting not technically required, but makes the 
# rest of the code simpler
data = data.set_index('name')
reference = reference.set_index('name')

# define a helper function to use with apply
# taking the same logic as you have used
def get_ref(x):
    m = np.isclose(x, reference.values, atol=5, equal_nan=True).all(axis=1)
    return reference.index[m].item() if m.any() else np.nan

out = data.apply(get_ref, axis=1).rename('reference').reset_index()
# Outputs:
#      name reference
# 0  read 1       a-2
# 1  read 2       b-1
# 2  read 3       NaN
# 3  read 4       NaN

If you drop down to the numpy layer and use np.apply_along_axis, you can get an additional speed boost

pd.DataFrame({'read name': data.index,
              'reference': np.apply_along_axis(get_ref, 1, data.values)})

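Timings:

On my machine, with the example data:

the numpy version takes about 920 microseconds
the pandas apply version takes about 1.35 ms
the optimized version from the question takes about 2.20 ms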
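
As a further sketch, not part of the original answer: the per-row call can be replaced by a single broadcast comparison of every data row against every reference row. The function name match_rows is hypothetical, and the sketch assumes both dataframes are indexed by name, share the same column order, and are small enough that an (n_data, n_ref, n_cols) boolean array fits in memory. When several reference rows match, it keeps the first one. No timings are claimed for it.

```python
import numpy as np
import pandas as pd

def match_rows(data, reference, atol=5):
    """Hypothetical helper: match each data row to the first close reference row."""
    d = data.to_numpy(dtype=float)        # shape (n_data, n_cols)
    r = reference.to_numpy(dtype=float)   # shape (n_ref, n_cols)
    # Broadcast to (n_data, n_ref, n_cols); NaN vs NaN counts as equal,
    # NaN vs a value does not, so differing value counts are rejected
    close = np.isclose(d[:, None, :], r[None, :, :], atol=atol, equal_nan=True)
    hits = close.all(axis=2)              # shape (n_data, n_ref)
    found = hits.any(axis=1)
    first = hits.argmax(axis=1)           # first matching reference per data row
    names = reference.index.to_numpy()[first].astype(object)
    names[~found] = "not found"
    return pd.DataFrame({'read name': data.index, 'reference': names})

# Example data from the question, already indexed by name
cols = ['start 1', 'end 1', 'start 2', 'end 2', 'start 3', 'end 3']
nan = np.nan
data = pd.DataFrame(
    [[100, 198, nan, nan, nan, nan],
     [102, 504, 600, 699, 800, 901],
     [100, 500, 650, 700, 800, 900],
     [103, 200, 601, 702, nan, nan]],
    index=pd.Index(['read 1', 'read 2', 'read 3', 'read 4'], name='name'),
    columns=cols)
reference = pd.DataFrame(
    [[100, 200, 300, 400, nan, nan],
     [100, 200, nan, nan, nan, nan],
     [100, 500, 600, 700, 800, 900],
     [300, 400, 600, 700, nan, nan]],
    index=pd.Index(['a-1', 'a-2', 'b-1', 'c-1'], name='name'),
    columns=cols)

out = match_rows(data, reference)
```

On the example data this pairs read 1 with a-2 and read 2 with b-1, and marks read 3 and read 4 as "not found".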

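【Discussion】:

Thanks! It works very well on the example I gave above. However, the first solution with df.apply() works perfectly on my real dataframes, whereas the second solution with np.apply_along_axis() gives me the error "could not convert string to float". Any idea where it comes from? The only "strings" are in the first column, which is set as the index by the first lines of code.
Edit: it seems to be due to the . in my strings. When the dataframe is created at the end, it apparently tries to convert the strings to floats, hence the error.
Replacing np.NaN with "not found" solved the problem.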
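
A possible contributing factor, not confirmed in the thread: np.apply_along_axis allocates its output array from the result of the first row, so if the first row has no match (a float NaN), a later string result cannot be stored. The sketch below reproduces this with made-up two-row dataframes and shows a plain list comprehension as one way to keep mixed results; returning "not found" instead of np.nan, as in the comment above, avoids the float buffer as well.

```python
import numpy as np
import pandas as pd

# Made-up data: the first data row has no match, the second matches 'a.1'
reference = pd.DataFrame({'start 1': [100.0, 300.0], 'end 1': [200.0, 400.0]},
                         index=pd.Index(['a.1', 'c.1'], name='name'))
data = pd.DataFrame({'start 1': [500.0, 101.0], 'end 1': [600.0, 199.0]},
                    index=pd.Index(['read 1', 'read 2'], name='name'))

def get_ref(x):
    m = np.isclose(x, reference.values, atol=5, equal_nan=True).all(axis=1)
    return reference.index[m].item() if m.any() else np.nan

# The output buffer is typed after the first result (NaN, a float),
# so storing the string from the second row fails
try:
    np.apply_along_axis(get_ref, 1, data.values)
    raised = False
except ValueError:
    raised = True

# A list comprehension does not infer a dtype up front
refs = [get_ref(row) for row in data.values]
out = pd.DataFrame({'read name': data.index, 'reference': refs})
```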