Double iterrows() loop too slow in my case

【Question title】: Double iterrows() loop too slow in my case
【Posted】: 2019-07-16 10:35:10
【Question】:

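My goal is to normalize an "in" file using a "mock" file. The rule is: if an entry in the mock file is in the same group and its position falls in the interval between position_start and position_end, I have to subtract the "mock" score from data_values.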

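Below I present a simplified case. The actual tables are much larger and my solution is not fast enough. I have been looking for alternatives, but so far nothing seems to solve my problem. I am sure there is a faster way to do this and I hope someone can help me.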

The code I wrote does exactly what I want:

import pandas as pd

test_in_dict = {'group': [1, 1, 1, 2, 2, 2], 
                'position_start' :[10,20,30, 40, 50, 60], 
                'position_end' : [15, 25, 35, 45, 55, 65], 
                'data_values' : [11, 12, 13, 14, 15, 16]}
test_in = pd.DataFrame(data=test_in_dict)

test_mock_dict = {'group_m': [1, 1, 1, 1, 2, 2, 2, 2], 
                  'position_m' : [11, 16, 20, 52, 42, 47, 12, 65], 
                  'score_m': [1, 1, 2, 1, 3, 1, 2, 1]}
test_mock = pd.DataFrame(data=test_mock_dict)

for index_in, row_in in test_in.iterrows():
    for index_m, row_m in test_mock.iterrows():
        if (row_in['group'] == row_m['group_m']
                and row_in['position_start'] <= row_m['position_m'] < row_in['position_end']):
            # iterrows() yields copies of the rows, so write the result back through .loc
            test_in.loc[index_in, 'data_values'] -= row_m['score_m']

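How can I write something that does the same as the code above but avoids the double loop, which puts me at O(N×M) complexity with both N and M large (the mock file has many more entries than the in file)?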

【Comments】:

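Can't you use a dictionary to bucket the rows by group or group_m?
If there is no more elegant and robust solution that works on the two dataframes alone, I will probably try that. I guess it would help reduce the complexity a lot.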
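
The dictionary idea from the comment above could look roughly like the sketch below. It is only an illustration under my own assumptions, not code from the thread: the names mock_by_group, normalized and result are made up, and the column order of the two frames is assumed to be the one used in the question. Each input row is then compared only against mock entries of its own group, so the work drops from N×M to the sum of the per-group products.

```python
from collections import defaultdict

import pandas as pd

test_in = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                        'position_start': [10, 20, 30, 40, 50, 60],
                        'position_end': [15, 25, 35, 45, 55, 65],
                        'data_values': [11, 12, 13, 14, 15, 16]})
test_mock = pd.DataFrame({'group_m': [1, 1, 1, 1, 2, 2, 2, 2],
                          'position_m': [11, 16, 20, 52, 42, 47, 12, 65],
                          'score_m': [1, 1, 2, 1, 3, 1, 2, 1]})

# Bucket the mock entries by group (hypothetical helper structure).
mock_by_group = defaultdict(list)
for group_m, position_m, score_m in test_mock.itertuples(index=False):
    mock_by_group[group_m].append((position_m, score_m))

# For every input row, sum the scores of same-group mock entries whose
# position lies in [position_start, position_end).
normalized = []
for group, start, end, value in test_in.itertuples(index=False):
    hits = sum(score for pos, score in mock_by_group[group] if start <= pos < end)
    normalized.append(value - hits)

result = test_in.assign(data_values=normalized)
print(result)
```

This still scans every mock entry of a group for each input row, so it only helps when there are many groups of moderate size.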

Tags: python pandas

【Solution 1】:

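What you have is a typical join problem. In pandas we use the merge method for this. You can rewrite your iterrows loops as the code below, which will be faster because it uses vectorized methods: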

# first merge your two dataframes on the key column 'group' and 'group_m'
common = pd.merge(test_in, 
                    test_mock, 
                    left_on='group', 
                    right_on='group_m')

# after that, keep only the rows whose position_m falls in [position_start, position_end)
# (.copy() so the assignment below works on an independent frame, not a slice of common)
df_filter = common[(common.position_m >= common.position_start) & (common.position_m < common.position_end)].copy()

# apply the calculation that is needed on column 'data_values'
df_filter['data_values'] = df_filter['data_values'] - df_filter['score_m']

# drop the columns we don't need
df_filter = df_filter[['group', 'position_start', 'position_end', 'data_values']].reset_index(drop=True)

# now we need to get the rows from the original dataframe 'test_in' which did not get filtered
# (this compares position_start / position_end values on their own, so it relies on them being unique across test_in)
unmatch = test_in[(test_in.group.isin(df_filter.group)) & (~test_in.position_start.isin(df_filter.position_start)) & (~test_in.position_end.isin(df_filter.position_end))]

# finally we can concat these two together
df_final = pd.concat([df_filter, unmatch], ignore_index=True)

Output

    group   position_start  position_end    data_values
0   1       10              15              10
1   1       20              25              10
2   2       40              45              11
3   1       30              35              13
4   2       50              55              15
5   2       60              65              16
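
If one input row can be hit by several mock entries, or if some rows have no match at all, a variant that carries the original row label through the merge and sums the matching scores per row may be easier to reason about. This is a sketch under my own assumptions rather than part of the answer above: penalty and in_interval are illustrative names, and test_in is assumed to have a default, unnamed index so that reset_index() produces a column called 'index'.

```python
import pandas as pd

test_in = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                        'position_start': [10, 20, 30, 40, 50, 60],
                        'position_end': [15, 25, 35, 45, 55, 65],
                        'data_values': [11, 12, 13, 14, 15, 16]})
test_mock = pd.DataFrame({'group_m': [1, 1, 1, 1, 2, 2, 2, 2],
                          'position_m': [11, 16, 20, 52, 42, 47, 12, 65],
                          'score_m': [1, 1, 2, 1, 3, 1, 2, 1]})

# Keep the original row label as a column so it survives the merge.
common = test_in.reset_index().merge(
    test_mock, left_on='group', right_on='group_m')

# Same interval test as in the answer: start <= position_m < end.
in_interval = ((common['position_m'] >= common['position_start'])
               & (common['position_m'] < common['position_end']))

# Total mock score per original input row; rows without a match are absent.
penalty = common.loc[in_interval].groupby('index')['score_m'].sum()

# Subtract the totals, treating rows without a match as a penalty of 0.
df_final = test_in.copy()
df_final['data_values'] = (
    df_final['data_values'] - penalty.reindex(df_final.index, fill_value=0))
print(df_final)
```

The rows come back in the original order of test_in, so no extra sort or concat step is needed. The merge still materializes every same-group pair, so the memory footprint is the same as in the answer.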

【Comments】:

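Please edit your question to include the expected output.
Let me check, give me a minute.
The modified data values look fine; only the unmodified rows from test_in are missing from this output.
@codeprimate123 Sorry for the mistake, this should work now, please check.
My apologies for not including the output in the question, I will next time. This looks good; after a simple sort I get what I need. I will now adapt the code to my real use case and see how fast it runs. My double loop has not finished after 3 hours...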

【Solution 2】:

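The accepted answer is in place and should work, but because OP's data is so large he could not get that solution to run. So I want to try an experimental answer, which is why I am adding it as a separate answer instead of editing my already accepted one: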

An extra step for the solution: as we can see, the cardinality of the join becomes many-to-many, because both key columns, group and group_m, contain duplicates.

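So I looked at the data and saw that every position_start value is rounded to base 10. We can therefore reduce the cardinality by creating an artificial key column called position_m_round in the second df, test_mock, as follows: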

# make a function which rounds an integer to the nearest multiple of base
# (Python's round() sends exact halves to the even multiple, e.g. 25 -> 20)
def myround(x, base=10):
    return int(base * round(float(x)/base))

# apply this function to our 'position_m' column and create a new key column to join
test_mock['position_m_round'] = test_mock.position_m.apply(myround)

    group_m position_m  score_m position_m_round
0   1       11          1       10
1   1       16          1       20
2   1       20          2       20
3   1       52          1       50
4   2       42          3       40
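
On a large mock table, calling a Python function once per row can itself become slow. As far as I know the same key can be computed on the whole column at once, because Series.round rounds exact halves to the even value just like Python's built-in round. A minimal sketch, where via_apply and vectorized are names I made up for the comparison:

```python
import pandas as pd

test_mock = pd.DataFrame({'group_m': [1, 1, 1, 1, 2, 2, 2, 2],
                          'position_m': [11, 16, 20, 52, 42, 47, 12, 65],
                          'score_m': [1, 1, 2, 1, 3, 1, 2, 1]})

def myround(x, base=10):
    return int(base * round(float(x) / base))

# Row-by-row version, as in the answer.
via_apply = test_mock['position_m'].apply(myround)

# Column-at-once version: divide, round to the nearest integer, scale back.
vectorized = (test_mock['position_m'] / 10).round().mul(10).astype(int)

print(vectorized.tolist())
```

Either result can be assigned to test_mock['position_m_round'] before the merge.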

# do the merge again, but now we reduce cardinality because we have two keys to join
common = pd.merge(test_in, 
                    test_mock, 
                    left_on=['group', 'position_start'],
                    right_on=['group_m', 'position_m_round'])

'''
this part becomes the same as the original answer
'''

# after that, keep only the rows whose position_m falls in [position_start, position_end)
# (.copy() so the assignment below works on an independent frame, not a slice of common)
df_filter = common[(common.position_m >= common.position_start) & (common.position_m < common.position_end)].copy()

# apply the calculation that is needed on column 'data_values'
df_filter['data_values'] = df_filter['data_values'] - df_filter['score_m']

# drop the columns we don't need
df_filter = df_filter[['group', 'position_start', 'position_end', 'data_values']].reset_index(drop=True)

# now we need to get the rows from the original dataframe 'test_in' which did not get filtered
# (this compares position_start / position_end values on their own, so it relies on them being unique across test_in)
unmatch = test_in[(test_in.group.isin(df_filter.group)) & (~test_in.position_start.isin(df_filter.position_start)) & (~test_in.position_end.isin(df_filter.position_end))]

# finally we can concat these two together
df_final = pd.concat([df_filter, unmatch], ignore_index=True)

Output

    group   position_start  position_end    data_values
0   1       10              15              10
1   1       20              25              10
2   2       40              45              11
3   1       30              35              13
4   2       50              55              15
5   2       60              65              16

【Comments】:

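I rounded the test data to make the arithmetic easier to do in my head. In reality the positions, position_start and position_end values are huge integers that are not rounded like in my small example. I found that I can make your solution work by shrinking the mock data with a pre-filter that uses a single iterrows() loop. However, since I can still see cases where RAM usage could explode and I want a robust solution, I will use a tool designed for computing intersections (it uses a more suitable data structure than tables, but its input and output can be tables). Thank you, I learned a lot!
I understand, in that case this indeed won't work. Good luck, and you're welcome!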
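
For readers who want to stay in pandas and NumPy without building the merged table, one possible approach is to sort the mock positions per group, keep a running total of their scores, and look up both interval ends with a binary search. This is not the tool OP refers to (it is not named in the thread), just a sketch under my own assumptions; mock_groups, prefix, lo and hi are illustrative names. Memory stays proportional to the input size and the work is roughly O((N + M) log M).

```python
import numpy as np
import pandas as pd

test_in = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                        'position_start': [10, 20, 30, 40, 50, 60],
                        'position_end': [15, 25, 35, 45, 55, 65],
                        'data_values': [11, 12, 13, 14, 15, 16]})
test_mock = pd.DataFrame({'group_m': [1, 1, 1, 1, 2, 2, 2, 2],
                          'position_m': [11, 16, 20, 52, 42, 47, 12, 65],
                          'score_m': [1, 1, 2, 1, 3, 1, 2, 1]})

result = test_in.copy()
mock_groups = {key: frame for key, frame in test_mock.groupby('group_m')}

for group, rows in test_in.groupby('group'):
    mock = mock_groups.get(group)
    if mock is None:
        continue  # no mock entries for this group, nothing to subtract
    mock = mock.sort_values('position_m')
    positions = mock['position_m'].to_numpy()
    # prefix[i] is the total score of the i smallest positions in this group.
    prefix = np.concatenate(([0], mock['score_m'].to_numpy().cumsum()))
    # Number of mock positions strictly below each interval start / end.
    lo = np.searchsorted(positions, rows['position_start'].to_numpy(), side='left')
    hi = np.searchsorted(positions, rows['position_end'].to_numpy(), side='left')
    # Score total for positions in [start, end) is a difference of prefix sums.
    result.loc[rows.index, 'data_values'] -= prefix[hi] - prefix[lo]

print(result)
```

Using side='left' for both lookups keeps the start inclusive and the end exclusive, which matches the condition in the question.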