Link pairs of rows based on a condition in a large Pandas DataFrame: answers

【Question title】: Link pairs of rows based on a condition in a large Pandas DataFrame
【Posted】: 2021-09-15 19:30:36
【Question】:

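I have a DataFrame with about 150K observations. The goal is to find pairs of rows that reference each other (a circular reference). The code below does the job on a small dataset, but it is very slow. Can someone help make it run faster?

The idea is that if A references B and B references A, the two rows should be linked.

Example: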

import pandas as pd

df_test = pd.DataFrame({'c1': ['A', 'C', 'D', 'B'], 'c2': ['B', 'D', 'C', 'A'], 'id': [1, 2, 3, 4]})

# I created a "direction" column to record which way each row references
df_test['direction'] = df_test['c1'] + df_test['c2']

new_df = pd.DataFrame()
for index, row in df_test.iterrows():
    if index not in df_test.index:
        continue  # this row was already paired and dropped in an earlier iteration
    direction_to_search = row['c2'] + row['c1']
    df_test.drop(index, inplace=True)
    for index2, row2 in df_test.iterrows():
        if row2['direction'] == direction_to_search:
            df_temp = pd.concat([pd.DataFrame(row.values).T, pd.DataFrame(row2.values).T], axis=1, ignore_index=True)
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            new_df = pd.concat([new_df, df_temp])
            df_test.drop(index2, inplace=True)
    if df_test.empty:
        break

【Comments】:

When you ask a question, please take the time to respond to the people who try to help you. Take the tour.

Tags: python pandas bigdata data-analysis network-analysis

【Solution 1】:

Try a self-merge, followed by drop_duplicates on a key that ignores the column order:

import numpy as np
import pandas as pd

new_df = df_test.merge(df_test, left_on=['c1', 'c2'], right_on=['c2', 'c1'])
p = np.sort(new_df[['c1_x', 'c2_x']].values)
new_df['pair'] = p[:, 0] + p[:, 1]
new_df = new_df.drop_duplicates('pair')

new_df:

  c1_x c2_x  id_x c1_y c2_y  id_y pair
0    A    B     1    B    A     4   AB
1    C    D     2    D    C     3   CD
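
One caveat: building pair by string concatenation is only safe when the values cannot run together. With multi-character values, 'AB' + 'C' and 'A' + 'BC' produce the same key. A minimal sketch that deduplicates on the two sorted values as separate columns instead; the sample data and the helper columns lo and hi are illustrative, not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data where a concatenated key would collide: 'AB'+'C' == 'A'+'BC'
df = pd.DataFrame({'c1': ['AB', 'C', 'A', 'BC'],
                   'c2': ['C', 'AB', 'BC', 'A'],
                   'id': [1, 2, 3, 4]})

merged = df.merge(df, left_on=['c1', 'c2'], right_on=['c2', 'c1'])

# Sort each (c1_x, c2_x) pair within its row and keep both parts separate
p = np.sort(merged[['c1_x', 'c2_x']].to_numpy(), axis=1)
merged['lo'], merged['hi'] = p[:, 0], p[:, 1]

# One row per unordered pair, then remove the helper columns
linked = merged.drop_duplicates(['lo', 'hi']).drop(columns=['lo', 'hi'])
print(linked[['id_x', 'id_y']])
```

With a single concatenated key, all four merged rows here would share the key 'ABC' and only one pair would survive.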

Or, if the "direction" column is needed:

df_test['direction'] = df_test['c1'] + df_test['c2']
new_df = df_test.merge(df_test, left_on=['c1', 'c2'], right_on=['c2', 'c1'])
p = np.sort(new_df[['c1_x', 'c2_x']].values)
new_df['pair'] = p[:, 0] + p[:, 1]
new_df = new_df.drop_duplicates('pair')

new_df:

  c1_x c2_x  id_x direction_x c1_y c2_y  id_y direction_y pair
0    A    B     1          AB    B    A     4          BA   AB
1    C    D     2          CD    D    C     3          DC   CD

The pair column can be dropped if it is not needed:

new_df = new_df.drop('pair', axis=1)

【Comments】:

【Solution 2】:

You can iterate over c1 and c2, sort each pair of values and join them into a string, then use pd.Categorical, which gives each group a unique number:

df_test['group'] = pd.Categorical([''.join(sorted(x)) for x in df_test[['c1', 'c2']].values]).codes

  c1 c2  id  group
0  A  B   1      0
1  C  D   2      1
2  D  C   3      1
3  B  A   4      0
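
The group code on its own also labels rows that have no reverse partner. A rough sketch of turning the codes into linked id pairs, under the assumption that a group counts as linked only when both directions occur in it; the extra E/F row and the '|' separator are illustrative choices, and the separator is assumed not to occur in the values:

```python
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'C', 'D', 'B', 'E'],
                   'c2': ['B', 'D', 'C', 'A', 'F'],   # E -> F has no reverse row
                   'id': [1, 2, 3, 4, 5]})

# Order-independent key per row; '|' is an assumed separator
keys = ['|'.join(sorted(pair)) for pair in df[['c1', 'c2']].to_numpy()]
df['group'] = pd.Categorical(keys).codes

# Count the distinct directions seen in each group (2 means A->B and B->A both exist)
n_directions = (df['c1'] + '|' + df['c2']).groupby(df['group']).transform('nunique')

# Collect the ids of each linked pair
linked_ids = df[n_directions == 2].groupby('group')['id'].agg(list)
print(linked_ids)
```

This avoids any Python-level loop over pairs of rows; the only per-row work is building the key.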

【Comments】:

【Solution 3】:

Since my previous answer was very similar to @HenryEcker's, I suggest another solution using networkx:

import networkx as nx

G = nx.from_pandas_edgelist(df_test, 'c1', 'c2', create_using=nx.DiGraph())

cycles = nx.simple_cycles(G)

>>> for cycle in cycles:
...     print(cycle)
['D', 'C']
['A', 'B']
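
Keep in mind that simple_cycles enumerates cycles of every length, so on real data it would also report longer loops such as E -> F -> G -> E. If only mutual references (A -> B together with B -> A) matter, one possible sketch checks each edge for its reverse instead; the extra rows in the sample data are made up for illustration:

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'C', 'D', 'B', 'E', 'F', 'G'],
                   'c2': ['B', 'D', 'C', 'A', 'F', 'G', 'E'],  # E -> F -> G -> E is a 3-cycle
                   'id': [1, 2, 3, 4, 5, 6, 7]})

G = nx.from_pandas_edgelist(df, 'c1', 'c2', create_using=nx.DiGraph())

# Keep an edge only if its reverse also exists; u < v reports each pair once
mutual = sorted((u, v) for u, v in G.edges() if u < v and G.has_edge(v, u))
print(mutual)
```

The 3-cycle is ignored here because none of its edges has a reverse edge.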

You can also sort the values of c1 and c2 within each row and look for duplicates:

import numpy as np

df_test[['c3', 'c4']] = np.sort(df_test[['c1', 'c2']])

>>> df_test
  c1 c2  id direction c3 c4
0  A  B   1        AB  A  B
1  C  D   2        CD  C  D
2  D  C   3        DC  C  D
3  B  A   4        BA  A  B

>>> df_test.drop_duplicates(['c3', 'c4'])
  c1 c2  id direction c3 c4
0  A  B   1        AB  A  B
1  C  D   2        CD  C  D
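
drop_duplicates keeps one representative per pair, and it also keeps rows that have no partner at all. To select every row that takes part in a pair, a sketch using duplicated(keep=False) on the sorted columns; it assumes the same (c1, c2) row never appears twice, since exact repeats would be flagged too, and the E/F row is added only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'C', 'D', 'B', 'E'],
                   'c2': ['B', 'D', 'C', 'A', 'F'],   # E -> F has no reverse row
                   'id': [1, 2, 3, 4, 5]})

# Sort the two values within each row
s = np.sort(df[['c1', 'c2']].to_numpy(), axis=1)
df['c3'], df['c4'] = s[:, 0], s[:, 1]

# keep=False marks every row whose (c3, c4) key occurs more than once
paired = df[df.duplicated(['c3', 'c4'], keep=False)]
print(paired['id'].tolist())
```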

【Comments】: