【Question title】:pandas join tables on two columns without ordering of values
【Posted】:2022-01-10 13:28:22
【Question description】:

I want to achieve what is described here: stackoverflow question, but using only standard pandas.

I have two dataframes. First:

  first_employee target_employee  relationship
0            Andy          Claude             0
1            Andy           Frida            20
2            Andy         Georgia           -10
3            Andy            Joan            30
4            Andy             Lee           -10
5            Andy           Pablo           -10
6            Andy         Vincent            20
7          Claude           Frida             0
8          Claude         Georgia            90
9          Claude            Joan             0
10         Claude             Lee             0
11         Claude           Pablo            10
12         Claude         Vincent             0
13          Frida         Georgia             0
14          Frida            Joan             0
15          Frida             Lee             0
16          Frida           Pablo            50
17          Frida         Vincent            60
18        Georgia            Joan             0
19        Georgia             Lee            10
20        Georgia           Pablo             0
21        Georgia         Vincent             0
22           Joan             Lee            70
23           Joan           Pablo             0
24           Joan         Vincent            10
25            Lee           Pablo             0
26            Lee         Vincent             0
27          Pablo         Vincent           -20

Second:

   first_employee target_employee  book_count
0         Vincent           Frida           2
1         Vincent           Pablo           1
2            Andy          Claude           1
3            Andy            Joan           1
4            Andy           Pablo           1
5            Andy             Lee           1
6            Andy           Frida           1
7            Andy         Georgia           1
8          Claude         Georgia           3
9            Joan             Lee           3
10          Pablo           Frida           2

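I want to join these two dataframes so that the final dataframe is the same as the first one, but with an additional book_count column holding the corresponding values (NaN where none is available).

I wrote something like this: joined_df = first_df.merge(second_df, on = ['first_employee', 'target_employee'], how = 'outer') and got: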

   first_employee target_employee  relationship  book_count
0            Andy          Claude           0.0         1.0
1            Andy           Frida          20.0         1.0
2            Andy         Georgia         -10.0         1.0
3            Andy            Joan          30.0         1.0
4            Andy             Lee         -10.0         1.0
5            Andy           Pablo         -10.0         1.0
6            Andy         Vincent          20.0         NaN
7          Claude           Frida           0.0         NaN
8          Claude         Georgia          90.0         3.0
9          Claude            Joan           0.0         NaN
10         Claude             Lee           0.0         NaN
11         Claude           Pablo          10.0         NaN
12         Claude         Vincent           0.0         NaN
13          Frida         Georgia           0.0         NaN
14          Frida            Joan           0.0         NaN
15          Frida             Lee           0.0         NaN
16          Frida           Pablo          50.0         NaN
17          Frida         Vincent          60.0         NaN
18        Georgia            Joan           0.0         NaN
19        Georgia             Lee          10.0         NaN
20        Georgia           Pablo           0.0         NaN
21        Georgia         Vincent           0.0         NaN
22           Joan             Lee          70.0         3.0
23           Joan           Pablo           0.0         NaN
24           Joan         Vincent          10.0         NaN
25            Lee           Pablo           0.0         NaN
26            Lee         Vincent           0.0         NaN
27          Pablo         Vincent         -20.0         NaN
28        Vincent           Frida           NaN         2.0
29        Vincent           Pablo           NaN         1.0
30          Pablo           Frida           NaN         2.0

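This is fairly close to what I want to achieve. However, the order of the values in first_employee and target_employee does not matter, so if the first dataframe has (Frida, Vincent) and the second has (Vincent, Frida), these two rows should be merged together (what matters is the values, not which column they are in).

In my resulting dataframe I get three extra rows: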

28        Vincent           Frida           NaN         2.0
29        Vincent           Pablo           NaN         1.0
30          Pablo           Frida           NaN         2.0

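They are a result of my merge, which treats the value columns as "ordered" when joining: these 3 extra rows should be merged into the already available pairs (Frida, Vincent), (Pablo, Vincent) and (Frida, Pablo).

Is there a way to do this using only standard pandas functions? (The question I referenced at the beginning uses sqldf.)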

【Question comments】:

    Tags: python pandas dataframe join merge


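    【Solution 1】:

    I believe this is what you are looking for. Using np.sort reorders the first two columns within each row so that they are in alphabetical order, which lets the merge work properly.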

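    import numpy as np
    import pandas as pd

    cols = ['first_employee', 'target_employee']
    # Sort the two names within each row so (Vincent, Frida) becomes (Frida, Vincent).
    df[cols] = np.sort(df[cols].to_numpy(), axis=1)
    df2[cols] = np.sort(df2[cols].to_numpy(), axis=1)
    ndf = pd.merge(df, df2, on=cols, how='left')  # left join keeps every row of df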
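
For anyone who wants to try the idea without touching their own frames, here is a minimal self-contained sketch. The two small dataframes are an illustrative subset of the question's data, and the helper name normalize_pairs is made up for this example; it works on copies so the originals are not overwritten.

```python
import numpy as np
import pandas as pd

# Illustrative subset of the question's data (not the full tables).
first_df = pd.DataFrame({
    "first_employee": ["Andy", "Frida", "Frida", "Pablo"],
    "target_employee": ["Claude", "Pablo", "Vincent", "Vincent"],
    "relationship": [0, 50, 60, -20],
})
second_df = pd.DataFrame({
    "first_employee": ["Vincent", "Vincent", "Andy", "Pablo"],
    "target_employee": ["Frida", "Pablo", "Claude", "Frida"],
    "book_count": [2, 1, 1, 2],
})

cols = ["first_employee", "target_employee"]


def normalize_pairs(frame):
    """Return a copy whose two name columns are sorted alphabetically within each row.

    Hypothetical helper for this sketch, not a pandas function.
    """
    out = frame.copy()
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out


# Left join: every row of first_df is kept, book_count is filled where a pair matches.
merged = normalize_pairs(first_df).merge(normalize_pairs(second_df), on=cols, how="left")
print(merged)
```

Because both frames are normalized the same way, (Vincent, Frida) in the second frame lines up with (Frida, Vincent) in the first.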
    

    【Comments】:

      【Solution 2】:

      Create a key as a sorted tuple of the first and target employee, then merge on it:

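      import pandas as pd

      # Order-independent key: the two names of a row, sorted into a tuple.
      create_key = lambda x: tuple(sorted([x['first_employee'], x['target_employee']]))
      # Merge on the key, then drop '_key' and df2's suffixed name columns.
      out = pd.merge(df1.assign(_key=df1.apply(create_key, axis=1)),
                     df2.assign(_key=df2.apply(create_key, axis=1)),
                     on='_key', suffixes=('', '_key'), how='outer') \
              .loc[:, lambda x: ~x.columns.str.endswith('_key')]
      print(out)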
      
      # Output:
         first_employee target_employee  relationship  book_count
      0            Andy          Claude             0         1.0
      1            Andy           Frida            20         1.0
      2            Andy         Georgia           -10         1.0
      3            Andy            Joan            30         1.0
      4            Andy             Lee           -10         1.0
      5            Andy           Pablo           -10         1.0
      6            Andy         Vincent            20         NaN
      7          Claude           Frida             0         NaN
      8          Claude         Georgia            90         3.0
      9          Claude            Joan             0         NaN
      10         Claude             Lee             0         NaN
      11         Claude           Pablo            10         NaN
      12         Claude         Vincent             0         NaN
      13          Frida         Georgia             0         NaN
      14          Frida            Joan             0         NaN
      15          Frida             Lee             0         NaN
      16          Frida           Pablo            50         2.0
      17          Frida         Vincent            60         2.0
      18        Georgia            Joan             0         NaN
      19        Georgia             Lee            10         NaN
      20        Georgia           Pablo             0         NaN
      21        Georgia         Vincent             0         NaN
      22           Joan             Lee            70         3.0
      23           Joan           Pablo             0         NaN
      24           Joan         Vincent            10         NaN
      25            Lee           Pablo             0         NaN
      26            Lee         Vincent             0         NaN
      27          Pablo         Vincent           -20         1.0
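
A possible variation on the same key idea, sketched under some assumptions: the key is built as two temporary columns with np.sort instead of a row-wise apply, and a left join is used so the result has exactly the rows of the first dataframe, with its names left in their original order. The sample rows and the temporary column names _key_a and _key_b are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration; note (Vincent, Frida) vs (Frida, Vincent).
df1 = pd.DataFrame({
    "first_employee": ["Vincent", "Andy", "Frida"],
    "target_employee": ["Frida", "Claude", "Georgia"],
    "relationship": [60, 0, 0],
})
df2 = pd.DataFrame({
    "first_employee": ["Frida", "Claude"],
    "target_employee": ["Vincent", "Andy"],
    "book_count": [2, 1],
})

cols = ["first_employee", "target_employee"]
key_cols = ["_key_a", "_key_b"]  # hypothetical temporary column names


def with_pair_key(frame):
    """Append two key columns holding each row's names in alphabetical order."""
    keys = pd.DataFrame(
        np.sort(frame[cols].to_numpy(), axis=1),
        columns=key_cols,
        index=frame.index,
    )
    return frame.join(keys)


left = with_pair_key(df1)
# Only the key columns and book_count are needed from the second frame.
right = with_pair_key(df2).drop(columns=cols)

# Left join on the order-independent key, then drop the temporary key columns.
out = left.merge(right, on=key_cols, how="left").drop(columns=key_cols)
print(out)
```

Here the first row stays (Vincent, Frida) as it appears in df1, and the unmatched pair (Frida, Georgia) gets NaN in book_count.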
      

      【Comments】:
