How to perform an inner or outer join of DataFrames with Pandas on a non-simplistic criterion

【Question title】: How to perform an inner or outer join of DataFrames with Pandas on a non-simplistic criterion
【Posted】: 2013-03-12 23:42:16
【Question】:

Given two DataFrames as follows:

>>> import pandas as pd

>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
>>> df_a
   a  b
0  1  4
1  2  5
2  3  6

>>> df_b
   c  d
0  2  7
1  3  8

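We would like to produce a SQL-style join of the two DataFrames using a non-simplistic criterion, say "df_b.c > df_a.a". From what I can tell, while merge() is certainly part of the solution, I can't use it directly, since it doesn't accept arbitrary expressions for "ON" criteria (unless I'm missing something?).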

In SQL, the results look like this:

# inner join
sqlite> select * from df_a join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8

# outer join
sqlite> select * from df_a left outer join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
3|6||

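My current approach for the inner join is to produce the Cartesian product of df_a and df_b by adding a column of "1"s to both, using merge() on that column, and then applying the "c > a" criterion.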

>>> import numpy as np
>>> df_a['ones'] = np.ones(3)
>>> df_b['ones'] = np.ones(2)
>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
>>> cartesian
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
2  2  5     1  2  7
3  2  5     1  3  8
4  3  6     1  2  7
5  3  6     1  3  8
>>> cartesian[cartesian.c > cartesian.a]
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
3  2  5     1  3  8
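
In pandas 1.2 and later, merge() accepts how="cross", so the helper column of ones is no longer needed. A minimal sketch of the same "Cartesian product, then filter" approach, assuming a recent pandas version:

```python
import pandas as pd

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

# Cross join: every row of df_a paired with every row of df_b (3 * 2 = 6 rows).
cartesian = df_a.merge(df_b, how="cross")

# Apply the join criterion "c > a" as an ordinary boolean filter.
inner = cartesian[cartesian["c"] > cartesian["a"]].reset_index(drop=True)
print(inner)
```

The memory cost is unchanged: the full product is still materialized before filtering.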

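For the outer join, I'm not sure of the best way to go. So far I've been playing with getting the inner join, then applying the negation of the criterion to get all the other rows, then trying to merge that "negation" set back onto the original, but it doesn't really work.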

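Edit. HYRY answered the specific question here, but I needed something more generic and more within the Pandas API, since my join criterion could be anything, not just a single comparison. For the outer join, I first add an extra index to the "left" side that will survive the inner join: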

df_a['_left_index'] = df_a.index

Then we do the Cartesian product and get the inner join:

cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
innerjoin = cartesian[cartesian.c > cartesian.a]

Then I get the remaining index ids in "df_a" that we still need, and fetch those rows from "df_a":

remaining_left_ids = set(df_a['_left_index']).\
                    difference(innerjoin['_left_index'])
# .ix was removed in pandas 1.0; .loc needs a list, not a set
remaining = df_a.loc[sorted(remaining_left_ids)]

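Then we use a straight concat(), which fills the missing columns with "NaN" for the rows that come only from the left side (I thought it didn't do this earlier, but I guess it does):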

outerjoin = pd.concat([innerjoin, remaining]).reset_index()

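HYRY's idea of doing the Cartesian product on only the columns we need to compare is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).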
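
The steps described in this edit can be collected into one function. The sketch below is not from the original post: `left_join_where` is a hypothetical helper name, it assumes pandas 1.2+ for how="cross", and it takes the criterion as a callable applied to the cross-joined frame so that any expression can be used.

```python
import pandas as pd

def left_join_where(left, right, predicate):
    """Hypothetical helper: left outer join on an arbitrary criterion.

    `predicate` receives the cross-joined frame and returns a boolean Series.
    """
    # Tag each left row so it can be traced through the cross join.
    tagged = left.reset_index(drop=True).rename_axis("_left_index").reset_index()
    cartesian = tagged.merge(right, how="cross")
    inner = cartesian[predicate(cartesian)]
    # Left rows that matched nothing on the right.
    unmatched = tagged[~tagged["_left_index"].isin(inner["_left_index"])]
    # concat fills the right-hand columns of the unmatched rows with NaN.
    out = pd.concat([inner, unmatched], ignore_index=True)
    return out.drop(columns="_left_index")

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

result = left_join_where(df_a, df_b, lambda df: df["c"] > df["a"])
print(result)
```

This still builds the full Cartesian product, so it trades memory for generality.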

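Questions:

How would you produce the "join" of df_a and df_b on "c > a"? Would you use the same "Cartesian product, filter" approach, or is there a better way?
How would you produce the "left outer join" of the same?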

【Comments】:

Tags: python sql numpy pandas

【Solution 1】:

conditional_join from pyjanitor may help with the abstraction and convenience; the function is currently in development:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

Inner join

 df_a.conditional_join(df_b, ('a', 'c', '<'))

  left    right
     a  b     c  d
0    1  4     2  7
1    1  4     3  8
2    2  5     3  8

Left join

df_a.conditional_join(df_b, ('a', 'c', '<'), how = 'left')

  left    right
     a  b     c    d
0    1  4   2.0  7.0
1    1  4   3.0  8.0
2    2  5   3.0  8.0
3    3  6   NaN  NaN

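The function takes a variable number (*args) of conditions, each given as a tuple of the form (column from left, column from right, join operator).

【Comments】:

【Solution 2】:

This can be done with broadcasting and np.where. Use whatever binary operator you want that evaluates to True/False: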

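import operator as op

import numpy as np
import pandas as pd

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

# Any binary operator that returns True/False can be used here.
binOp   = op.lt
# Broadcast a column vector of df_a.a against df_b.c to get a
# (len(df_a), len(df_b)) boolean matrix; np.where returns the True positions.
matches = np.where(binOp(df_a.a.values[:, None], df_b.c.values))

# matches = (row positions in df_a, row positions in df_b)
print(pd.concat([df.iloc[idxs].reset_index(drop=True)
                 for df, idxs in zip([df_a, df_b], matches)],
                axis=1).to_csv())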

,a,b,c,d
0,1,4,2,7
1,1,4,3,8
2,2,5,3,8
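
Because the broadcast comparison yields a plain boolean matrix, several criteria can be combined with & or | before calling np.where. A sketch using the same example frames; the second condition, b + d > 11, is purely illustrative and not part of the original question:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

a = df_a["a"].to_numpy()[:, None]   # shape (3, 1): one row per df_a row
b = df_a["b"].to_numpy()[:, None]
c = df_b["c"].to_numpy()            # shape (2,): broadcast across columns
d = df_b["d"].to_numpy()

# mask[i, j] is True when row i of df_a matches row j of df_b.
mask = (a < c) & (b + d > 11)
ia, ib = np.where(mask)

joined = pd.concat(
    [df_a.iloc[ia].reset_index(drop=True), df_b.iloc[ib].reset_index(drop=True)],
    axis=1,
)
print(joined)
```

Only the compared columns are broadcast, so the intermediate cost is one boolean matrix of size len(df_a) * len(df_b) per condition.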

【Comments】:

【Solution 3】:

I use the outer method of a ufunc to compute the result. Here is an example:

First, some data:

import pandas as pd
import numpy as np
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])
print("df_a")
print(df_a)
print("df_b")
print(df_b)

Output:

df_a
   a  b
0  1  4
1  2  5
2  3  6
3  4  8
4  1  7
df_b
   c   d
0  2   7
1  3   8
2  2  10

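Inner join. Because this only computes the Cartesian product of c and a, memory usage is lower than for the Cartesian product of the whole DataFrames: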

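# np.less.outer compares every value of a with every value of c (a < c);
# plain NumPy arrays are passed because ufunc.outer is not supported on Series.
ia, ib = np.where(np.less.outer(df_a.a.values, df_b.c.values))
print(pd.concat((df_a.take(ia).reset_index(drop=True),
                 df_b.take(ib).reset_index(drop=True)), axis=1))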

Output:

   a  b  c   d
0  1  4  2   7
1  1  4  3   8
2  1  4  2  10
3  2  5  3   8
4  1  7  2   7
5  1  7  3   8
6  1  7  2  10

To compute the left outer join, use numpy.setdiff1d() to find all the rows of df_a that are not in the inner join:

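na = np.setdiff1d(np.arange(len(df_a)), ia)   # positions in df_a with no match
nb = -1 * np.ones_like(na)                    # -1 marks "no matching row in df_b"
oa = np.concatenate((ia, na))
ob = np.concatenate((ib, nb))
# In current pandas, take(-1) returns the last row, so reindex is used instead:
# -1 is not a label in df_b's default 0..n-1 index and becomes a NaN row.
print(pd.concat([df_a.take(oa).reset_index(drop=True),
                 df_b.reindex(ob).reset_index(drop=True)], axis=1))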

Output:

   a  b   c   d
0  1  4   2   7
1  1  4   3   8
2  1  4   2  10
3  2  5   3   8
4  1  7   2   7
5  1  7   3   8
6  1  7   2  10
7  3  6 NaN NaN
8  4  8 NaN NaN
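
The index bookkeeping above can be wrapped in a small function that accepts any boolean match matrix. This is a sketch, not part of the original answer: `left_join_from_mask` is a hypothetical helper name. It pads unmatched rows by reindexing with -1 rather than via take, because in current pandas a negative position passed to take counts from the end instead of producing NaN.

```python
import numpy as np
import pandas as pd

def left_join_from_mask(left, right, mask):
    """Hypothetical helper: left outer join where mask[i, j] marks matching rows."""
    ia, ib = np.where(mask)
    # Positions of left rows that matched nothing.
    na = np.setdiff1d(np.arange(len(left)), ia)
    oa = np.concatenate((ia, na))
    ob = np.concatenate((ib, np.full(len(na), -1)))
    left_part = left.take(oa).reset_index(drop=True)
    # With a fresh 0..n-1 index, -1 is a missing label, so reindex yields NaN rows.
    right_part = right.reset_index(drop=True).reindex(ob).reset_index(drop=True)
    return pd.concat([left_part, right_part], axis=1)

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6},
                     {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])

# Only the two compared columns are expanded into a 5 x 3 boolean matrix.
mask = np.less.outer(df_a["a"].to_numpy(), df_b["c"].to_numpy())
result = left_join_from_mask(df_a, df_b, mask)
print(result)
```

Since unmatched rows introduce NaN, the c and d columns come back as floats.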

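【Comments】:

Still parsing this. Is there a way the expression can be done using the Pandas Series (i.e. produced by "df_a.a
The idea of doing the Cartesian product on just the columns is worth looking into, though, since I need to save on memory...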