Explain how pandas DataFrame join works

【Question title】: Explain how pandas DataFrame join works
【Posted】: 2023-04-02 19:23:01
【Question】:

Why does an inner join behave so strangely in pandas?

For example:

import pandas as pd
import io

t1 = ('key,col1\n'
      '1,a\n'
      '2,b\n'
      '3,c\n'
      '4,d')

t2 = ('key,col2\n'
      '1,e\n'
      '2,f\n'
      '3,g\n'
      '4,h')


df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)

print(df1)
print()
print(df2)
print()
print(df2.join(df1, on='key', how='inner', lsuffix='_l'))

Output:

   key col1
0    1    a
1    2    b
2    3    c
3    4    d

   key col2
0    1    e
1    2    f
2    3    g
3    4    h

   key_l col2  key col1
0      1    e    2    b
1      2    f    3    c
2      3    g    4    d

If I don't specify lsuffix, it raises

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')

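Does this function work differently from SQL's JOIN? Why does it create an extra 'key' column with a suffix? Why are there only 3 rows? I expected it to output something like this: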

   key col1 col2
0    1    a    e
1    2    b    f
2    3    c    g
3    4    d    h

【Question comments】:

Tags: python python-3.x pandas dataframe

【Answer 1】:

First things first:
What you want is merge

df1.merge(df2)
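
As a minimal sketch of why merge gives the table the question expected: called with no arguments, merge joins on the columns the two frames have in common (here only key) and defaults to an inner join. The frames are rebuilt from the question's CSV text so the snippet runs on its own; the name merged is just for illustration.

```python
import io

import pandas as pd

# Same data as in the question, rebuilt so this snippet is self-contained.
df1 = pd.read_csv(io.StringIO('key,col1\n1,a\n2,b\n3,c\n4,d'))
df2 = pd.read_csv(io.StringIO('key,col2\n1,e\n2,f\n3,g\n4,h'))

# No 'on' given: merge uses the shared column 'key' and does an inner join,
# so 'key' appears once and no suffix is needed.
merged = df1.merge(df2)
print(merged)
```

All four rows survive because every key value in df1 also occurs in the key column of df2.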

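join merges on the index by default. You can specify the on parameter, but it only names the column of the left frame to match against the index of the right frame.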

These might help illustrate

df1.set_index('key').join(df2.set_index('key'))

df1.join(df2.set_index('key'), on='key')
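
Here is a runnable sketch of those two forms, using the question's data. The variable names by_index and by_column are only illustrative. In both cases the right frame has key as its index, so the match is made on the key values rather than on the default 0..3 row labels.

```python
import io

import pandas as pd

df1 = pd.read_csv(io.StringIO('key,col1\n1,a\n2,b\n3,c\n4,d'))
df2 = pd.read_csv(io.StringIO('key,col2\n1,e\n2,f\n3,g\n4,h'))

# Index-to-index join: both frames are indexed by 'key',
# so the result is also indexed by 'key'.
by_index = df1.set_index('key').join(df2.set_index('key'))

# Column-to-index join: the 'key' column of df1 is matched against
# the index of the right frame; df1 keeps its own 0..3 row index.
by_column = df1.join(df2.set_index('key'), on='key')

print(by_index)
print()
print(by_column)
```

Neither call needs lsuffix, because after set_index('key') the right frame no longer has a key column that could clash with the left one.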

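Your example matches the key column of the left frame, which looks like [1, 2, 3, 4], against the index of the right frame, which looks like [0, 1, 2, 3]. The key value 4 has no matching index label, so the inner join drops that row and returns only 3 rows.
With df1 on the left and how='outer', that row is kept, which is why you get NaN in col2 when key_l is 4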

df1.join(df2, on='key', lsuffix='_l', how='outer')
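
To make the mismatch concrete, here is a small sketch comparing the inner and outer versions of that call on the question's data. The names inner and outer are just for illustration, and the row order of the outer result may differ between pandas versions, so the snippet checks counts rather than positions.

```python
import io

import pandas as pd

df1 = pd.read_csv(io.StringIO('key,col1\n1,a\n2,b\n3,c\n4,d'))
df2 = pd.read_csv(io.StringIO('key,col2\n1,e\n2,f\n3,g\n4,h'))

# df1's 'key' column (1..4) is matched against df2's default index (0..3).
inner = df1.join(df2, on='key', lsuffix='_l', how='inner')
outer = df1.join(df2, on='key', lsuffix='_l', how='outer')

# Only key values 1, 2 and 3 exist as index labels in df2.
print(len(inner))

# The outer join keeps the unmatched left row (key_l == 4) and the
# unmatched right row (index label 0), filling the gaps with NaN.
print(len(outer))
print(outer[outer['key_l'] == 4])
```

Note that the matched rows are shifted by one: key_l 1 is paired with the df2 row whose index label is 1, which is the row with key 2.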

【Comments】:

Thanks. So what does join do, then?
@spiderface Working on a more comprehensive answer. The post has been updated.