【Posted】: 2022-01-19 14:13:20
【Question】:
【Comments】:
Tags: python pandas join duplicates
To get everything except the intersection of two pandas datasets, try the following:
# Everything from the first except what is in the second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is in the first
r2 = df2[~df2.isin(df1)]
# Concatenate and drop the rows that contain NaNs
result = pd.concat(
    [r1, r2]
).dropna().reset_index(drop=True)
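One caveat with the boolean-mask approach: your int values may turn into floats. By default, pandas replaces the values where the mask is False with NaN, which is a float, and therefore casts the whole column to float. You can see this in the result of the example below.
To avoid this, declare the dtype explicitly when creating the DataFrame, for example the nullable Int64 dtype.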
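As a minimal sketch of that dtype fix, the snippet below reads the same kind of data with the nullable Int64 dtype and applies the same mask. The inline CSV strings and the names CSV1 and CSV2 are assumptions made here so the snippet runs without files on disk; they are not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for csv1.csv and csv2.csv.
CSV1 = "A,B,C\n1,2,3\n4,5,6\n7,8,9\n"
CSV2 = "A,B,C\n1,2,3\n4,5,6\n10,11,12\n"

# The nullable Int64 dtype can hold missing values without casting to float.
df1 = pd.read_csv(io.StringIO(CSV1), dtype="Int64")
df2 = pd.read_csv(io.StringIO(CSV2), dtype="Int64")

# Same masking as in the answer: matching cells become missing values.
r1 = df1[~df1.isin(df2)]
r2 = df2[~df2.isin(df1)]

# Concatenate, drop rows with missing values, and renumber the index.
result = pd.concat([r1, r2]).dropna().reset_index(drop=True)

print(result)
print(result.dtypes)
```

With this dtype the masked cells should show up as <NA> rather than a float NaN, so the surviving rows keep their integer type.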
import pandas as pd

# Pass dtype="Int64" to read_csv to keep the integer columns from becoming float
df1 = pd.read_csv("./csv1.csv")  # , dtype="Int64")
print(f"csv1\n{df1}\n")
df2 = pd.read_csv("./csv2.csv")  # , dtype="Int64")
print(f"csv2\n{df2}\n")

# Everything from the first except what is in the second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is in the first
r2 = df2[~df2.isin(df1)]

# Concatenate and drop the rows that contain NaNs
result = pd.concat(
    [r1, r2]
).dropna().reset_index(drop=True)
print(f"result\n{result}\n")
csv1
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

csv2
    A   B   C
0   1   2   3
1   4   5   6
2  10  11  12

result
      A     B     C
0   7.0   8.0   9.0
1  10.0  11.0  12.0
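Note that isin with a DataFrame argument compares cell by cell, aligned on index and column labels, so a row that matches the other frame in only some cells loses just those cells and is then removed entirely by dropna. If whole rows should be compared instead, one possible alternative (not the method used in the answer above) is an outer merge with indicator=True. The sketch below assumes both frames share the same columns and builds the sample data inline:

```python
import pandas as pd

# Sample frames mirroring csv1 and csv2 from the example above.
df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
df2 = pd.DataFrame({"A": [1, 4, 10], "B": [2, 5, 11], "C": [3, 6, 12]})

# With no "on" argument, merge joins on all common columns (A, B, C here).
# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = df1.merge(df2, how="outer", indicator=True)

# Keep the rows found in only one frame, then drop the helper column.
result = (
    merged[merged["_merge"] != "both"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

print(result)
```

Because every column is a join key here and no missing values are introduced, the int-to-float conversion described above should not occur in this case. Rows that are duplicated within a frame would be multiplied by the merge, so deduplicating first may be needed.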
【Comments】: