【Posted】: 2021-11-12 17:05:30
【Question】:
I have two DataFrames:
df1:
+----------+-------------+-------------+--------------+---------------+
|customerId| fullName| telephone1| telephone2| email|
+----------+-------------+-------------+--------------+---------------+
| 201534|MARIO JIMENEZ|01722-3500391|+5215553623333|ascencio@my.com|
| 879535| MARIO LOPEZ|01722-3500377|+5215553623333| asceloe@my.com|
+----------+-------------+-------------+--------------+---------------+
df2:
+----------+-------------+-------------+--------------+---------------+
|customerId| fullName| telephone1| telephone2| email|
+----------+-------------+-------------+--------------+---------------+
| 201534|MARIO JIMENEZ|01722-3500391|+5215553623333|ascencio@my.com|
| 201536| ROBERT MITZ|01722-3500377|+5215553623333| asceloe@my.com|
| 201537| MARY ENG|01722-3500127|+5215553623111|generic1@my.com|
| 201538| RICK BURT|01722-3500983|+5215553623324|generic2@my.com|
| 201539| JHON DOE|01722-3502547|+5215553621476|generic3@my.com|
+----------+-------------+-------------+--------------+---------------+
I need to build a third DataFrame containing the rows of df1 that do not exist in df2.
Like this:
+----------+-------------+-------------+--------------+---------------+
|customerId| fullName| telephone1| telephone2| email|
+----------+-------------+-------------+--------------+---------------+
| 879535| MARIO LOPEZ|01722-3500377|+5215553623333| asceloe@my.com|
+----------+-------------+-------------+--------------+---------------+
What is the correct way to do this?
I have already tried the following:
diff = df2.join(df1, df2['customerId'] != df1['customerId'],"left")
diff = df1.subtract(df2)
diff = df1[~ df1['customerId'].isin(df2['customerId'])]
But none of them work. Any suggestions?
【Comments】:
-
In general, people will find it easier to help if you provide code that generates your DataFrames.
-
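Your "Like this" example is a row that does exist in df2, but you say what you "need" is rows that "do not exist in df2". Please resolve the contradiction; otherwise we can't help.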
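Following up on both comments, here is a minimal sketch of a reproducible setup. The rows are copied from the tables in the question. Everything else is an assumption: all columns are treated as strings, and `COLUMNS`, `DF1_ROWS`, `DF2_ROWS`, `build_frames` and `rows_missing_from_df2` are hypothetical names, not code from the asker. The second helper treats "does not exist in df2" as "no row in df2 has the same customerId" and uses a left anti join for it. That is only one possible reading, so the key is left as a parameter.

```python
# Sample rows copied from the question's tables.
# Assumption: every column, including customerId, is a string.
COLUMNS = ["customerId", "fullName", "telephone1", "telephone2", "email"]

DF1_ROWS = [
    ("201534", "MARIO JIMENEZ", "01722-3500391", "+5215553623333", "ascencio@my.com"),
    ("879535", "MARIO LOPEZ", "01722-3500377", "+5215553623333", "asceloe@my.com"),
]

DF2_ROWS = [
    ("201534", "MARIO JIMENEZ", "01722-3500391", "+5215553623333", "ascencio@my.com"),
    ("201536", "ROBERT MITZ", "01722-3500377", "+5215553623333", "asceloe@my.com"),
    ("201537", "MARY ENG", "01722-3500127", "+5215553623111", "generic1@my.com"),
    ("201538", "RICK BURT", "01722-3500983", "+5215553623324", "generic2@my.com"),
    ("201539", "JHON DOE", "01722-3502547", "+5215553621476", "generic3@my.com"),
]


def build_frames(spark):
    """Create df1 and df2 from the sample rows using an existing SparkSession."""
    df1 = spark.createDataFrame(DF1_ROWS, COLUMNS)
    df2 = spark.createDataFrame(DF2_ROWS, COLUMNS)
    return df1, df2


def rows_missing_from_df2(df1, df2, key="customerId"):
    """Keep the rows of df1 whose `key` value has no match in df2.

    A left anti join returns only left-side rows without a match on the
    join key, and only the left-side columns.
    """
    return df1.join(df2, on=key, how="left_anti")


# Usage (needs a Spark runtime, e.g. a Glue job or a local pyspark install):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df1, df2 = build_frames(spark)
#   rows_missing_from_df2(df1, df2).show()
```

With customerId as the key, only the 879535 row of df1 has no match in df2. Passing a different `key` (for example "email") changes which rows count as missing, which is the ambiguity the second comment points at.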
Tags: pandas dataframe pyspark aws-glue aws-glue-spark