【Question title】: Filter a dataframe that has multiple ID (PK) columns by passing the values for each ID in a list
【Posted】: 2020-11-06 18:14:08
【Question】:

I am trying to filter a dataframe that has multiple ID columns by passing the values for each ID in a list.

For example, the dataframe is:

location_user
transactiontime (string)
user_id (bigint)
location_id (bigint)
Address1 (string)
Address2 (string)
user_name (string)
loc_name (string)

In the dataframe above, user_id and location_id are both ID columns.

Goal: filter the dataframe on user_id in [42939,42940] and location_id in [1468,1469].

I created separate lists as shown below and applied them in df.filter.

partition_key =['user_id', 'location_id']
filter_cond = ['[42939,42940]', '[1468,1469]']

---> This works for a single partition_key:

filter_df=actual_df.filter(~col(partition_key).isin(filter_cond))

I tried the following for the combination of partition keys, but it does not work and raises the error below.

filter_df=actual_df.filter(~col(partition_key).isInCollection(filter_cond))

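Error: An error occurred while overwriting the directory. Please check that the correct parameters were passed. Exception: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace: py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist

Any suggestions are appreciated.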

【Comments】:

Tags: python sql dataframe pyspark apache-spark-sql

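【Solution 1】:

You can achieve this by zipping each column name with its list of values and building a single SQL condition string: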

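from pyspark.sql.functions import expr

partition_key = ['id', 'id2']
filter_cond = [[1, 2], [100, 200]]
# Build one "col in (...)" clause per column and join the clauses with AND
cond = ' AND '.join([f'{colname} in {tuple(values)}' for colname, values in zip(partition_key, filter_cond)])
print(cond)

df.filter(expr(cond)).show()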

#id in (1, 2) AND id2 in (100, 200)
#+---+---+
#| id|id2|
#+---+---+
#|  1|100|
#|  1|200|
#|  2|100|
#|  2|200|
#+---+---+
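Note that the question stores each value list as a string ('[42939,42940]'), while the snippet above expects real lists of numbers. A minimal sketch of that conversion, assuming every string is a valid JSON-style list literal; the helper name parse_value_lists is hypothetical and not part of the original answer:

```python
import json
from typing import List


def parse_value_lists(raw_lists: List[str]) -> List[list]:
    """Turn strings such as '[42939,42940]' into real Python lists."""
    # Assumption: each string is a JSON-compatible list of numbers.
    return [json.loads(text) for text in raw_lists]


raw_filter_cond = ['[42939,42940]', '[1468,1469]']
filter_cond = parse_value_lists(raw_filter_cond)
print(filter_cond)  # [[42939, 42940], [1468, 1469]]
```

The parsed lists can then be zipped with partition_key exactly as in the snippet above.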

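Update for single-element lists (tuple() renders a one-element list as "(17954,)", which is not valid SQL, so join the values explicitly):

cond = ' AND '.join([f'{colname} in ({",".join(map(str, values))})' for colname, values in zip(partition_key, filter_cond)])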
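The same idea can be wrapped in a small helper so that the single-value and multi-value cases go through one code path. This is only a sketch under stated assumptions: the name build_in_condition is hypothetical, the values are assumed to be numeric (string values would need quoting and escaping), and column names are assumed to be trusted identifiers.

```python
from typing import Iterable, Sequence


def build_in_condition(columns: Sequence[str],
                       values_per_column: Sequence[Iterable[int]]) -> str:
    """Build "col in (...)" clauses, one per column, joined with AND."""
    if len(columns) != len(values_per_column):
        raise ValueError("need exactly one value list per column")
    clauses = []
    for column, values in zip(columns, values_per_column):
        values = list(values)
        if not values:
            # "col in ()" is not valid SQL, so reject empty lists early.
            raise ValueError(f"no values given for column {column!r}")
        # Join the values by hand so that a one-element list renders as
        # (17954) rather than the Python tuple form (17954,).
        rendered = ", ".join(str(v) for v in values)
        clauses.append(f"{column} in ({rendered})")
    return " AND ".join(clauses)


partition_key = ["user_id", "location_id"]
filter_cond = [[17954], [3350]]
cond = build_in_condition(partition_key, filter_cond)
print(cond)  # user_id in (17954) AND location_id in (3350)
```

The resulting string can be passed to the filter the same way as before, for example df.filter(expr(cond)).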

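【Comments】:

Hi Shubham, thanks for the reply. I tried it, this time with PARTITION KEYS: ['user_id', 'location_id'] and the filter [[17954], [3350]], where each list holds a single value. It failed because of the extra comma in the generated condition "(17954,) AND location_id in (3350,)". It does work when each list contains multiple values. Thanks a lot.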