pyspark join with conditions for empty string: answers

【Question title】: pyspark join with conditions for empty string
【Posted】: 2021-02-23 16:52:15
【Question description】:

I have the three DataFrames below.

df_prod

Year  ID      Name   brand  Point 
2020  20903   Ken    KKK    2000
2019  12890   Matt   MMM    209
2017  346780  Nene   NNN    2000
2020  346780  Nene   NNN    6000

df_miss

Name    brand   point
Holy    HHH     345
Joshi   JJJ     900

df_sale

ID      Name  Year    brand   
126789  Holy  2010            
346780  Nene  2017    NNN     
346780  Nene  2020    NNN

I need to join df_sale according to the following conditions. If "brand" is not null, I need an INNER join of df_sale and df_prod on Year and Name. If "brand" is NULL, I need to join df_sale and df_miss on Name.

Is it possible to have conditions when joining in pyspark? I can see some examples in Scala, but I am looking for a pyspark implementation.

Pseudocode logic

if brand != null
   df_sale.join(df_prod, on=['Year', 'ID'], how='inner') and df_sale['Name'] = df_prod['Name'] & df_sale['point'] = df_prod['point']
   
elif brand == null
   df_sale.join(df_miss, on=['Name'], how='inner') and
   df_sale['point'] = df_prod['point']

Expected output:

ID      Name  Year    brand   point
126789  Holy  2010            345
346780  Nene  2017    NNN     2000
346780  Nene  2020    NNN     2000

Can this be done in pyspark or SQL? Please advise. Thanks.

【Question discussion】:

Tags: dataframe join pyspark

【Solution 1】:

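When you think about IF ... ELSE ... conditions in a DataFrame (or, for that matter, a SQL table), keep in mind that the condition has to be applied to the whole table, as if you were walking through it row by row.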

That leaves you with two options (note that f stands for pyspark.sql.functions):

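1. Split the df_sale table into two parts, df_sale_brand_null and df_sale_brand, based on the f.col("brand").isNull() condition, using something like [input_df.filter(~fail_test), input_df.filter(fail_test)]. Then join each part to its related table on the required columns (df_sale_brand to df_prod, df_sale_brand_null to df_miss), take care of any columns that do not line up, and finally unionByName the two joined tables.
2. Union the DataFrames df_miss and df_prod, filling in the columns that df_miss lacks. Then join df_sale to the unioned table (aliased a and b respectively) on a conditional expression such as f.when(f.col("a.brand").isNotNull(), (f.col("a.Year") == f.col("b.Year")) & (f.col("a.ID") == f.col("b.ID"))).otherwise(f.col("a.Name") == f.col("b.Name")). The output of f.when(...).otherwise(...) is a Column, so the join accepts it as a valid input for the on= parameter.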

【Comments】:

Thanks for the logic. I'll give it a try.