Pyspark 代码优化 - 以更好的方式处理它答案

【问题标题】：Pyspark Code optimization - to handle it in better wayPyspark 代码优化 - 以更好的方式处理它
【发布时间】：2021-02-27 15:20:03
【问题描述】：

我有两个 Pyspark 数据框，分别是 'a' 和 'b'。我通过直接从数据框“a”中选择几个字段来离开，而对于其他字段，我正在检查条件（如果“a”字段为空，则选择“b”字段）。下面的代码绝对可以正常工作，我得到了所需的结果。

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
    .select(
        'a.name_id','a.SUM','a.full_name',
        f.when(f.isnull(f.col('a.first_name')),f.col('b.first_name')).otherwise(f.col('a.first_name')).alias('first_name'),
        f.when(f.isnull(f.col('a.last_name')),f.col('b.last_name')).otherwise(f.col('a.last_name')).alias('last_name'),
        f.when(f.isnull(f.col('a.email')),f.col('b.email')).otherwise(f.col('a.email')).alias('email'),
        f.when(f.isnull(f.col('a.phone_number')),f.col('b.phone_number')).otherwise(f.col('a.phone_number')).alias('phone_number'),
        f.when(f.isnull(f.col('a.address')),f.col('b.address')).otherwise(f.col('a.address')).alias('address'),
        f.when(f.isnull(f.col('a.address_2')),f.col('b.address_2')).otherwise(f.col('a.address_2')).alias('address_2'),
        f.when(f.isnull(f.col('a.city')),f.col('b.city')).otherwise(f.col('a.city')).alias('city'),
 f.when(f.isnull(f.col('a.email_alt')),f.col('b.email_alt')).otherwise(f.col('a.email_alt')).alias('email_alt'),
       'a.updated','a.date','a.client_reference_code','a.reservation_status',\
       'a.total_cancellations','a.total_covers','a.total_noshows','a.total_spend',\
       'a.total_spend_per_cover','a.total_spend_per_visit','a.total_visits','a.id')

我想知道字段的数量是否会随着时间的推移而增加，那么我将如何使用循环处理这些代码，以便我可以自动化它。

我尝试了下面的代码但出现错误，有人可以帮忙吗？

col_list = [所有必填字段]

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
    .select('a.name_id','a.SUM','a.full_name',\

for x in col_list:
    f.when(f.isnull(f.col('a.x')),f.col('b.x')).otherwise(f.col('a.x')).alias('x'),
)

我认为在选择中我不能使用循环，请给我其他建议。strong text

【问题讨论】：

标签： python dataframe apache-spark pyspark left-join

【解决方案1】：

在列表中添加所需的列或列表达式，然后将该列表传递给select。

检查下面的代码。

col_list = [all required fields]

使用when函数

colExpr = ['a.name_id','a.SUM','a.full_name'] + list(map(lambda x: f.when(f.isnull(f.col('a.x')),f.col('b.x')).otherwise(f.col('a.x')).alias('x'),col_list))

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left').select(*colExpr) # select

使用nvl函数

colExpr = ['a.name_id','a.SUM','a.full_name'] + list(map(lambda x: "nvl(a.{},b.{}) as {}".format(x,x,x),col_list))

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left').selectExpr(*colExpr) # selectExpr

【讨论】：