Replace column values in a PySpark DataFrame based on multiple conditions: answers

【Question title】: Replace column values in a PySpark DataFrame based on multiple conditions
【Posted】: 2021-12-13 12:44:40
【Question】:

I have the following PySpark DataFrame:

import pandas as pd

a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']


df = pd.DataFrame(dict(dcode=a, zone=b))



   dcode   zone
0   480s  North
1   480s   West
2   499s   East
3   499s  North
4   650s   East
5   650s  North
6   702s  North
7   702s   West
8   736s  North
9   736s  South
10  736s   West
11  737s  North
12  737s   West

I want my DataFrame to look like this:

   dcode   zone output
0   480s  North     NW
1   480s   West     NW
2   499s   East       
3   499s  North     NW
4   650s   East       
5   650s  North     NW
6   702s  North       
7   702s   West       
8   736s  North       
9   736s  South       
10  736s   West       
11  737s  North       
12  737s   West

I am using the following logic for this, but it does not give the desired result.

  df_ = df.withColumn("output", F.when((F.col("Zone") == "North") | (F.col("Zone") == "West") & (F.col("dcode") != "702s") | (F.col("dcode") != "736s") | (F.col("dcode") != "737s"), "NW"))

I want NW in the output column only when zone is North or West and dcode is not one of 736s, 737s, 702s.

【Comments】:

Tags: apache-spark pyspark

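【Solution 1】:

Consider first converting your pandas df to a Spark DataFrame, since you are using PySpark syntax. I would then suggest rewriting your condition with isin, which is more concise and easier to read: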

from pyspark.sql import functions as F
df = spark.createDataFrame(df)

df_ = df.withColumn(
    "output",
    F.when(
        F.col("zone").isin("North", "West") & ~F.col("dcode").isin("736s", "737s", "702s"),
        "NW",
    ).otherwise(""),
)

>>> df_.show(truncate=False)

+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW    |
|480s |West |NW    |
|499s |East |      |
|499s |North|NW    |
|650s |East |      |
|650s |North|NW    |
|702s |North|      |
|702s |West |      |
|736s |North|      |
|736s |South|      |
|736s |West |      |
|737s |North|      |
|737s |West |      |
+-----+-----+------+
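
As a quick sanity check that does not need a Spark session, the same rule can be applied in plain Python to the sample rows from the question. This is only a sketch: the helper name `label` and the constants `ZONES` and `EXCLUDED` are hypothetical and simply mirror the isin / ~isin condition.

```python
# Sample data copied from the question.
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']

ZONES = {"North", "West"}             # zones that qualify for "NW"
EXCLUDED = {"736s", "737s", "702s"}   # dcodes that never get "NW"

def label(dcode, zone):
    """Hypothetical helper: plain-Python mirror of the isin / ~isin condition."""
    return "NW" if zone in ZONES and dcode not in EXCLUDED else ""

outputs = [label(d, z) for d, z in zip(a, b)]
print(outputs)
```

Only the rows (480s, North), (480s, West), (499s, North) and (650s, North) should come back as NW.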

【Comments】:

【Solution 2】:

You can use a SQL-style expression directly (the expr function).

import pyspark.sql.functions as F
......
df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
......
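
Note that a CASE expression without an ELSE branch yields null for rows that do not match, whereas the desired output in the question shows an empty value. If the zone and dcode lists live in Python variables, the expression string can be assembled before it is passed to F.expr. The following is a minimal sketch under assumptions: the helper `build_case_expr` and its parameters are hypothetical, and it assumes the values contain no single quotes.

```python
def build_case_expr(zones, excluded_dcodes, label="NW", default=""):
    """Hypothetical helper: build a SQL CASE expression string for F.expr."""
    def quote(values):
        # Assumes the values contain no single quotes.
        return ", ".join("'{}'".format(v) for v in values)

    return (
        "case when zone in ({}) and dcode not in ({}) "
        "then '{}' else '{}' end"
    ).format(quote(zones), quote(excluded_dcodes), label, default)

expr_sql = build_case_expr(["North", "West"], ["736s", "737s", "702s"])
print(expr_sql)

# With a Spark DataFrame df and pyspark.sql.functions imported as F:
# df = df.withColumn("output", F.expr(expr_sql))
```

The explicit else '' branch makes non-matching rows an empty string instead of null.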

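【Comments】:

【Solution 3】:

Check your parentheses: in Python, & binds more tightly than |, so the zone comparisons need their own group.

By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark; it creates a pandas DataFrame.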

from pyspark.sql import functions as F
import pandas as pd


a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']


df = pd.DataFrame(dict(dcode=a, zone=b))

df_ = spark.createDataFrame(df)

# Group the zone checks with |, then AND the group with the dcode exclusions.
# The exclusions must be joined with &: a chain of != joined with | is always true.
df_ = df_.withColumn("output", F.when(
    ((F.col("zone") == "North") | (F.col("zone") == "West"))
    & ((F.col("dcode") != "702s") & (F.col("dcode") != "736s") & (F.col("dcode") != "737s")),
    "NW"))

df_.show()

+-----+-----+------+
|dcode| zone|output|
+-----+-----+------+
| 480s|North|    NW|
| 480s| West|    NW|
| 499s| East|  null|
| 499s|North|    NW|
| 650s| East|  null|
| 650s|North|    NW|
| 702s|North|  null|
| 702s| West|  null|
| 736s|North|  null|
| 736s|South|  null|
| 736s| West|  null|
| 737s|North|  null|
| 737s| West|  null|
+-----+-----+------+
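
Two separate issues affect the condition in the question, and both can be reproduced with plain Python booleans without Spark. First, & binds more tightly than |. Second, a chain of != comparisons joined with | is always true, because any value differs from at least one of the literals. The function names `ored` and `anded` below are hypothetical and exist only for illustration.

```python
def ored(dcode):
    # Always True: any value differs from at least one of the two literals.
    return (dcode != "702s") | (dcode != "736s")

def anded(dcode):
    # True only when dcode is none of the excluded values.
    return (dcode != "702s") & (dcode != "736s") & (dcode != "737s")

or_results = [ored(d) for d in ["480s", "702s", "736s"]]
and_results = [anded(d) for d in ["480s", "702s", "736s"]]
print(or_results)    # [True, True, True]
print(and_results)   # [True, False, False]

# Precedence: x | y & z is parsed as x | (y & z).
x, y, z = True, False, False
ungrouped = x | y & z
grouped = (x | y) & z
print(ungrouped)     # True
print(grouped)       # False
```

The same precedence applies to PySpark Column expressions, since they overload the same Python operators.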

【Comments】: