Is it possible to use "if condition" python using Pyspark columns? [duplicate]

【Question title】: Is it possible to use "if condition" python using Pyspark columns? [duplicate]
【Posted】: 2022-01-17 18:24:31
【Question】:

I'm trying to use an if condition inside a Python function and then use that function to do some calculations on DataFrame values.

#init data
+---+----+----+------+
| id|team|game|result|
+---+----+----+------+
|  1|   A|Home|      |
|  2|   A|Away|      |
|  3|   B|Home|      |
|  4|   B|Away|      |
|  5|   C|Home|      |
|  6|   C|Away|      |
|  7|   D|Home|      |
|  8|   D|Away|      |
+---+----+----+------+

### I want to replace the value of result, and I tried using a function

def replace_result(team_name,game_kind,result):
  if col('team') == team_name and col('game') == game_kind:
     return result
  else:
     return col('result')

df = df.withColumn('result', replace_result('A','Away','0-1'))

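But it gives me this error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

My question is:

Is it possible to use an if condition with PySpark DataFrame columns?

Thanks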

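【Comments】:

Tags: python apache-spark pyspark

【Solution 1】:

Yes. Spark has built-in functions for this: when (in pyspark.sql.functions) and otherwise (a method on the Column that when returns).

Take the following DataFrame.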

df.show()
+---+----+----+
| id|team|game|
+---+----+----+
|  1|   A|Home|
|  2|   A|Away|
|  3|   B|Home|
|  4|   B|Away|
|  5|   C|Home|
|  6|   C|Away|
|  7|   D|Home|
|  8|   D|Away|
+---+----+----+

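You can chain when and otherwise conditions like this. Note that the conditions are combined with & rather than and, and that each comparison is wrapped in its own parentheses.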

from pyspark.sql import functions

df = (df.withColumn("result", 
        functions.when((df["team"] == "A") & (df["game"] == "Home"), "WIN")
                 .when((df["team"] == "B") & (df["game"] == "Away"), "WIN")
                 .when((df["team"] == "D") & (df["game"] == "Home"), "WIN")
                 .when((df["team"] == "D") & (df["game"] == "Away"), "WIN")
                 .otherwise("LOSS")))

df.show()
+---+----+----+------+
| id|team|game|result|
+---+----+----+------+
|  1|   A|Home|   WIN|
|  2|   A|Away|  LOSS|
|  3|   B|Home|  LOSS|
|  4|   B|Away|   WIN|
|  5|   C|Home|  LOSS|
|  6|   C|Away|  LOSS|
|  7|   D|Home|   WIN|
|  8|   D|Away|   WIN|
+---+----+----+------+

【Comments】:

【Solution 2】:

When working with DataFrames, you need to use a udf for custom code.

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.udf.html

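【Comments】:

The error is the same.
You can't just put the udf decorator on your function. You need to rework it so that it operates on Python types rather than column expressions.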
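
A rough sketch of what that rework could look like, assuming string-typed team, game and result columns as in the question. The names replace_result_py and with_replaced_result are hypothetical, chosen only for this illustration. The comparison logic lives in an ordinary Python function that receives plain values, and the UDF passes the column values into it one row at a time.

```python
def replace_result_py(team, game, current, team_name, game_kind, new_result):
    """Plain-Python logic: every argument is an ordinary value, not a Column."""
    if team == team_name and game == game_kind:
        return new_result
    return current


def with_replaced_result(df, team_name, game_kind, new_result):
    """Apply replace_result_py to each row of df through a UDF."""
    # Imported lazily so that replace_result_py stays usable without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # The lambda receives the per-row Python values of the three columns.
    replace_udf = F.udf(
        lambda team, game, current: replace_result_py(
            team, game, current, team_name, game_kind, new_result
        ),
        StringType(),
    )
    return df.withColumn(
        "result", replace_udf(F.col("team"), F.col("game"), F.col("result"))
    )
```

A call such as with_replaced_result(df, "A", "Away", "0-1") should give the same result as the when/otherwise approach, though a Python UDF is generally slower than built-in column expressions because each row is passed through the Python interpreter.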