Get the number of nulls per row in a PySpark DataFrame: answers

【Question title】: Get the number of null per row in PySpark dataframe
【Posted】: 2018-09-21 13:16:07
【Question】:

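This may be a duplicate, but somehow I have been searching for a long time without finding it:

I want to get the number of nulls per row in a Spark DataFrame, i.e.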

col1 col2 col3
null    1    a
   1    2    b
   2    3 null

should end up as:

col1 col2 col3 number_of_null
null    1    a              1
   1    2    b              0
   2    3 null              1

More generally, I want to count how many times a given string or number appears in each row of a Spark DataFrame.

i.e.

col1 col2 col3  number_of_ABC
 ABC    1    a              1
   1    2    b              0
   2  ABC  ABC              2

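I am using PySpark 2.3.0 and would prefer a solution that does not involve SQL syntax. For some reason I can't seem to google this. :/

EDIT: Assume I have so many columns that I cannot list them all.

EDIT2: I explicitly do not want a pandas solution.

EDIT3: The solutions explained with sum or mean do not work, because they raise an error: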

(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))

【Comments on the question】:

Possible duplicate of Spark DataFrame: Computing row-wise mean (or any aggregate operation)
Following the linked duplicate: df.select(sum(col(x).isNull() for x in df.columns).alias("number_of_null"))
When I run this on my dataset, I get: py4j.protocol.Py4JJavaError: An error occurred while calling o1999.select. : org.apache.spark.sql.AnalysisException: cannot resolve '((`log_time` IS NULL) + 0)' due to data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int).;;
Cast the boolean to int: df.select(sum(col(x).isNull().cast("int") for x in df.columns).alias("number_of_null"))
That seems to work. Thanks!!

Tags: pyspark apache-spark-sql

【Solution 1】:

In Scala:

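import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")
val expected = "ABC"
// One 0/1 indicator per column, summed into a single Column expression
val complexColumn: Column = df.schema.fieldNames.map(c => when(col(c) === lit(expected), 1).otherwise(0)).reduce((a, b) => a + b)
df.withColumn("countABC", complexColumn).show(false)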

Output:

+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1   |a   |1       |
|1   |2   |b   |0       |
|2   |ABC |ABC |2       |
+----+----+----+--------+

【Comments】:

【Solution 2】:

As mentioned in pasha701's answer, I resorted to map and reduce. Note that I am using Spark 1.6.x and Python 2.7.

Using your DataFrame as df (as is):

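from functools import reduce  # a builtin on Python 2.7, a required import on Python 3
from pyspark.sql import functions as func

# sqlc is assumed to be an existing SQLContext
dfvals = [
  (None, "1", "a"),
  ("1", "2", "b"),
  ("2", None, None)
]

df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])

new_df = df.withColumn('null_cnt', reduce(lambda x, y: x + y,
                                         map(lambda x: func.when(func.col(x).isNull(), 1).otherwise(0),
                                             df.schema.names)))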

This checks whether each value is null and assigns 1 or 0 accordingly, then adds up the results to get the count.

new_df.show()

+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null|   1|   a|       1|
|   1|   2|   b|       0|
|   2|null|null|       2|
+----+----+----+--------+

【Comments】: