Check if value greater than zero exists in all columns of dataframe using pyspark: answers

【Question title】: Check if value greater than zero exists in all columns of dataframe using pyspark
【Posted】: 2020-02-10 02:57:07
【Question description】:

   data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show()

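This is the code I am using to get the count of NaN values. I want to write an if-else condition so that, if a particular column contains NaN values, the column's name and its NaN count are printed.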

【Comments on the question】:

See stackoverflow.com/questions/44627386/…
How can I check, for each column, whether the NaN count is greater than 0?

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

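【Solution 1】:

If I understand correctly, you want to filter the columns first and then pass the result to the list comprehension.

For example, suppose you have a df like the one below, where column c is NaN-free: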

from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, count, when
import numpy as np

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, np.nan, 0.0), (np.nan, 2.0, 9.0), (np.nan, 3.0, 8.0), (np.nan, 4.0, 7.0)],
    ('a', 'b', 'c'))

df.show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |1.0|NaN|0.0|
# |NaN|2.0|9.0|
# |NaN|3.0|8.0|
# |NaN|4.0|7.0|
# +---+---+---+

The solution you have so far produces:

df.select([count(when((isnan(c)),c)).alias(c) for c in df.columns]).show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  3|  1|  0|
# +---+---+---+

But you want:

# +---+---+
# |  a|  b|
# +---+---+
# |  3|  1|
# +---+---+

To get that output, you can try this:

rows = df.collect()

# Collect the names of columns that contain at least one NaN
nan_columns = [key for row in rows for key, val in row.asDict().items()
               if isinstance(val, float) and np.isnan(val)]

nan_columns = sorted(set(nan_columns))  # deduplicate; sorted for a stable order
# nan_columns
# ['a', 'b']

df.select([count(when((isnan(c)),c)).alias(c) for c in nan_columns]).show()
# +---+---+
# |  a|  b|
# +---+---+
# |  3|  1|
# +---+---+

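【Comments】:

Using df.collect() seems slow. Is there a better way?
@rosefun If you want to avoid df.collect() on a large dataset with many rows, I suggest first getting the output in the second code block, i.e. with a single select statement. Then apply conditional filtering to drop the columns that are zero in that one-row dataframe.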
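
The approach described in the comment above could be sketched as follows. This is a minimal sketch, not the answerer's own code: the helper names `nonzero_counts` and `nan_counts_without_collect` are hypothetical, and it assumes every column is float or double, as in the example df. The PySpark import is kept inside the function so the filtering helper can be tried without a Spark session.

```python
def nonzero_counts(counts):
    """Keep only the entries of a {column: count} dict whose count is > 0."""
    return {name: n for name, n in counts.items() if n > 0}


def nan_counts_without_collect(df):
    """Count NaNs per column in one aggregation, then drop the zero columns.

    Only the single aggregated row is brought to the driver, not the data.
    Assumes all columns are float/double (hypothetical helper).
    """
    # Imported lazily so nonzero_counts() is usable without pyspark installed.
    from pyspark.sql.functions import count, isnan, when

    counts_row = df.select(
        [count(when(isnan(c), c)).alias(c) for c in df.columns]
    ).first()
    return nonzero_counts(counts_row.asDict())


# Plain-dict stand-in for the aggregated row of the example df.
example = nonzero_counts({"a": 3, "b": 1, "c": 0})
for name, n in example.items():
    print(f"{name}: {n} NaN value(s)")
```

The keys of the returned dict can then be passed to df.select(...) to obtain the two-column output shown in the answer, and the loop at the end prints the column name and NaN count that the question asks for.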

【Solution 2】:

You can adapt the same comprehension to:

df.select([count(when(df[c] > 0, c)).alias(c) for c in df.columns]).show()

But this causes problems when the dataframe has other dtypes. So let's go with:

from pyspark.sql.functions import col
# The next two statements could be a single line, but are split for readability
schema = {name: col_type for name, col_type in df.dtypes}
numeric_columns = [
            name for name, col_type in schema.items()
            if col_type in "int double bigint".split()
        ]

df.select([count(when(col(c) > 0, c)).alias(c) for c in numeric_columns]).show()
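
The dtype filter above can be factored into a small helper. A minimal sketch, assuming `df.dtypes` yields `(name, type_string)` pairs as it does in PySpark; the helper name `numeric_column_names` and the wider set of type strings are illustrative choices, not part of the answer.

```python
# Type strings PySpark reports for common numeric columns (an assumed,
# non-exhaustive set; decimals appear as e.g. "decimal(10,2)").
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_column_names(dtypes):
    """Return names from (name, type_string) pairs whose type is numeric."""
    return [
        name for name, type_string in dtypes
        if type_string in NUMERIC_TYPES or type_string.startswith("decimal")
    ]


# Stand-in for df.dtypes on a dataframe with mixed column types.
sample_dtypes = [("a", "double"), ("name", "string"), ("n", "bigint")]
print(numeric_column_names(sample_dtypes))
```

With a real dataframe, `numeric_column_names(df.dtypes)` could replace the `numeric_columns` list built above.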

【Comments】: