Answers: get dataframe of groupby where all column entries are null

【Question title】: get dataframe of groupby where all column entries are null
【Posted】: 2021-10-15 13:32:54
【Question description】:

I am using PySpark 2.4.5 and have a DataFrame that I have already filtered down to the entries, as part of a groupby, that contain at least one null value:

df_nulls = df.where(reduce(lambda x, y: x | y, (col(c).isNull() for c in df.columns)))

From this, I would like to filter further to remove (and get as a separate DataFrame) all entries for which every column is null.

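Currently I can achieve this for a single column by checking whether both the minimum and the maximum of that column are null, and returning 1 or 0 accordingly: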

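# Both min(c) and max(c) are null only when every value of c in the group is null; eqNullSafe(min, max) would also match a constant non-null column.
agg_expression = [when(min(c).isNull() & max(c).isNull(), 1).otherwise(0).alias(c) for c in columns]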

df_run_all_nulls = df_nulls.groupby("cat1", "cat2", "cat3", "cat4").agg(*agg_expression)

I can then filter this DataFrame further to get the entries associated with null or non-null values:

df_run_all_nulls.where(df_run_all_nulls.col1 == 1).count()

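I could loop and get this information for every column in the dataset (I am interested in the overlap of all-null values across different columns), but I wonder whether there is a better/smarter way to do something like this?

I would also like to know whether there are entries in which all columns are null.

A sample DataFrame of my initial data looks like: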

| cat1 | col1 | col2 | col3 | col4 |
| 1    | 1    | null | null | null | 
| 1    | 2    | null | null | null |
| 2    | 1    | 50   | 0.3  | 2    |
| 2    | 2    | 60   | 0.3  | 6    |
| 1    | 3    | null | null | null |
| 3    | 1    | null | 10   | null |
| 3    | 2    | null | 2    | 2    |
| 3    | 3    | null | 20   | 4    |

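Here cat1 denotes a grouping (in my case a running process), col1 denotes a time step, whose length can vary depending on the process being run, and cols 2 and 3 are the sensor readings at each time step during this process.

So I want to extract two DataFrames from the above. The first contains only the processes for which all sensor data is null. There will, however, be columns in which data is always recorded by default, so the null check should be applied to specific columns.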

| cat1 | col1 | col2 | col3 | col4 |
| 1    | 1    | null | null | null | 
| 1    | 2    | null | null | null |
| 1    | 3    | null | null | null |

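Here just a list of the unique cat1 entries would actually be enough, in this case [1] (though in practice more would be found).

The second DataFrame should contain only the processes in which some of the data contains nulls.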

| cat1 | col1 | col2 | col3 | col4 |
| 3    | 1    | null | 10   | null |
| 3    | 2    | null | 2    | 2    |
| 3    | 3    | null | 20   | 4    |

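【Question discussion】:

Oops, sorry! I have now updated the table to show the two possible null cases: everything in process 1 is null (except the timestamp), and only some records are null in process 3.
To be clear, your question has nothing to do with groupBy. You are not really grouping anything; you are just selecting the rows with null values for a particular ID.
In the first example I want all entries of each group to be null; maybe I phrased my question badly. I assumed that, because I need all entries of each group/category to be null, this would be done with a group by. But a better/different approach is fine too; I am just unsure of the terminology.
What is your Spark version?
PySpark is 2.4.5

Tags: python dataframe apache-spark pyspark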

【Solution 1】:

Let's try this with some Window functions:

from functools import reduce

from pyspark.sql import functions as F, Window as W


exclude_cols = ["cat1", "col1"]

df = reduce(
    lambda a, b: a.withColumn(b["colName"], b["col"]),
    [
        {
            "colName": f"{col}_grp",
            "col": F.max(F.when(F.col(col).isNotNull(), 1).otherwise(0)).over(
                W.partitionBy("cat1")
            ),
        }
        for col in df.columns
        if col not in exclude_cols
    ],
    df,
)

df.show()
+----+----+----+----+----+--------+--------+--------+
|cat1|col1|col2|col3|col4|col2_grp|col3_grp|col4_grp|
+----+----+----+----+----+--------+--------+--------+
|   1|   1|null|null|null|       0|       0|       0|
|   1|   2|null|null|null|       0|       0|       0|
|   1|   3|null|null|null|       0|       0|       0|
|   3|   1|null|10.0|null|       0|       1|       1|
|   3|   2|null| 2.0|   2|       0|       1|       1|
|   3|   3|null|20.0|   4|       0|       1|       1|
|   2|   2|  60| 0.3|   6|       1|       1|       1|
|   2|   1|  50| 0.3|   2|       1|       1|       1|
+----+----+----+----+----+--------+--------+--------+

From this DataFrame, you can select the rows you need with a simple where:

# first dataframe 
df.where(
    F.greatest(*(F.col(col) for col in df.columns if col.endswith("_grp"))) == 0
).show()
+----+----+----+----+----+--------+--------+--------+
|cat1|col1|col2|col3|col4|col2_grp|col3_grp|col4_grp|
+----+----+----+----+----+--------+--------+--------+
|   1|   1|null|null|null|       0|       0|       0|
|   1|   2|null|null|null|       0|       0|       0|
|   1|   3|null|null|null|       0|       0|       0|
+----+----+----+----+----+--------+--------+--------+

# second one (which theoretically should include ID 1 also)
df.where(
    F.least(*(F.col(col) for col in df.columns if col.endswith("_grp"))) == 0
).show()
+----+----+----+----+----+--------+--------+--------+
|cat1|col1|col2|col3|col4|col2_grp|col3_grp|col4_grp|
+----+----+----+----+----+--------+--------+--------+
|   1|   1|null|null|null|       0|       0|       0|
|   1|   2|null|null|null|       0|       0|       0|
|   1|   3|null|null|null|       0|       0|       0|
|   3|   1|null|10.0|null|       0|       1|       1|
|   3|   2|null| 2.0|   2|       0|       1|       1|
|   3|   3|null|20.0|   4|       0|       1|       1|
+----+----+----+----+----+--------+--------+--------+

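【Discussion】:

Thanks, this looks like it will do the job. I will study your answer a bit and try to understand it first.
Do not hesitate to ask questions. As you can see, there is no groupBy. I used a window function with partitionBy, which plays the same role but without aggregating.
So in general I can use window functions to work on "grouped" data when I do not necessarily want it aggregated, as it would be with groupby. Do you have a good book/resource you could recommend for getting started with querying and analysing data? It is something I have not had much practice with, so when it comes to SQL/Spark I lack some of the terminology and awareness of how to approach problems.
@Aesir Nothing to recommend, sorry. I forged myself through experience. But yes, you got it.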