How to make values of a nested field null in PySpark? Answers

【Question title】: How to make values of a nested field to null in PySpark?
【Posted】: 2018-09-25 09:04:00
【Question description】:

Consider the following schema:

root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)
 |-- C: string (nullable = true)
 |-- D: struct (nullable = true)
 |    |-- d1: struct (nullable = true)
 |    |    |-- timestamp: string (nullable = true)
 |    |    |-- timeZoneType: string (nullable = true)
 |    |    |-- zoneName: string (nullable = true)
 |    |-- d2: string (nullable = true)
 |    |-- d3: string (nullable = true)
 |-- E: array (nullable = true)
 |    |-- e1: struct (nullable = true)
 |    |    |-- transactionId: string (nullable = true)
 |    |    |-- timeStamp: string (nullable = true)
 |    |    |-- instanceId: string (nullable = true)
 |    |    |-- userId: string (nullable = true)
 |    |    |-- reason: string (nullable = true)
 |    |-- e2: array (nullable = true)
 |    |    |-- transactionId: string (nullable = true)
 |    |    |-- timeStamp: string (nullable = true)
 |    |    |-- instanceId: string (nullable = true)
 |    |    |-- userId: string (nullable = true)
 |    |    |-- reason: string (nullable = true)
 |    |    |-- additionalData: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

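How can I remove the values of a set of columns from a DataFrame in PySpark without removing the columns themselves from the schema? This is different from dropping specific columns from the schema altogether.

Suppose the columns to keep are given in the list keepColumns. I want to replace the entries of all other columns with NULL while leaving the entries of keepColumns unchanged.

For example,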

keepColumns = ["C",
               "D.d1.zoneName",
               "E.e1.reason",
               "E.e2.timeStamp"]

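Note the nested Array and Struct fields. I cannot even use select on the subfields of an ArrayType unless I use an index, as in select E.e2[0].timeStamp from table1 (after calling df.createOrReplaceTempView("table1")).

Following the highest-voted solution given in this post did not work either. The output just shows the existing values unchanged.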

【Question discussion】:

Tags: python-3.x pyspark

【Solution 1】:

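I ran into the same problem with nested struct fields: I wanted them to be StringType but filled with nulls. I could not get Spark to keep the type without starting from an empty string first.

This worked for me, using a UDF on an empty string so that Spark still infers StringType (slightly modified from your UDF):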

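    from pyspark.sql import functions as F

    def nullify(col):
        # Turn empty strings into NULL while keeping the column's string type.
        return F.when(col == '', F.lit(None)).otherwise(col)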


    # Does not work: a bare NULL literal is inferred as null type, not string
    >>> df.select(F.struct(F.lit(None).alias('test'))).printSchema()
    root
     |-- named_struct(test, NULL AS `test`): struct (nullable = false)
     |    |-- test: null (nullable = true)

    # Works!
    >>> df.select(F.struct(nullify(F.lit('')).alias('test'))).printSchema()
    root
     |-- named_struct(test, nullify() AS `test`): struct (nullable = false)
     |    |-- test: string (nullable = true)

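Note that I create the structs dynamically, so I apply this at the moment I build them. If you have already read in a struct, it is different: in that case you have to flatten it and rebuild it.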
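
For a struct that has already been read, the rebuild step can be sketched as plain string generation: walk the schema, keep the fields named in keepColumns, and emit a typed NULL (`CAST(NULL AS string)`, the SQL counterpart of `F.lit(None).cast("string")`) for everything else. This is only a sketch under assumptions, not a tested recipe for the question's data. The names `ddl`, `keep_expr`, `select_exprs` and `field` are made up for illustration. The input is assumed to be the dict returned by `df.schema.jsonValue()`. `E` is assumed to be a struct whose `e1` is an array of structs. The array branch relies on the SQL `transform` function, which needs Spark 2.4 or later. Field names containing quotes or backticks are not handled.

```python
# Sketch: generate Spark SQL expressions that keep only the listed fields and
# replace everything else with typed NULLs. All helper names are hypothetical.

def ddl(dtype):
    """Render a type taken from df.schema.jsonValue() as a Spark SQL type string."""
    if isinstance(dtype, str):  # atomic types are plain strings, e.g. "string"
        return dtype
    kind = dtype["type"]
    if kind == "struct":
        inner = ",".join(f"`{f['name']}`:{ddl(f['type'])}" for f in dtype["fields"])
        return f"struct<{inner}>"
    if kind == "array":
        return f"array<{ddl(dtype['elementType'])}>"
    if kind == "map":
        return f"map<{ddl(dtype['keyType'])},{ddl(dtype['valueType'])}>"
    raise ValueError(f"unsupported type: {kind}")


def keep_expr(dtype, path, prefix, keep, depth=0):
    """SQL expression for the field at `path`, whose dotted name is `prefix`."""
    if prefix in keep:
        return path  # listed in keepColumns: pass the value through untouched
    if not any(k.startswith(prefix + ".") for k in keep):
        return f"CAST(NULL AS {ddl(dtype)})"  # nothing kept below: typed NULL
    kind = dtype["type"] if isinstance(dtype, dict) else dtype
    if kind == "struct":
        # Rebuild the struct field by field, recursing into each child.
        parts = ", ".join(
            f"'{f['name']}', "
            + keep_expr(f["type"], f"{path}.`{f['name']}`", f"{prefix}.{f['name']}", keep, depth)
            for f in dtype["fields"]
        )
        return f"named_struct({parts})"
    if kind == "array":
        # Rebuild every element; the element shares the array's dotted name.
        var = f"x{depth}"
        inner = keep_expr(dtype["elementType"], var, prefix, keep, depth + 1)
        return f"transform({path}, {var} -> {inner})"
    raise ValueError(f"cannot descend into {kind} at {prefix}")


def select_exprs(schema_json, keep_columns):
    """One 'expr AS `name`' string per top-level column."""
    keep = set(keep_columns)
    return [
        keep_expr(f["type"], f"`{f['name']}`", f["name"], keep) + f" AS `{f['name']}`"
        for f in schema_json["fields"]
    ]


def field(name, dtype):
    """Hand-written stand-in for one entry of df.schema.jsonValue()['fields']."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def struct(*fields):
    return {"type": "struct", "fields": list(fields)}


# A trimmed-down version of the schema from the question (assumed shape for E).
schema_json = struct(
    field("A", "string"),
    field("C", "string"),
    field("D", struct(
        field("d1", struct(field("timestamp", "string"), field("zoneName", "string"))),
        field("d2", "string"),
    )),
    field("E", struct(
        field("e1", {
            "type": "array",
            "elementType": struct(field("reason", "string"), field("userId", "string")),
            "containsNull": True,
        }),
    )),
)

exprs = select_exprs(schema_json, ["C", "D.d1.zoneName", "E.e1.reason"])
# With a live DataFrame this would be used as: df.selectExpr(*exprs)
```

One caveat of this approach: a struct rebuilt with `named_struct` is never NULL itself, so a row where `D` was NULL comes back as a struct of NULL fields. If that distinction matters, the rebuilt expression would need an extra NULL check around it.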

【Discussion】: