PySpark - Retain null values when using collect_list

【Question title】: PySpark - Retain null values when using collect_list
【Posted】: 2018-08-29 22:20:44
【Question】:

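According to the accepted answer in "pyspark collect_set or collect_list with groupby", when you run collect_list on a column, the null values in that column are removed. I have checked, and this is true.

But in my case I need to keep the null values. How can I achieve this?

I did not find any information on such a variant of the collect_list function.

Background on why I want the nulls:

I have a dataframe df as follows: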

cId   |  eId  |  amount  |  city
1     |  2    |   20.0   |  Paris
1     |  2    |   30.0   |  Seoul
1     |  3    |   10.0   |  Phoenix
1     |  3    |   5.0    |  null

I want to write it to an Elasticsearch index with the following mapping:

"mappings": {
    "doc": {
        "properties": {
            "eId": { "type": "keyword" },
            "cId": { "type": "keyword" },
            "transactions": {
                "type": "nested", 
                "properties": {
                    "amount": { "type": "keyword" },
                    "city": { "type": "keyword" }
                }
            }
        }
    }
 }

To conform to the nested mapping above, I transformed my df so that for each combination of eId and cId I have an array of transactions like this:

from pyspark.sql.functions import collect_list, struct

df_nested = df.groupBy('eId', 'cId').agg(collect_list(struct('amount', 'city')).alias("transactions"))
df_nested.printSchema()
root
 |-- cId: integer (nullable = true)
 |-- eId: integer (nullable = true)
 |-- transactions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: float (nullable = true)
 |    |    |-- city: string (nullable = true)

Saving df_nested as a JSON file gives these records:

{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}

As you can see, for cId=1 and eId=3, one of my array elements (the one with amount=5.0) has no city attribute, because it was null in my original data (df). The nulls are removed when I use the collect_list function.

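However, when I try to write df_nested to Elasticsearch with the index above, it fails because of a schema mismatch. This is basically why I want to keep my nulls after applying the collect_list function.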
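
To make the target concrete, here is a minimal plain-Python sketch (not Spark) of the document shape the nested mapping expects, built from the four sample rows. The helper name `nest_transactions` is hypothetical and used only for illustration; the point is that every transaction keeps its `city` key, with an explicit null when the value is missing.

```python
import json
from collections import defaultdict

# Sample rows from the question: (cId, eId, amount, city); None stands in for null.
rows = [
    (1, 2, 20.0, "Paris"),
    (1, 2, 30.0, "Seoul"),
    (1, 3, 10.0, "Phoenix"),
    (1, 3, 5.0, None),
]


def nest_transactions(rows):
    """Group rows by (cId, eId) and keep every field, including None values."""
    groups = defaultdict(list)
    for c_id, e_id, amount, city in rows:
        groups[(c_id, e_id)].append({"amount": amount, "city": city})
    return [
        {"cId": c_id, "eId": e_id, "transactions": txns}
        for (c_id, e_id), txns in groups.items()
    ]


docs = nest_transactions(rows)
for doc in docs:
    # json.dumps writes None as an explicit null instead of dropping the key.
    print(json.dumps(doc))
```

In this sketch the last transaction of the second document serializes as `{"amount": 5.0, "city": null}`, which is the shape the question is asking Spark to produce.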

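【Discussion】:

Why do you need the nulls? Can you provide a sample DataFrame and the desired output?
@pault I need the nulls because I am trying to create a nested dataframe and write it to Elasticsearch, so the dataframe's schema must exactly match the Elasticsearch mapping I set up. I will update my question to show an example.
@pault - Updated my question to provide better context.
Is it possible to replace the null values with something else, such as the string 'null'?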

Tags: nested pyspark-sql collect elasticsearch-hadoop elasticsearch-mapping

【Solution 1】:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as f
    from pyspark.sql.functions import create_map, collect_list, lit, col
    
    app_name = "CollList"
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    
    # Sample data from the question; None becomes a null city.
    df = spark.createDataFrame(
        [[1, 2, 20.0, "Paris"], [1, 2, 30.0, "Seoul"],
         [1, 3, 10.0, "Phoenix"], [1, 3, 5.0, None]],
        ["cId", "eId", "amount", "city"])
    print("Actual data")
    df.show(10, False)
```
Actual data
+---+---+------+-------+
|cId|eId|amount|city   |
+---+---+------+-------+
|1  |2  |20.0  |Paris  |
|1  |2  |30.0  |Seoul  |
|1  |3  |10.0  |Phoenix|
|1  |3  |5.0   |null   |
+---+---+------+-------+
```
    # collect_list over to_json(struct(...)): to_json omits null fields by default,
    # so the null city is missing from the resulting JSON string.
    value_cols = [c for c in df.columns if c not in ('cId', 'eId')]
    df1 = df.groupBy(f.col('city'))\
            .agg(f.collect_list(f.to_json(f.struct(*value_cols))).alias('newcol'))
    print("Collect List Data - Missing Null Columns in the list")
    df1.show(10, False)
```
Collect List Data - Missing Null Columns in the list
+-------+----------------------------------+
|city   |newcol                            |
+-------+----------------------------------+
|Phoenix|[{"amount":10.0,"city":"Phoenix"}]|
|null   |[{"amount":5.0}]                  |
|Paris  |[{"amount":20.0,"city":"Paris"}]  |
|Seoul  |[{"amount":30.0,"city":"Seoul"}]  |
+-------+----------------------------------+
```
    # Build alternating key/value arguments for create_map: lit(name), col(name), ...
    # A map keeps its null values when serialized, but all values share one type (string here).
    map_args = []
    for x in (c for c in df.columns if c not in ('cId', 'eId')):
        map_args.append(lit(x))
        map_args.append(col(x))
    
    grp_by = ["eId", "cId"]
    df_nested = df.withColumn("transactions", create_map(map_args))\
                  .groupBy(grp_by)\
                  .agg(collect_list(f.to_json("transactions")).alias("transactions"))
    
    print("collect list after create_map")
    df_nested.show(10, False)
```
collect list after create_map
+---+---+--------------------------------------------------------------------+
|eId|cId|transactions                                                        |
+---+---+--------------------------------------------------------------------+
|2  |1  |[{"amount":"20.0","city":"Paris"}, {"amount":"30.0","city":"Seoul"}]|
|3  |1  |[{"amount":"10.0","city":"Phoenix"}, {"amount":"5.0","city":null}]  |
+---+---+--------------------------------------------------------------------+
```

【Discussion】:

Keep in mind that create_map converts key: value to string: string, so the values of amount are strings rather than floats.
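
The caveat above can be seen by parsing one element of the `transactions` array from the create_map output. This is a minimal plain-Python sketch, assuming a consumer that reads the JSON strings back and wants numeric amounts; `parse_transaction` is a hypothetical helper, not part of PySpark or Elasticsearch.

```python
import json

# One element of the "transactions" array from the create_map output above.
raw = '{"amount":"5.0","city":null}'


def parse_transaction(raw_json):
    """Parse a map-style JSON string and restore amount to a float."""
    record = json.loads(raw_json)
    # create_map stored amount as a string, so cast it back explicitly.
    if record["amount"] is not None:
        record["amount"] = float(record["amount"])
    return record


txn = parse_transaction(raw)
print(txn)
```

Whether the cast matters depends on the target: the mapping in the question declares `amount` as `keyword`, so string values may already be acceptable there.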