【Question title】: PySpark dataframe to_json() function
【Posted】: 2018-09-11 04:18:49
【Question】:

I have a DataFrame like the one below:

>>> df.show(10,False)
+-----+----+---+------+
|id   |name|age|salary|
+-----+----+---+------+
|10001|alex|30 |75000 |
|10002|bob |31 |80000 |
|10003|deb |31 |80000 |
|10004|john|33 |85000 |
|10005|sam |30 |75000 |
+-----+----+---+------+

I convert each entire row of df into a new column "jsonCol":

>>> newDf1 = df.withColumn("jsonCol", to_json(struct([df[x] for x in df.columns])))
>>> newDf1.show(10,False)
+-----+----+---+------+--------------------------------------------------------+
|id   |name|age|salary|jsonCol                                                 |
+-----+----+---+------+--------------------------------------------------------+
|10001|alex|30 |75000 |{"id":"10001","name":"alex","age":"30","salary":"75000"}|
|10002|bob |31 |80000 |{"id":"10002","name":"bob","age":"31","salary":"80000"} |
|10003|deb |31 |80000 |{"id":"10003","name":"deb","age":"31","salary":"80000"} |
|10004|john|33 |85000 |{"id":"10004","name":"john","age":"33","salary":"85000"}|
|10005|sam |30 |75000 |{"id":"10005","name":"sam","age":"30","salary":"75000"} |
+-----+----+---+------+--------------------------------------------------------+

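Instead of converting the entire row to a JSON string as in the step above, I need a solution that selects only some of the columns, based on the value of each field. The command below uses an example condition.

But as soon as I use the when function, the column names (keys) in the resulting JSON string disappear. I get positional names instead of the actual column names (keys):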

>>> newDf2 = df.withColumn("jsonCol", to_json(struct([ when(col(x)!="  ",df[x]).otherwise(None) for x in df.columns])))
>>> newDf2.show(10,False)
+-----+----+---+------+---------------------------------------------------------+
|id   |name|age|salary|jsonCol                                                  |
+-----+----+---+------+---------------------------------------------------------+
|10001|alex|30 |75000 |{"col1":"10001","col2":"alex","col3":"30","col4":"75000"}|
|10002|bob |31 |80000 |{"col1":"10002","col2":"bob","col3":"31","col4":"80000"} |
|10003|deb |31 |80000 |{"col1":"10003","col2":"deb","col3":"31","col4":"80000"} |
|10004|john|33 |85000 |{"col1":"10004","col2":"john","col3":"33","col4":"85000"}|
|10005|sam |30 |75000 |{"col1":"10005","col2":"sam","col3":"30","col4":"75000"} |
+-----+----+---+------+---------------------------------------------------------+

I need to use the when function and still get the result shown for newDf1, with the actual column names as keys. Can anyone help?

    Tags: apache-spark pyspark apache-spark-sql


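    【Solution 1】:

    You are passing conditional expressions as the columns of the struct function. A when(...).otherwise(...) expression has no name of its own, so struct names the fields by position: col1, col2, and so on. That is why you need alias to give each expression its original column name back.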

    from pyspark.sql import functions as F
    # alias(x) restores the original column name, which to_json then uses as the key
    cols = [F.when(F.col(x) != "  ", df[x]).otherwise(None).alias(x) for x in df.columns]
    newDf2 = df.withColumn("jsonCol", F.to_json(F.struct(cols)))
    newDf2.show(truncate=False)
    
