【问题标题】:PySpark: How to Update Nested Columns?PySpark:如何更新嵌套列?
【发布时间】:2019-04-25 14:15:24
【问题描述】:

StackOverflow 提供了一些关于如何更新数据框中嵌套列的答案。但是,其中一些看起来有点复杂。

在搜索时,我从 DataBricks 中找到了处理相同场景的此文档:https://docs.databricks.com/user-guide/faq/update-nested-column.html

val updated = df.selectExpr("""
    named_struct(
        'metadata', metadata,
        'items', named_struct(
          'books', named_struct('fees', items.books.fees * 1.01),
          'paper', items.paper
        )
    ) as named_struct
""").select($"named_struct.metadata", $"named_struct.items")

这看起来也很干净。不幸的是,我不知道 Scala。我如何将它翻译成 Python?

【问题讨论】:

    标签: scala apache-spark pyspark apache-spark-sql


    【解决方案1】:

    这可能会帮助您入门;使用 1 行 ex 将您的 Databricks 链接转换为 python。供您探索

    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    
    schema = StructType()\
    .add("metadata", StructType()\
         .add("eventid", IntegerType(), True)\
         .add("hostname", StringType(), True)\
         .add("timestamp", StringType(), True))\
    .add("items", StructType()\
         .add("books", StructType()\
             .add("fees", DoubleType(), True))\
         .add("paper", StructType()\
             .add("pages", IntegerType(), True)))
    
    nested_row = [
    
        (
            {
                "metadata": {
                    "eventid": 9,
                    "hostname": "999.999.999",
                    "timestamp": "9999-99-99 99:99:99"
                },
                "items": {
                    "books": {
                        "fees": 99.99
                    },
    
                    "paper": {
                        "pages": 9999
                    }
                }
            }
        )
    ]
    
    df = spark.createDataFrame(nested_row, schema)
    
    df.printSchema()
    
    df.selectExpr("""
        named_struct(
            'metadata', metadata,
            'items', named_struct(
              'books', named_struct('fees', items.books.fees * 1.01),
              'paper', items.paper
            )
        ) as named_struct
    """).select(col("named_struct.metadata"), col("named_struct.items"))\
    .show(truncate=False)
    
    root
     |-- metadata: struct (nullable = true)
     |    |-- eventid: integer (nullable = true)
     |    |-- hostname: string (nullable = true)
     |    |-- timestamp: string (nullable = true)
     |-- items: struct (nullable = true)
     |    |-- books: struct (nullable = true)
     |    |    |-- fees: double (nullable = true)
     |    |-- paper: struct (nullable = true)
     |    |    |-- pages: integer (nullable = true)
    
    +-------------------------------------+-----------------+
    |metadata                             |items            |
    +-------------------------------------+-----------------+
    |[9, 999.999.999, 9999-99-99 99:99:99]|[[99.99], [9999]]|
    +-------------------------------------+-----------------+
    
    +-------------------------------------+------------------------------+
    |metadata                             |items                         |
    +-------------------------------------+------------------------------+
    |[9, 999.999.999, 9999-99-99 99:99:99]|[[100.98989999999999], [9999]]|
    +-------------------------------------+------------------------------+
    
    

    【讨论】:

    • 能否请您链接到 pyspark 文档,其中描述了 named_struct 的用法?
    猜你喜欢
    • 1970-01-01
    • 2021-09-29
    • 2018-01-31
    • 1970-01-01
    • 2022-06-13
    • 1970-01-01
    • 1970-01-01
    • 2020-04-30
    • 1970-01-01
    相关资源
    最近更新 更多