【Question title】: How to Update Schema in PySpark
【Posted】: 2021-05-24 00:45:44
【Question】:

I have a JSON dataset, which I read and whose schema I print as follows:

myjsondata = spark.read.json("/FileStore/tables/customer.json")

myjsondata.printSchema()

I want to update this schema, so I wrote the DDL string below:

myjsondataDDL="address_id INT,birth_country String,birthdate date,customer_id INT,demographics STRUCT<buy_potential: string,credit_rating: string,education_status: string,income_range: array<>,purchase_estimate:INT,vehicle_count: INT>,email_address: string,firstname: string,gender: string,is_preffered_customer: string,lastname: string,salutation: string" 

I am unable to update the schema this way. How should I do it?

【Comments】:

    Tags: python json apache-spark pyspark apache-spark-sql


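    【Solution 1】:

    Try the schema below. Your schema has a few syntax errors: it contains some unneeded colons (a colon is only used after a field name inside a struct type), and the array is missing its element type.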

    myjsondataDDL = """
        address_id INT,
        birth_country String,
        birthdate date,
        customer_id INT,
        demographics STRUCT<buy_potential: string, credit_rating: string, education_status: string, income_range: array<int>, purchase_estimate:INT, vehicle_count: INT>,
        email_address string,
        firstname string,
        gender string,
        is_preffered_customer string,
        lastname string,
        salutation string
    """
    myjsondata = spark.read.schema(myjsondataDDL).json('absolute path of file')
    

    【Comments】:
