How to Update Schema in PySpark

[Question title]: How to Update Schema in PySpark
[Posted]: 2021-05-24 00:45:44
[Question description]:

I have a JSON dataset with the following schema:

myjsondata = spark.read.json("/FileStore/tables/customer.json")

myjsondata.printSchema()

I want to update this schema, so I wrote the following schema definition:

myjsondataDDL="address_id INT,birth_country String,birthdate date,customer_id INT,demographics STRUCT<buy_potential: string,credit_rating: string,education_status: string,income_range: array<>,purchase_estimate:INT,vehicle_count: INT>,email_address: string,firstname: string,gender: string,is_preffered_customer: string,lastname: string,salutation: string"

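I am unable to update the schema this way. How should I do it?

[Comments]:

Tags: python json apache-spark pyspark apache-spark-sql

[Solution 1]:

Try the schema below. Your schema has a few syntax errors: it contains unnecessary colons (a colon is only used after a field name inside a struct type), and the array type is missing its element type.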

myjsondataDDL = """
    address_id INT,
    birth_country String,
    birthdate date,
    customer_id INT,
    demographics STRUCT<buy_potential: string, credit_rating: string, education_status: string, income_range: array<int>, purchase_estimate:INT, vehicle_count: INT>,
    email_address string,
    firstname string,
    gender string,
    is_preffered_customer string,
    lastname string,
    salutation string
"""
# Apply the explicit schema at read time instead of relying on schema inference
myjsondata = spark.read.schema(myjsondataDDL).json('absolute path of file')
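
To make the two rules in this answer concrete, here is a minimal plain-Python sketch that scans a DDL string for top-level colons and for `array<>` with no element type. The helper `find_ddl_issues` and the shortened sample strings are hypothetical and written only for illustration. They are not part of PySpark, and this check does not replace Spark's own parser.

```python
import re
from typing import List


def find_ddl_issues(ddl: str) -> List[str]:
    """Hypothetical helper (not a PySpark API): flag the two mistakes described above."""
    issues = []
    depth = 0  # nesting depth of <...>; depth 0 means a top-level column
    for i, ch in enumerate(ddl):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == ":" and depth == 0:
            # A colon outside any STRUCT<...> follows a top-level column name
            name = re.search(r"(\w+)\s*$", ddl[:i])
            issues.append("top-level colon after '%s'" % (name.group(1) if name else "?"))
    # An array type must name its element type, e.g. array<int>
    for m in re.finditer(r"(\w+)\s*:?\s*array\s*<\s*>", ddl, flags=re.IGNORECASE):
        issues.append("array<> without an element type for '%s'" % m.group(1))
    return issues


# Shortened samples modelled on the question's schema and on the corrected one
question_ddl = "customer_id INT,demographics STRUCT<income_range: array<>,vehicle_count: INT>,email_address: string"
fixed_ddl = "customer_id INT, demographics STRUCT<income_range: array<int>, vehicle_count: INT>, email_address string"

for issue in find_ddl_issues(question_ddl):
    print(issue)
print(find_ddl_issues(fixed_ddl))
```

A check like this only catches these two patterns. Passing the string to `spark.read.schema(...)` remains the real validation, because Spark parses the DDL itself.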

[Comments]: