【Question title】: Pyspark: Create Schema from Json Schema involving Array columns
【Posted】: 2019-10-13 19:09:22
【Question】:

I have defined the schema for my df in a JSON file as follows:

{
    "table1":{
        "fields":[
            {"metadata":{}, "name":"first_name", "type":"string", "nullable":false},
            {"metadata":{}, "name":"last_name", "type":"string", "nullable":false},
            {"metadata":{}, "name":"subjects", "type":"array","items":{"type":["string", "string"]}, "nullable":false},
            {"metadata":{}, "name":"marks", "type":"array","items":{"type":["integer", "integer"]}, "nullable":false},
            {"metadata":{}, "name":"dept", "type":"string", "nullable":false}       
        ]
    }

}

Example JSON data:

{
    "table1": [
        {
            "first_name":"john",
            "last_name":"doe",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"dan",
            "last_name":"steyn",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"rose",
            "last_name":"wayne",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"            
        },
        {
            "first_name":"nat",
            "last_name":"lee",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"jim",
            "last_name":"lim",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        }       
    ]
}

I want to create the equivalent Spark schema from this JSON file. Below is my code (reference: Create spark dataframe schema from json schema representation):

import json
from pyspark.sql.types import StructType

with open(schemaFile) as s:
    schema = json.load(s)["table1"]
    source_schema = StructType.fromJson(schema)

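The code above works fine when the schema has no array columns. When the schema does contain array columns, it raises the following error.

"Could not parse datatype: array" ("Could not parse datatype: %s" json_value)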

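【Comments】:

  • Have you tried doing it the other way around? Create the schema as a Python object, including the arrays, then convert it to JSON and see what differs.
  • The schema provided is invalid: a comma is missing after "items":{"type":["string", "string"]}. I think it would be better to post your actual data, or simply load the JSON in Spark and then export the schema that Spark creates.
  • @AlexandrosBiratsis: The schema has been updated. My actual data is a CSV file. I am trying to keep this schema in a JSON file that holds multiple schemas, and when reading the CSV file in Spark I will refer to this JSON file to get the right schema, so that the column headers and data types are correct.
  • Yes, I saw that @blackfury, but your schema is still invalid! "items":{"type":["string", "string"]} is not a valid definition. What exactly are you trying to express here? Can you post some actual JSON data?
  • @AlexandrosBiratsis: Added sample JSON data.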

Tags: json dataframe pyspark schema


【Solution 1】:

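In your case the problem is the way the array is represented. Spark expects the array type to be a nested object under "type", with "elementType" and "containsNull" keys. The correct syntax is:

{ "metadata": {}, "name": "marks", "nullable": true, "type": {"containsNull": true, "elementType": "long", "type": "array" } }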
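The same correction can be applied to every field in the question's schema. The sketch below is one way to do it. The helper name to_spark_field is hypothetical, and it assumes all elements of an array share one type, so it takes the first entry of the question's "items" list. The "items" notation itself is not something Spark understands. The pyspark call is left as a comment so the conversion can be checked without a Spark installation.

```python
import json

# The field definitions from the question, using its non-Spark "items" notation.
question_fields = [
    {"metadata": {}, "name": "first_name", "type": "string", "nullable": False},
    {"metadata": {}, "name": "last_name", "type": "string", "nullable": False},
    {"metadata": {}, "name": "subjects", "type": "array",
     "items": {"type": ["string", "string"]}, "nullable": False},
    {"metadata": {}, "name": "marks", "type": "array",
     "items": {"type": ["integer", "integer"]}, "nullable": False},
    {"metadata": {}, "name": "dept", "type": "string", "nullable": False},
]


def to_spark_field(field):
    """Rewrite one field so that array columns use Spark's nested type object."""
    converted = {
        "metadata": field.get("metadata", {}),
        "name": field["name"],
        "nullable": field.get("nullable", True),
        "type": field["type"],
    }
    if field["type"] == "array":
        item_types = field["items"]["type"]
        # Assumption: every element has the same type, so the first entry is used.
        element_type = item_types[0] if isinstance(item_types, list) else item_types
        converted["type"] = {
            "type": "array",
            "elementType": element_type,
            "containsNull": True,
        }
    return converted


spark_schema = {"type": "struct", "fields": [to_spark_field(f) for f in question_fields]}
print(json.dumps(spark_schema, indent=4))

# With pyspark installed, the converted dict can then be parsed:
# from pyspark.sql.types import StructType
# source_schema = StructType.fromJson(spark_schema)
```

Rewriting the schema file once in Spark's own layout is probably simpler than converting on every read; the converter is mainly useful if the "items" notation has to stay for other tools.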

To retrieve the schema from the JSON data, you can use the following PySpark snippet:

jsonData = """{
    "table1": [{
            "first_name": "john",
            "last_name": "doe",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "dan",
            "last_name": "steyn",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "rose",
            "last_name": "wayne",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "nat",
            "last_name": "lee",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "jim",
            "last_name": "lim",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        }
    ]
}"""

df = spark.read.json(sc.parallelize([jsonData]))

df.schema.json()

This should output:

{
    "fields": [{
        "metadata": {},
        "name": "table1",
        "nullable": true,
        "type": {
            "containsNull": true,
            "elementType": {
                "fields": [{
                    "metadata": {},
                    "name": "dept",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "first_name",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "last_name",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "marks",
                    "nullable": true,
                    "type": {
                        "containsNull": true,
                        "elementType": "long",
                        "type": "array"
                    }
                }, {
                    "metadata": {},
                    "name": "subjects",
                    "nullable": true,
                    "type": {
                        "containsNull": true,
                        "elementType": "string",
                        "type": "array"
                    }
                }],
                "type": "struct"
            },
            "type": "array"
        }
    }],
    "type": "struct"
}

Alternatively, you can use df.schema.simpleString(), which returns a more compact schema format:

struct<table1:array<struct<dept:string,first_name:string,last_name:string,marks:array<bigint>,subjects:array<string>>>>

Finally, you can save the schema JSON above to a file and load it again later with:

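import json
from pyspark.sql.types import StructType
# schema_json is the string returned by df.schema.json(), e.g. read back from the file
new_schema = StructType.fromJson(json.loads(schema_json))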

This is the same approach you already use. Keep in mind that the same procedure can be applied dynamically to any JSON data.
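For the use case described in the comments (one JSON file holding several schemas, looked up by table name), a minimal sketch could look like this. The function names, the file name schemas.json, and the "table2" entry are made up for illustration, and table1 is shortened to two fields. Only the standard library is used here; the pyspark call is shown as a comment.

```python
import json
import os
import tempfile

# Schemas in the layout produced by df.schema.json(). "table2" is a made-up
# second entry to show several schemas living in one file.
schemas = {
    "table1": {
        "type": "struct",
        "fields": [
            {"metadata": {}, "name": "first_name", "nullable": False, "type": "string"},
            {"metadata": {}, "name": "marks", "nullable": False,
             "type": {"type": "array", "elementType": "integer", "containsNull": True}},
        ],
    },
    "table2": {
        "type": "struct",
        "fields": [
            {"metadata": {}, "name": "id", "nullable": False, "type": "long"},
        ],
    },
}


def save_schemas(path, all_schemas):
    """Write every schema to one JSON file, keyed by table name."""
    with open(path, "w") as f:
        json.dump(all_schemas, f, indent=4)


def load_schema_dict(path, table_name):
    """Read the file back and return the schema dict for a single table."""
    with open(path) as f:
        return json.load(f)[table_name]


schema_path = os.path.join(tempfile.mkdtemp(), "schemas.json")
save_schemas(schema_path, schemas)
table1_dict = load_schema_dict(schema_path, "table1")

# With pyspark installed:
# from pyspark.sql.types import StructType
# table1_schema = StructType.fromJson(table1_dict)
```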
