【Question title】: How can I format a JSON string after conversion from a pyspark dataframe?
【Posted】: 2018-01-23 06:44:14
【Question description】:

I converted a dataframe to JSON in pyspark using toJSON, which returns each row as a JSON string. But I would like to reformat the result.

My code is as follows:

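from pyspark.sql import SparkSession

# Raw string so the backslashes in the Windows path are not treated as escapes.
spark=SparkSession.builder.config("spark.sql.warehouse.dir", r"C:\spark\spark-warehouse").appName("TestApp").enableHiveSupport().getOrCreate()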
sqlstring="SELECT lflow1.LeaseType as LeaseType, lflow1.Status as Status, lflow1.Property as property, lflow1.City as City, lesflow2.DealType as DealType, lesflow2.Area as Area, lflow1.Did as DID, lesflow2.MID as MID from lflow1, lesflow2  WHERE lflow1.Did = lesflow2.MID"

def queryBuilder(sqlval):
    df=spark.sql(sqlval)
    df.show()
    return df

result=queryBuilder(sqlstring)
resultlist=result.toJSON().collect()
print(resultlist)
print("Type of",type(resultlist))

After this, the output is:

[
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"10230104","City":"Edmonton","DealType":"Renewal","Area":"2312","DID":"79cc3959ffc8403f943ff0e7e93584f8","MID":"79cc3959ffc8403f943ff0e7e93584f8"}',
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"784","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}', 
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"2223","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}', 
    '{"LeaseType":"Offer to Lease","Status":"Conditional","property":"106PORTW","City":"Toronto","DealType":"Renewal","Area":"2212","DID":"69c3af0527014fd99d1ccf156c0bce93","MID":"69c3af0527014fd99d1ccf156c0bce93"}', 
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"106PORTW","City":"Toronto","DealType":"0","Area":"","DID":"04aedb01da5d44fead7e1315115c2530","MID":"04aedb01da5d44fead7e1315115c2530"}'
]

But I want to format this JSON array like the following, shown here for the first two rows:

[
    {
        "LeaseType": "Offer to Lease",
        "Status": "Fully Executed",
        "property": "10230104",
        "City": "Edmonton",
        "DealType": "Renewal",
        "Area": "2312",
        "DID": "79cc3959ffc8403f943ff0e7e93584f8",
        "MID": "79cc3959ffc8403f943ff0e7e93584f8"
    },
    {
        "LeaseType": "Offer to Renew",
        "Status": "Fully Executed",
        "property": "1040HAMI",
        "City": "Vancouver",
        "DealType": "Renewal",
        "Area": "784",
        "DID": "ecf922d0583247c0a4cb591bd4b3843e",
        "MID": "ecf922d0583247c0a4cb591bd4b3843e"
    }
]

I want to get rid of the ' (the single quotes around each row) here.

【Comments】:

    Tags: json apache-spark pyspark apache-spark-sql


    【Solution 1】:
    import re
    import json
    
    resultlist = [
        '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"10230104","City":"Edmonton","DealType":"Renewal","Area":"2312","DID":"79cc3959ffc8403f943ff0e7e93584f8","MID":"79cc3959ffc8403f943ff0e7e93584f8"}',
        '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"784","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}',
        '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"2223","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}',
        '{"LeaseType":"Offer to Lease","Status":"Conditional","property":"106PORTW","City":"Toronto","DealType":"Renewal","Area":"2212","DID":"69c3af0527014fd99d1ccf156c0bce93","MID":"69c3af0527014fd99d1ccf156c0bce93"}',
        '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"106PORTW","City":"Toronto","DealType":"0","Area":"","DID":"04aedb01da5d44fead7e1315115c2530","MID":"04aedb01da5d44fead7e1315115c2530"}'
    ]
    
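    # Strip the single quotes from the list's string representation.
    data_to_dump = re.sub(r"'", "", str(resultlist))
    # Note: json.dumps receives a str here, so it emits one JSON string literal
    # (with escaped quotes), not an array of objects.
    json_data = json.dumps(data_to_dump)
    print(json_data)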
    

    【Comments】:

    • Don't use the re module. Parse the JSON strings properly with json.loads.
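To illustrate the comment above, here is a minimal sketch comparing the two approaches. The `rows` list is a shortened, made-up sample (not the asker's full data), and the variable names are chosen only for this illustration.

```python
import json
import re

# Hypothetical, shortened sample of what toJSON().collect() returns:
# a plain Python list of JSON strings.
rows = [
    '{"LeaseType":"Offer to Lease","City":"Edmonton"}',
    '{"LeaseType":"Offer to Renew","City":"Vancouver"}',
]

# Regex approach: strip the quotes from the list's repr, then dump that text.
stripped = re.sub(r"'", "", str(rows))
as_string = json.dumps(stripped)

# Parsing approach: turn each JSON string into a dict, then dump the list.
as_array = json.dumps([json.loads(r) for r in rows])

# Decoding both results shows the difference in what was actually encoded.
decoded_string = json.loads(as_string)
decoded_array = json.loads(as_array)
print(type(decoded_string).__name__)  # str
print(type(decoded_array).__name__)   # list
```

The regex version encodes a single string rather than an array of objects, and it would also strip any apostrophe that appears inside a value.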
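    【Solution 2】:

    You have a list of JSON strings, so if you want the whole list as a single JSON block, you can load each JSON string back into a Python dictionary and then serialize the whole list: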

    import json
    
    resultlist_json = [json.loads(x) for x in resultlist] 
    print(json.dumps(resultlist_json, sort_keys=True, indent=4))
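A self-contained variant of the snippet above, as a sketch: `resultlist` here is a shortened, hypothetical stand-in for the collected rows, and `records` and `pretty` are names picked for the example. One thing to be aware of is that `sort_keys=True` orders the keys alphabetically. Leaving it out keeps the key order of each input string, which is probably closer to the layout requested in the question.

```python
import json

# Shortened stand-in for result.toJSON().collect(); the field subset is
# chosen for illustration only.
resultlist = [
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","City":"Edmonton"}',
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","City":"Vancouver"}',
]

# Parse every JSON string into a Python dict.
records = [json.loads(line) for line in resultlist]

# Serialize the whole list once; without sort_keys the key order of the
# input strings is preserved.
pretty = json.dumps(records, indent=4)
print(pretty)
```

If the result is written to a file rather than printed, `json.dump(records, f, indent=4)` with an open file handle `f` does the same serialization directly.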
    
