【Question title】: Corrupted record showing in pyspark after creating a DataFrame from a saved JSON file
【Posted】: 2017-01-24 08:21:22
【Question】:

I saved JSON data from a URL into a JSON file named urljson.json in my Spark folder, then ran the following code to create a DataFrame from it:

path="urljson.json/"
testdf1=spark.read.json(path)
testdf1.show()

I get this:

After running

testdf1.printSchema()

the schema is shown as follows:

root
 |-- _corrupt_record: string (nullable = true)

How can I fix this? Any guidance would be appreciated. I am using Spark 2.0.

My JSON data looks like this. It is very large, so I have posted only part of it:

result:[{"BldgID":"1006AVE ","BldgName":"100-6th Avenue SW (Oddfellows)          ","BldgCity":"Calgary             ","BldgState":"AB ","BldgZip":"T2G 2C4  ","BldgAddress1":"100-6th Avenue Southwest                ","BldgAddress2":"ZZZ None","BldgPhone":"4035439600     ","BldgLandlord":"1006AV","BldgLandlordName":"100-6 TH Avenue SW Inc.                                     ","BldgManager":"AVANDE","BldgManagerName":"Alyssa Van de Vorst           ","BldgManagerType":"Internal","BldgGLA":"34242","BldgEntityID":"1006AVE ","BldgInactive":"N","BldgPropType":"ZZZ None","BldgPropTypeDesc":"ZZZ None","BldgPropSubType":"ZZZ None","BldgPropSubTypeDesc":"ZZZ None","BldgRetailFlag":"N","BldgEntityType":"REIT                     ","BldgCityName":"Calgary             ","BldgDistrictName":"Downtown            ","BldgRegionName":"Western Canada                                    ","BldgAccountantID":"KKAUN     ","BldgAccountantName":"Kendra Kaun                   ","BldgAccountantMgrID":"LVALIANT  ","BldgAccountantMgrName":"Lorretta Valiant                        ","BldgFASBStartDate":"2012-10-24","BldgFASBStartDateStr":"2012-10-24"},{"BldgID":"1007AVE ","BldgName":"100-7th Avenue Southwest-Art Central    ","BldgCity":"Calgary             ","BldgState":"AB ","BldgZip":"T2P 0W4  ","BldgAddress1":"100-7th Avenue Southwest                ","BldgAddress2":"ZZZ None","BldgPhone":"4035439600     ","BldgLandlord":"1007AV","BldgLandlordName":"100-7th Avenue SW (Art Central) Inc.                        
","BldgManager":"LPATER","BldgManagerName":"Lyndsey Paterson              ","BldgManagerType":"Internal","BldgGLA":"27127","BldgEntityID":"1007AVE ","BldgInactive":"N","BldgPropType":"ZZZ None","BldgPropTypeDesc":"ZZZ None","BldgPropSubType":"ZZZ None","BldgPropSubTypeDesc":"ZZZ None","BldgRetailFlag":"N","BldgEntityType":"Property Under Dev't     ","BldgCityName":"Calgary             ","BldgDistrictName":"Downtown            ","BldgRegionName":"Western Canada                                    ","BldgAccountantID":"ABRITTON  ","BldgAccountantName":"Angie Britton                 ","BldgAccountantMgrID":"ZZZ None","BldgAccountantMgrName":"ZZZ None","BldgFASBStartDate":"2011-09-01","BldgFASBStartDateStr":"2011-09-01"},{"BldgID":"100LOMB ","BldgName":"100 Lombard Street                      ","BldgCity":"Toronto             ","BldgState":"ON ","BldgZip":"M5C 1M3  ","BldgAddress1":"100 Lombard Street                      ","BldgAddress2":"ZZZ None","BldgPhone":"4169779002     ","BldgLandlord":"100LOM","BldgLandlordName":"100 Lombard Street Inc.                                     
","BldgManager":"TCHALM","BldgManagerName":"Tiffany Chalmers              ","BldgManagerType":"Internal","BldgGLA":"43697.64","BldgEntityID":"100LOMB ","BldgInactive":"N","BldgPropType":"ZZZ None","BldgPropTypeDesc":"ZZZ None","BldgPropSubType":"ZZZ None","BldgPropSubTypeDesc":"ZZZ None","BldgRetailFlag":"N","BldgEntityType":"REIT                     ","BldgCityName":"Toronto             ","BldgDistrictName":"Queen - Richmond    ","BldgRegionName":"Central Canada                                    ","BldgAccountantID":"MALLORDE  ","BldgAccountantName":"May Ann Allorde               ","BldgAccountantMgrID":"TTSANG    ","BldgAccountantMgrName":"Tony Tsang                              ","BldgFASBStartDate":"2005-11-01","BldgFASBStartDateStr":"2005-11-01"},{"BldgID":"10190104","BldgName":"10190-104th Street NW-The Metals Buildi ","BldgCity":"Edmonton            ","BldgState":"AB ","BldgZip":"T5J 1A7  ","BldgAddress1":"10190-104st Street SW                   ","BldgAddress2":"ZZZ None","BldgPhone":"7804234400     ","BldgLandlord":"10190 ","BldgLandlordName":"10190-104 Street Inc.                                       ","BldgManager":"NEWWES","BldgManagerName":"New West Enterprise Property  ","BldgManagerType":"Third   ","BldgGLA":"20447.75","BldgEntityID":"10190104","BldgInactive":"N","BldgPropType":"ZZZ None","BldgPropTypeDesc":"ZZZ None","BldgPropSubType":"ZZZ None","BldgPropSubTypeDesc":"ZZZ None","BldgRetailFlag":"N","BldgEntityType":"REIT                     ","BldgCityName":"Edmonton            ","BldgDistrictName":"Edmonton            ","BldgRegionName":"Western Canada                                    ","BldgAccountantID":"RYANG     ","BldgAccountantName":"Raymond Yang                  ","BldgAccountantMgrID":"LVALIANT  ","BldgAccountantMgrName":"Lorretta Valiant                        ","BldgFASBStartDate":"2011-08-08","BldgFASBStartDateStr":"2011-08-08"}]

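【Comments】:

  • It is hard to tell without knowing what your JSON file looks like.
  • The problem may be that the JSON document is not on a single line and that it contains newline characters.
  • Please post your JSON file for more details. Most of the time the problem is exactly what @RajatMishra described!
  • I have posted the JSON data. It is a very large set, so I have posted only part of it.

Tags: json apache-spark pyspark spark-dataframe pyspark-sql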


【Solution 1】:

Checking the partial JSON you provided at http://jsonlint.com/ results in an error: not a valid JSON

Removing result: from the partial JSON and checking it again at http://jsonlint.com/ results in: valid JSON
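
The same check can be reproduced locally with Python's standard json module. This is only a minimal sketch: the sample string is a shortened, hypothetical stand-in for the posted data, and is_valid_json is a helper name introduced here for illustration.

```python
import json

# Hypothetical, shortened stand-in for the posted data (not the real file).
sample = 'result:[{"BldgID":"1006AVE ","BldgState":"AB "}]'


def is_valid_json(text):
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# With the leading "result:" label the text is not JSON; without it, it is.
with_prefix = is_valid_json(sample)
without_prefix = is_valid_json(sample[len("result:"):])
print(with_prefix, without_prefix)  # prints: False True
```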

Note that in your case, even removing "result:" from the JSON input may not produce valid Spark JSON input, because Spark supports only a limited form of JSON:

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

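JSON Datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either an RDD of String or a JSON file.

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.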
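
One way to meet that requirement is to rewrite the saved file as JSON Lines before handing it to Spark. The sketch below uses only Python's standard library and assumes the file has the shape shown in the question: a result: label followed by a single JSON array of objects. The function name to_json_lines and the shortened sample are hypothetical, introduced here for illustration.

```python
import json

PREFIX = "result:"  # the non-JSON label seen at the start of the posted data


def to_json_lines(text):
    """Convert 'result:[{...}, {...}]' into one JSON object per line."""
    text = text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    # strict=False tolerates raw line breaks inside string values.
    records = json.loads(text, strict=False)
    # json.dumps escapes control characters, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical shortened sample with a raw line break inside a string value.
sample = (
    'result:[{"BldgID":"1006AVE ","BldgLandlordName":"100-6 TH Avenue SW Inc. \n"},'
    '{"BldgID":"1007AVE ","BldgState":"AB "}]'
)
lines = to_json_lines(sample).split("\n")
print(len(lines))  # prints: 2
```

Writing the returned text to a new file and passing that file's path to spark.read.json() should then give one row per building instead of a single _corrupt_record column. On Spark 2.2 and later the JSON reader also has a multiLine option for reading a whole multi-line JSON document, but the result: label would still have to be removed first, and that option is not available in Spark 2.0.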

【Comments】:
