【Posted】: 2019-08-11 17:38:15
【Question】:
I am trying to use any combination of the Python "re" library and Python slicing to repair a malformed JSON string that Kafka writes to HDFS on Cloudera's distribution of Hadoop.
Malformed JSON:
{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":" 99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":" ","COL29":"PBU67H ","COL30":" 20000","COL31":2,"COL32":null}}"}
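Note: the double quote in the opening "json_data":"{ (just before the brace) and the double quote near the end, in null}}"}, are the only errors that need to be removed (I have tested the string without those extra quotes).
Valid, correct JSON: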
{"json_data":{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":" 99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":" ","COL29":"PBU67H ","COL30":" 20000","COL31":2,"COL32":null}}}
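I have 40,000 to 60,000 records that I need to read every hour with PySpark, and the infrastructure team says this is mine to solve.
Is there a quick and dirty way to read all the strings with Python and remove the double quotes near the beginning and the end?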
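【Comments】:
- If the offending prefix/trailer is always the same, you can simply strip them off, like loads(bad_json[14:-3]). Better still, persuade the lazy developers to fix it; this is clearly their fault.
- Using the strip approach was definitely my first attempt, but not all of the JSON strings have the same length or the same number of attributes.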
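Because the stray quotes always sit in the same two places while the record length varies, one option is to anchor on the surrounding structure instead of on character counts. The following is only a minimal sketch: repair_record and the two pattern names are hypothetical, and it assumes the wrapper key is always "json_data", the wrapped value is always a JSON object, and each record arrives as a single string.

```python
import json
import re

# Hypothetical patterns, assuming every record is wrapped as {"json_data":"{ ... }"}.
# _OPEN matches the stray quote between `"json_data":` and the inner `{`.
_OPEN = re.compile(r'^(\s*\{\s*"json_data"\s*:\s*)"(?=\{)')
# _CLOSE matches the stray quote between the inner `}` and the final outer `}`.
_CLOSE = re.compile(r'(?<=\})"(\s*\}\s*)$')


def repair_record(bad_json: str) -> str:
    """Drop the two stray quotes; a record that is already valid is left unchanged."""
    fixed = _OPEN.sub(r"\1", bad_json, count=1)
    fixed = _CLOSE.sub(r"\1", fixed, count=1)
    return fixed


# Shortened version of the record from the question.
bad = ('{"json_data":"{"table":"TEST.FUBAR","op_type":"I",'
       '"after":{"COL1":99,"COL9":null,"COL14":"x222263 "}}"}')

record = json.loads(repair_record(bad))
print(record["json_data"]["table"])
```

Neither pattern depends on the length of the record or on the number of COL attributes. One limitation to keep in mind: a valid flat record whose last string value happens to end in } would also match the closing pattern, so the sketch is only safe if the wrapped value is always an object as in the sample. In PySpark, the same function could presumably be applied per line before parsing, for example with rdd.map(repair_record) on an RDD of strings.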
Tags: python json apache-spark hadoop pyspark