【Question title】: JSON File not reading pandas
【Posted】: 2019-03-31 17:59:22
【Question】:

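I have a JSON file (about 1 GB) containing acoustic features of music tracks. I am trying to read it into my pandas notebook with

    dataf = "/home/work/my.json"
    d = json.load(open(dataf, 'r'))

but it keeps giving me an error saying

Extra data: line 2 column 1 (char 499)

I understand that character 499 is where the next track begins, but I have looked online and am not sure how to read the file in. Below is a sample of the data.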

{"_id":{"$oid":"5b2cff21aecd2a723459cd65"},"id":1,"sp_id":"0XLOf9LhyazPX9Ld8jPiUq","danceability":0.7079999999999999627,"energy":0.60999999999999998668,"key":"2","loudness":-4.522000000000000002416,"mode":"1","speechiness":0.057399999999999999634,"acousticness":0.02040000000000000001465,"instrumentalness":4.449999999999997457E-06,"liveness":0.064100000000000004197,"valence":0.0349999999999997,"tempo":123.0379999999999967,"time_signature":"4","track_uri":"spotify:track:0XLOf9LhyazPX9Ld8jPiUq"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd66"},"id":2,"sp_id":"7aF09WaavZAmAWuUeYxlYD","danceability":0.59299999999999997158,"energy":0.86799999999999999378,"key":"1","loudness":-3.57299999999999999538,"mode":"0","speechiness":0.2949999999999998446,"acousticness":0.182999999999999996,"instrumentalness":0.0,"liveness":0.36499999999999999112,"valence":0.4959999999999999645,"tempo":104.9879999999999645,"time_signature":"4","track_uri":"spotify:track:7aF09WaavZAmAWuUeYxlYD"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd67"},"id":3,"sp_id":"0tKcYR2II1VCQWT79i5NrW","danceability":0.5999999999999999778,"energy":0.81000000000000005329,"key":"0","loudness":-4.748999999999999666,"mode":"1","speechiness":0.04789999999999998135,"acousticness":0.006830000000000000001335,"instrumentalness":0.2099999999999999223,"liveness":0.11549999999999999889,"valence":0.2979999999999998712,"tempo":167.8799999999998712,"time_signature":"4","track_uri":"spotify:track:0tKcYR2II1VCQWT79i5NrW"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd68"},"id":4,"sp_id":"6TWSVHx6z6E42JiwloGv1k","danceability":0.5030000000000000266,"energy":0.9180000000000000019,"key":"1","loudness":-5.0099999999999997868,"mode":"1","speechiness":0.04639999999999996803,"acousticness":0.016199999999999999123,"instrumentalness":0.02440000000000000001549,"liveness":0.1859999999999999867,"valence":0.4179999999999998268,"tempo":140.0,"time_signature":"4","track_uri":"spotify:track:6TWSVHx6z6E42JiwloGv1k"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd69"},"id":5,"sp_id":"5QqyRUZeBE04yJxsD1OC0I","danceability":0.7600000000000000888,"energy":0.56100000000000005418,"key":"","loudness":-8.6969999999999991758,"mode":"1","speechiness":0.013400000000000000799,"acousticness":0.018499999999999999084,"instrumentalness":1.9400000000000000604E-05,"liveness":0.199000000000000000000021,"valence":0.12099999999999999645,"tempo":134.98300000000000409,"time_signature":"4","track_uri":"spotify:track:5QqyRUZeBE04yJxsD1OC0I"}

【Comments】:

    Tags: python json pandas jupyter-notebook


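    【Solution 1】:

    Your JSON won't parse because it is not valid JSON. The character the parser complains about is the one right after the first newline. Evidently the objects were dumped into the file one per line, and together they do not form a single valid JSON document. See: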

    >>> json.loads(s[:499])
    {'_id': {'$oid': '5b2cff21aecd2a723459cd65'},
     'id': 1,
     'sp_id': '0XLOf9LhyazPX9Ld8jPiUq',
     'danceability': 0.708,
     'energy': 0.61,
     'key': '2',
     'loudness': -4.522,
     'mode': '1',
     'speechiness': 0.0574,
     'acousticness': 0.0204,
     'instrumentalness': 4.45e-06,
     'liveness': 0.0641,
     'valence': 0.305,
     'tempo': 123.038,
     'time_signature': '4',
     'track_uri': 'spotify:track:0XLOf9LhyazPX9Ld8jPiUq'}
    >>> json.loads(s[499:973])
    {'_id': {'$oid': '5b2cff21aecd2a723459cd66'},
     'id': 2,
     'sp_id': '7aF09WaavZAmAWuUeYxlYD',
     'danceability': 0.593,
     'energy': 0.868,
     'key': '1',
     'loudness': -3.573,
     'mode': '0',
     'speechiness': 0.295,
     'acousticness': 0.183,
     'instrumentalness': 0.0,
     'liveness': 0.365,
     'valence': 0.496,
     'tempo': 104.988,
     'time_signature': '4',
     'track_uri': 'spotify:track:7aF09WaavZAmAWuUeYxlYD'}
    

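    (s is the sample input loaded into a string.) These objects were printed to the file one after another. You either have to change the syntax so that it becomes a list of objects (add square brackets and commas), or parse the file line by line, calling json.loads on each line of the input.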
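
    The error message itself can be reproduced on a tiny input, which may help confirm the diagnosis. The following is only a minimal sketch: the helper name extra_data_error and the two-object sample string are made up for illustration and are not part of the original data.

```python
import json


def extra_data_error(text):
    """Return the JSONDecodeError raised by json.loads(text), or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc
    return None


# Hypothetical sample: two objects separated by a newline, like the start of the file.
sample = '{"id": 1}\n{"id": 2}'
err = extra_data_error(sample)

# The message and the 1-based line/column where the leftover text starts.
print(err.msg, err.lineno, err.colno)
# Everything from err.pos onward is the second object, which parses fine on its own.
print(json.loads(sample[err.pos:]))
```

    The first object parses successfully; the parser only complains because something else follows it.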

    Now, don't quote me on this, but hacking your input into valid JSON is fairly easy:

    >>> len(json.loads('[' + s.replace('\n', ',') + ']'))
    5
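
    If editing the text feels too fragile (a trailing newline, for example, would leave a dangling comma inside the brackets), the standard library's json.JSONDecoder.raw_decode can walk the string one object at a time instead. This is a sketch under the assumption that the whole text still fits in memory; the helper name iter_concatenated_json is hypothetical.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample with a trailing newline, which the replace-based hack would trip over.
records = list(iter_concatenated_json('{"id": 1}\n{"id": 2}\n'))
print(len(records))
```

    Because it does not depend on newlines, this also copes with objects separated by spaces or by nothing at all.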
    

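    If the file is large, you may not want to run the string-replacement hack above and the subsequent parse in one go, because of the large memory overhead. In that case I suggest parsing your file object by object. Assuming your file contains one object per line, all you need is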

    with open(infile) as f:  # the with block closes the file once parsing is done
        dat = [json.loads(line) for line in f]
    

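    Here infile is the path to your concatenated JSON file. For a large file this will take a long time and the result will take up a lot of memory, but I expect the extra parsing overhead to be smaller this way.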
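
    Since the goal in the question is a pandas DataFrame, it may also be worth knowing that pandas.read_json accepts lines=True for files with one JSON object per line, and an optional chunksize to read them in pieces. The sketch below is not part of the original answer; the helper name read_json_lines and the two-row sample are assumptions made for illustration.

```python
import io

import pandas as pd


def read_json_lines(path_or_buf, chunksize=None):
    """Load one-JSON-object-per-line input into a single DataFrame.

    With chunksize set, the input is read in pieces of that many lines.
    """
    if chunksize is None:
        return pd.read_json(path_or_buf, lines=True)
    chunks = pd.read_json(path_or_buf, lines=True, chunksize=chunksize)
    # Each chunk is a DataFrame; stitch them back together with a fresh index.
    return pd.concat(list(chunks), ignore_index=True)


# Hypothetical two-line sample standing in for the real file.
sample = io.StringIO('{"id": 1, "energy": 0.61}\n{"id": 2, "energy": 0.868}\n')
df = read_json_lines(sample, chunksize=1)
print(df)
```

    Concatenating every chunk still ends with the full table in memory, so for a file of about 1 GB it may pay to filter or drop columns from each chunk before combining them. Note also that read_json applies its own dtype inference, so string fields that look numeric may come back as numbers.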

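    【Comments】:

      【Solution 2】:

      It looks like you are reading records that came out of a MongoDB database. The result is a set of JSON objects stored one per line, which means the file as a whole is not a valid JSON object, as @Andras pointed out.

      Reading the data directly from MongoDB would probably be more efficient.

      You can use PyMongo like this: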

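      import pandas as pd
      from pymongo import MongoClient
      
      # Connect to the MongoDB server (a local instance on the default port here).
      mdbClient = MongoClient('mongodb://localhost:27017/')
      db = mdbClient['db']        # database name
      collection = db['col']      # collection name
      
      # An empty filter matches every document in the collection.
      results = collection.find({})
      df = pd.DataFrame.from_records(results)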
      

      【Comments】:
