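How to read a large JSON file using Python ijson? Answers

【Question title】: How to read a large JSON file using Python ijson?
【Posted】: 2018-08-14 22:43:03
【Question description】:

I am trying to parse a large JSON file (hundreds of gigabytes) to extract information from its keys. For simplicity, consider the following example: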

import json
import random, string

# Create a random key (string.lowercase exists only in Python 2)
def random_string(length):
    return "".join(random.choice(string.lowercase) for i in range(length))

# Create the dictionary
dummy = {random_string(10): random.sample(range(1, 1000), 10) for times in range(15)}

# Dump the dictionary into a JSON file
with open("dummy.json", "w") as fp:
    json.dump(dummy, fp)

Then I use ijson in Python 2.7 to parse the file:

import ijson

file_name = "dummy.json"

with open(file_name, "r") as fp:

    for key in dummy.keys():

        print "key: ", key 

        parser = ijson.items(fp, str(key) + ".item")

        for number in parser:
            print number,

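I expected to retrieve all the numbers in the lists that correspond to the keys of the dict. However, I got

IncompleteJSONError: Incomplete JSON data

I am aware of this post: Using python ijson to read a large json file with multiple json objects, but in my case I have a single well-formed JSON file with a relatively simple schema. Any ideas on how to parse it? Thank you.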

【Comments】:

Tags: python json python-2.7 ijson

【Solution 1】:

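ijson provides an iterator interface for handling large JSON files, which lets you read the file lazily. You can process the file in small chunks and save the results somewhere else.

Iterating over ijson.parse() yields tuples of three values: prefix, event, value.

Some JSON: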

{
    "europe": [
        {"name": "Paris", "type": "city"},
        {"name": "Rhein", "type": "river"}
    ]
}

Code:

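import ijson

# FILE_PATH is a placeholder for the path to the JSON file shown above
with open(FILE_PATH, 'rb') as f:
    # ijson.parse() lazily yields (prefix, event, value) tuples
    for prefix, event, value in ijson.parse(f):
        # Map keys arrive as 'map_key' events, so only string values are printed
        if event == 'string':
            print(value)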

Output:

Paris
city
Rhein
river

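Reference: https://pypi.python.org/pypi/ijson

【Comments】:

The example above produces a dictionary, and the parser raises the error I described. It is not the same situation.
You cannot use ijson.items for large files; it does not read the whole file and throws an error.
For large files you need to be careful with the generators returned by ijson.items() or ijson.parse(). For example, avoid collecting the values with set(your_generator) or list(your_generator), since that loads everything into memory.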
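
A minimal sketch of the point made in the last comment: consume the generator one item at a time and keep only running aggregates, rather than materializing it with list() or set(). Here `record_stream` is a hypothetical stand-in for the generator returned by ijson.items(); it is not part of ijson, and the record fields are made up for illustration.

```python
def record_stream(n):
    """Hypothetical stand-in for a lazy generator such as ijson.items(fp, 'item')."""
    for i in range(n):
        yield {"id": i, "cost": i * 2}


# Aggregate while iterating: only one record is held in memory at a time.
count = 0
total_cost = 0
for record in record_stream(100_000):
    count += 1
    total_cost += record["cost"]

print(count, total_cost)
```

Calling list(record_stream(100_000)) would instead hold every record at once, which is exactly what becomes impossible when the source file is far larger than memory.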

【Solution 2】:

A sample JSON file is shown below. It contains records for two people, but it could just as well contain 2 million records.

    [
      {
        "Name" : "Joy",
        "Address" : "123 Main St",
        "Schools" : [
          "University of Chicago",
          "Purdue University"
        ],
        "Hobbies" : [
          {
            "Instrument" : "Guitar",
            "Level" : "Expert"
          },
          {
            "percussion" : "Drum",
            "Level" : "Professional"
          }
        ],
        "Status" : "Student",
        "id" : 111,
        "AltID" : "J111"
      },
      {
        "Name" : "Mary",
        "Address" : "452 Jubal St",
        "Schools" : [
          "University of Pennsylvania",
          "Washington University"
        ],
        "Hobbies" : [
          {
            "Instrument" : "Violin",
            "Level" : "Expert"
          },
          {
            "percussion" : "Piano",
            "Level" : "Professional"
          }
        ],
        "Status" : "Employed",
        "id" : 112,
        "AltID" : "M112"
      }
    ]

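I wrote code that extracts each person's record as a JSON object. The code is shown below. As written it is not a generator, but changing a few lines (yielding each record instead of printing it) would make it one.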

import json

depth = 0      # number of '{' braces that are currently open
jstr = ""      # text of the record being accumulated
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Read the file line by line
    for line in fp:
        # Track brace nesting (assumes no braces inside string values)
        for a in line:
            if a == '{':
                depth += 1
                first_curly_found = True
            elif a == '}':
                depth -= 1

        if first_curly_found:
            jstr = f'{jstr}{line}'

        # When the right curly for every left curly has been found,
        # one complete data element has been read
        if depth == 0 and first_curly_found:
            # Drop trailing whitespace and the comma that separates records
            jstr = jstr.rstrip().rstrip(',')
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
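
The "few lines" needed to turn this into a generator could look like the sketch below. It is only an illustration under stated assumptions: `iter_records` is a name I made up, every top-level record must start on its own line (as in the sample file), and braces must not appear inside string values. An in-memory io.StringIO stands in for the real file.

```python
import io
import json


def iter_records(fp):
    """Yield each top-level {...} record of a JSON array, one at a time.

    Assumptions: every record starts on its own line, and braces never
    occur inside string values.
    """
    depth = 0    # number of '{' braces that are currently open
    lines = []   # lines of the record being accumulated
    for line in fp:
        inside_record = depth > 0
        depth += line.count('{') - line.count('}')
        if inside_record or '{' in line:
            lines.append(line)
        if lines and depth == 0:
            # Record complete: drop the separating comma and parse it
            yield json.loads("".join(lines).strip().rstrip(','))
            lines = []


sample = io.StringIO(
    '[\n'
    '  {\n'
    '    "Name": "Joy",\n'
    '    "Hobbies": [\n'
    '      {"Level": "Expert"}\n'
    '    ]\n'
    '  },\n'
    '  {\n'
    '    "Name": "Mary",\n'
    '    "Hobbies": []\n'
    '  }\n'
    ']\n'
)
names = [record["Name"] for record in iter_records(sample)]
print(names)
```

Because each record is yielded as soon as its closing brace is seen, only one record's text is held in memory at a time. For real data, a proper streaming parser such as ijson is more robust than counting braces.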

【Comments】:

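【Solution 3】:

You are starting multiple parsing iterations with the same file object without resetting it. The first call to ijson works, but it leaves the file object positioned at the end of the file. The second time you pass the same object to ijson, it complains because there is nothing left to read from the file.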
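
This exhaustion is a property of file objects rather than of ijson, so it can be reproduced without it. A minimal sketch, assuming the standard-library json module and an in-memory io.StringIO as stand-ins for ijson and the real file:

```python
import io
import json

fp = io.StringIO('{"a": [1, 2], "b": [3, 4]}')

first = json.load(fp)   # reads the stream through to the end
leftover = fp.read()    # empty string: nothing remains for a second parser

try:
    json.load(fp)       # a second parse on the same object sees empty input
    second_parse_failed = False
except json.JSONDecodeError:
    second_parse_failed = True

print(repr(leftover), second_parse_failed)
```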

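Try opening the file each time you call ijson. Alternatively, seek back to the beginning of the file after each ijson call so that the file object can read your file data again.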
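
A minimal sketch of the seek-based variant, written for Python 3. The helper name `numbers_for_keys` and its parameters are hypothetical; the only ijson call used is ijson.items(), with the same key + ".item" prefix as in the question.

```python
def numbers_for_keys(path, keys):
    """Return {key: [numbers]}, rewinding the file before each ijson pass."""
    import ijson  # third-party; imported here so the sketch loads without it

    result = {}
    with open(path, "rb") as fp:
        for key in keys:
            # Rewind: the previous pass left fp at the end of the file
            fp.seek(0)
            # The lists in the question are short, so list() is fine here
            result[key] = list(ijson.items(fp, key + ".item"))
    return result
```

Note that every key costs a full pass over the file, which is expensive for a file of hundreds of gigabytes. A single pass with ijson.parse(), grouping values by their prefix, may be preferable in that case.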

【Comments】:

【Solution 4】:

If you are working with JSON in the following format, you can use ijson.items().

Sample JSON:

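[
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"},
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"}
]

import gzip
from pathlib import Path

import ijson

input_path = 'file.txt'
res = []
# Use gzip.open for .gz input; ijson accepts binary file objects
if Path(input_path).suffix[1:].lower() == 'gz':
    input_file_handle = gzip.open(input_path, mode='rb')
else:
    input_file_handle = open(input_path, 'rb')

with input_file_handle:
    # The prefix 'item' selects each element of the top-level array
    for json_row in ijson.items(input_file_handle, 'item'):
        res.append(json_row)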

【Comments】: