Adding a comma between JSON objects in a datafile with Python? (Answers)

【Question title】: Adding a comma between JSON objects in a datafile with Python?
【Posted】: 2021-12-30 22:25:19
【Question description】:

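I have a large file (about 3 GB) that looks like JSON but is not valid, because it is missing the commas (,) between the "observations", i.e. the JSON objects. There are about 2 million of these objects in the data file.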

For example, this is what I have:

{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000+00:00",
    "type": {
        "$numberLong": "0"
    }
}

{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000+00:00",
    "type": {
        "$numberLong": "0"
    }
}

This is what I want (commas between the "observations"):

{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "username": "Alex",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
},

{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "Alice",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
}

This is what I tried, but it does not put commas where I need them:

import re

with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write('    '+line)
    output.write("]")

How can I add a comma between every JSON object in the data file?

【Comments】:

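Note that even with the commas it still won't be valid JSON.
What you do have is a valid stream of JSON objects. The jq utility can easily turn it into a single JSON array of objects: jq -s '.' dataframe.txt. There is a jq binding for Python, but unfortunately it has to read the entire JSON into memory. Ideally you would use a streaming JSON library for Python that can also handle a stream of objects, but I don't have a good recommendation.
If a line consists of nothing but a closing brace, add a comma after it. But don't do that for the last such line.
Given the size, and provided the file is exactly in this format, I would choose sed or awk over Python if speed matters at all. Follow John Gordon's suggestion, then either trim the final comma or have sed/awk check whether the next line has content.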

Tags: python json python-re txt

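【Solution 1】:

This solution assumes that no field in the JSON contains { or }.

If we assume there is at least one blank line between the JSON dictionaries, here is one idea: keep a running count of unclosed braces ({) in unclosed_count; when we reach a blank line while that count is zero, we add a comma, exactly once.

Like this: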

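with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0  # number of '{' seen so far that are not yet closed by '}'
    comma_after_zero_added = True  # avoids a comma before the first object and repeated commas
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            # the nesting depth changed on this line, so a separator may be needed again
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            # blank line right after a completed object: write a comma in its place
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")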
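One caveat with the loop above: the comma is written as soon as a blank line follows a finished object, so a file that ends with a blank line gets a trailing comma right before the closing ], which JSON parsers reject. Below is a minimal sketch of a variant that postpones the comma until the next object actually starts. The function name add_commas and the choice to work on any iterable of lines are assumptions made for illustration, blank lines are dropped from the output, and the same restriction on { and } inside string values still applies.

```python
def add_commas(lines):
    """Yield output lines that wrap brace-balanced objects in a JSON array.

    Assumes no string value contains '{' or '}' (same premise as the answer).
    """
    depth = 0           # number of '{' not yet matched by '}'
    need_comma = False  # True once an object has closed and no comma was emitted yet
    yield "[\n"
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no data; skip them
        if depth == 0 and need_comma:
            # a new top-level object starts: emit the separator now
            yield ",\n"
            need_comma = False
        depth += line.count("{") - line.count("}")
        yield line
        if depth == 0:
            need_comma = True  # object finished; comma only if another one follows
    yield "]\n"
```

With two open file handles src and dst, this could be used as dst.writelines(add_commas(src)), which processes the file line by line without loading it into memory.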

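【Comments】:

What if a field in the JSON contains { or }?
That doesn't look like the OP's case, but you are right: if it happens, my solution will not work. That should be stated, so I'll add it to my answer.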

【Solution 2】:

Assuming you have enough memory, you can use json.JSONDecoder.raw_decode directly to parse one object at a time, instead of using json.loads.

>>> import json
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b":2}\n'  # Hypothetical output of open("dataframe.txt").read()
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)

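The output of raw_decode is a tuple containing the first decoded JSON value and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder and calls its decode method, which in turn just calls raw_decode and artificially raises an exception if the first decoded value does not consume the entire input.)

There is some extra work involved: decoding cannot start on whitespace, so you have to take the returned index and skip over any whitespace found there to locate the start of the next value.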
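The index bookkeeping described above can be sketched as a small generator. The name iter_json_values is hypothetical, and the sketch assumes the whole text is already in memory as a single str.

```python
import json


def iter_json_values(text):
    """Yield each JSON value from a string of whitespace-separated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so step over it first
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # returns the decoded value and the index just past it
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

Passing the start index to raw_decode, rather than slicing the string, avoids copying the remaining text after every record.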

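【Comments】:

【Solution 3】:

Another way to look at the data is that you have multiple JSON records separated by whitespace. You can use the stdlib JSONDecoder to read each record, strip the whitespace, and repeat until done. The decoder reads one record from a string and tells you how far it got. Apply it iteratively to the data until everything has been consumed. This is far less risky than making a bunch of assumptions about what data the JSON itself contains.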

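import json

def json_record_reader(filename):
    # Reads the whole file at once, so the full text has to fit in memory
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        # raw_decode returns the parsed value and the index just past it
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        # Drop the consumed record and the whitespace before the next one
        # (note: slicing copies the remaining text on every iteration)
        txt = txt[pos:].lstrip()
    return result

print(json_record_reader("data.json"))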

Given the size of the file, memory-mapping the text file may be a better choice.
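As one way to keep memory use bounded, here is a minimal sketch that does not use memory mapping but swaps in plain chunked reads: it keeps a small text buffer, decodes as many objects as the buffer holds, and reads more when an object is still incomplete. The names iter_json_objects and write_json_array and the chunk_size parameter are hypothetical. The sketch assumes every top-level value is an object, so a truncated value always fails to decode. Malformed data in the middle of the file is only reported once the end of the file is reached.

```python
import io
import json


def iter_json_objects(fileobj, chunk_size=1 << 20):
    """Yield JSON objects from a text file object, reading it chunk by chunk."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fileobj.read(chunk_size)  # empty string means end of file
        buf += chunk
        pos = 0
        while True:
            # raw_decode cannot start on whitespace, so skip it
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos >= len(buf):
                break
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # nothing left to read: the data really is malformed
                break      # object is probably incomplete; read another chunk
            yield obj
        buf = buf[pos:]    # keep only the unparsed tail
        if not chunk:
            return


def write_json_array(src, dst):
    """Re-serialize the object stream from src as one JSON array into dst."""
    dst.write("[")
    for i, obj in enumerate(iter_json_objects(src)):
        if i:
            dst.write(",\n")
        json.dump(obj, dst)
    dst.write("]")


# Small in-memory demonstration; io.StringIO stands in for the real files
demo = io.StringIO('{"a": 1}\n\n{"b": {"c": [1, 2]}}\n')
records = list(iter_json_objects(demo, chunk_size=5))

out = io.StringIO()
write_json_array(io.StringIO('{"a": 1}\n\n{"b": 2}\n'), out)
```

Because the records are parsed and written back out, the original formatting is lost, and an object with a repeated key keeps only the last value for that key.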

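【Comments】:

【Solution 4】:

If you are sure that the only place a blank line can occur is between two dictionaries, you can keep your current idea once its implementation is fixed. For each line, check whether it is empty. If it is not, write it out as is. If it is, write a comma instead: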

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")

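If you cannot guarantee that every blank line should be replaced by a comma, you need a different approach. What you want is to replace a closing brace, followed by a blank line (or any run of whitespace), followed by an opening brace, with },{.

In addition to the current line, you can keep track of the previous two lines. If the three of them are, in order, "}", "" and "{", write a comma before writing the "{".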

from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)  # stripped contents of the two previous lines
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            # "}" + blank line + "{": a new object starts, so put a comma before it
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
    output_file.write("]")  # close the array opened above

Or, if you want to stick with regular expressions, you can do this:

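import re

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()  # loads the whole file into memory

# Insert a comma after every "}" that is followed by whitespace and then "{"
repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)

with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)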

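Here the regex r"\}(\s+)\{" matches the pattern we are looking for: \s+ matches one or more whitespace characters and captures them in group 1, which we then reuse in the replacement string as \1.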
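To see what the substitution does, here is a small sketch on in-memory strings; the sample data and the name PATTERN are made up for illustration. It also shows a limitation: the pattern knows nothing about JSON string values, so a "} {" inside a text field is rewritten as well.

```python
import json
import re

PATTERN = re.compile(r'\}(\s+)\{')

# Two objects separated by a blank line (made-up sample data)
sample = '{"a": 1}\n\n{"b": 2}\n'
fixed = PATTERN.sub(r'},\1{', sample)
# The whitespace between the objects is preserved; only a comma is added
assert fixed == '{"a": 1},\n\n{"b": 2}\n'

# Wrapping the result in brackets gives a parseable JSON array
records = json.loads("[" + fixed + "]")

# Limitation: a string value that contains "} {" is altered too
tricky = '{"text": "} {"}'
altered = PATTERN.sub(r'},\1{', tricky)
```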

Note that you have to read the whole file into memory and run re.sub over all of it, which will be slow.

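【Comments】:

Just curious: have you ever successfully run a meaningful re.sub on a 1 GB+ string? I'm not joking, I'm genuinely curious.
@YevgeniyKosmak No, and I wouldn't recommend it. That is why it is the second (now third) approach in my answer, and why I warn that it will be slow.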