Parse JSON structures in a txt file containing JSON and text structures

【Question title】: Parse JSON structures in a txt file containing JSON and text structures
【Posted】: 2019-06-24 01:49:42
【Question】:

I have a txt file that contains JSON structures. The problem is that the file contains not only JSON structures but also raw text, such as log entries:

2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 : 
{
"name": "1111",
"results": [{
    "filename": "xxxx",
    "numberID": "7412"
}, {
    "filename": "xgjhh",
    "numberID": "E52"
}]
}

2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
    "filename": "hhhhh",
    "numberID": "478962"
}, {
    "filename": "jkhgfc",
    "number": "12544"
}]
}

I read the .txt file, but I get an error when I try to parse the JSON structures:

import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
   json_data = json.load(f)

Output: json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

I want to parse the JSON and save it as a CSV file.

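【Comments】:

Does every non-JSON line start with a date and time? You could use a regex to find all lines that start with "{number}-{number}-{number} " and pass all the lines between them to json.loads()
The first line is a datetime. Between each JSON structure, the text starts with the character | followed by a datetime, like this: |2019-01-18 21:00:11.7022|INFO|Technical|Got entity profile for 372245 in 0.43 seconds| 2019-01-18 21:00:11.8897|INFO|Technical|Got the following profile for entity 372514 : { and then another JSON structure starts ...

Tags: python json

【Solution 1】:

You can count the curly braces in the file to find where each JSON structure begins and ends, and store the parsed structures in a list, here found_jsons.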

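import json

# read the whole file; "data.txt" is the file name used in the question
with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()

open_chars = 0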
saved_content = []

found_jsons = []

for i in content.splitlines():
    open_chars += i.count('{')

    if open_chars:
        saved_content.append(i)

    open_chars -= i.count('}')


    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []


for i in found_jsons:
    print(json.dumps(i, indent=4))

Output:

{
    "results": [
        {
            "numberID": "7412",
            "filename": "xxxx"
        },
        {
            "numberID": "E52",
            "filename": "xgjhh"
        }
    ],
    "name": "1111"
}
{
    "results": [
        {
            "numberID": "478962",
            "filename": "hhhhh"
        },
        {
            "number": "12544",
            "filename": "jkhgfc"
        }
    ],
    "name": "jfkjgjkf"
}

【Comments】:

Note that this only works if the non-JSON lines do not contain curly braces.
It also breaks if the JSON contains a string with unmatched braces, which is possible in the filename field.
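
One way around both caveats is to let the JSON parser itself decide where an object ends. The sketch below is not part of the answer above; it uses json.JSONDecoder.raw_decode from the standard library, which parses one value starting at a given index and returns it together with the index where it stopped. The function name extract_json_objects and the sample text are made up for illustration. Note one remaining limitation: a log line containing a literal {} would still be picked up as an empty object.

```python
import json

def extract_json_objects(text):
    """Yield every JSON object found in text, skipping the non-JSON content around it."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find('{', pos)
        if start == -1:
            return
        try:
            # raw_decode returns the parsed value and the index just past it
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # this '{' does not start valid JSON; move on to the next one
            pos = start + 1
            continue
        yield obj
        pos = end

# Hypothetical sample: a log line with braces, and a string value with an unmatched brace
sample = (
    '2019-01-18 21:00:08.8740|INFO|Technical|Got {odd} text :\n'
    '{\n"name": "1111",\n"results": [{"filename": "a}b", "numberID": "7412"}]\n}\n'
    '2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|\n'
)
objects = list(extract_json_objects(sample))
print(objects)
```

Because the decoder tracks string boundaries, the } inside "a}b" does not end the object early, and the {odd} fragment in the log line is rejected and skipped.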

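【Solution 2】:

You can do one of the following:

On the command line, remove all lines that contain "|INFO|Technical" (assuming this appears in every line of raw text). The | is left unescaped because it is a literal character in sed's basic regular expressions, whereas GNU sed treats \| as alternation:
sed -i '' -e '/|INFO|Technical/d' yourfilename (if on a Mac),
sed -i '/|INFO|Technical/d' yourfilename (if on Linux).
Move those raw lines into their own JSON field.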

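【Comments】:

I'd guess that using sed as a preprocessing step before reading the JSON in Python will perform better than doing everything in Python.
Even after removing all known non-JSON content, the rest of the file will not parse as a valid JSON object, because it effectively becomes multiple JSON objects. Also note that this approach only works when the non-JSON content follows a known, consistent pattern.
You still need to read multiple JSON objects from one file, which is kind of a pain.

【Solution 3】:

A more general solution for parsing a file in which JSON objects are mixed with other content, without making any assumptions about the non-JSON content, is to split the file content into fragments at the curly braces, start with the first fragment that is an opening curly brace, and then join the remaining fragments one by one until the joined string parses as JSON: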

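import json
import re

# read the whole file; "data.txt" is the file name used in the question
with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    fragments = iter(re.split('([{}])', f.read()))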
while True:
    try:
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        break

This outputs:

{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}

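【Solution 4】:

Use the "text structures" as delimiters between the JSON objects.

Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point you parse the saved lines as a JSON object.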

import re
import json

def is_text(line):
    # returns True if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|') # you said some lines start with a leading |, remove it
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []

with open("data.txt") as f:
    json_lines = []

    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there's multiple text lines in a row json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []

    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))

print(json_objects)

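Repeating the logic in the last two lines is a bit ugly, but you need to handle the case where the last line of the file is not a text line, so once the for loop is done you need to parse the last object in json_lines (if there is one).

I'm assuming there is never more than one JSON object between text lines, and my date regex will break in about 8000 years.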
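
The question also asks for saving the result as a CSV file, which none of the answers cover. Below is a minimal sketch of one possible layout, assuming one CSV row per entry in "results" with the parent "name" repeated on each row; that layout, and the helper names to_rows and write_csv, are assumptions rather than anything stated in the question. The json_objects literal is a stand-in for the list built above, with values copied from the sample data.

```python
import csv
import io

# Stand-in for the parsed json_objects list (values copied from the question's sample)
json_objects = [
    {"name": "1111", "results": [{"filename": "xxxx", "numberID": "7412"},
                                 {"filename": "xgjhh", "numberID": "E52"}]},
    {"name": "jfkjgjkf", "results": [{"filename": "hhhhh", "numberID": "478962"},
                                     {"filename": "jkhgfc", "number": "12544"}]},
]

def to_rows(objects):
    """Flatten each object into one row per results entry, repeating the parent name."""
    rows = []
    for obj in objects:
        for result in obj.get("results", []):
            rows.append({"name": obj.get("name"), **result})
    return rows

def write_csv(rows, out):
    """Write rows to the file-like object out, using the union of all keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in columns that a given row lacks (e.g. "number" vs "numberID")
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

rows = to_rows(json_objects)
buffer = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a real file
write_csv(rows, buffer)
print(buffer.getvalue())
```

Collecting the column names from all rows matters here because the sample entries do not share the same keys: the last one has "number" where the others have "numberID".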

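【Solution 5】:

This solution strips out the non-JSON structures and wraps the JSON structures in a containing JSON structure. This should do the job for you. I'm posting the code first for expediency and will edit my answer afterwards with a clearer explanation: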

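import json

with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    # drop the log lines, strip the JSON lines, and turn blank lines into split markers
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-' for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')

# join the non-empty pieces with ', ' and wrap them in one containing JSON object
wrapped = '{"entries":[' + ', '.join([entry for entry in cleaned if entry]) + ']}'
json_data = json.loads(wrapped)

For the sample data, wrapped is:

{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}

What is going on here?

In the cleaned = ... line, we use a list comprehension to build a list of the lines in the file (f.readlines()) that do not contain the string |INFO|, stripping each line and adding the string '-split_here-' to the list wherever there is a blank line (.strip() yields '').

Then we convert that list of lines into a single string (''.join()).

Finally, we convert that string into a list of strings (.split('-split_here-')), separating the JSON structures from each other at the blank lines that delimit them in data.txt.

In the wrapped = ... line, we use a list comprehension to drop the empty pieces left over from consecutive or trailing blank lines, and join the remaining JSON structures with ', ' (', '.join()).

Then we wrap the string in '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and feed it to json.loads to load the data into a Python object.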
