Parse a file for all occurrences of a string and generate key-values in JSON (answers)

【Question title】: Parse a file for all occurrences of a string and generate key-values in JSON
【Posted】: 2017-12-08 19:02:49
【Question description】:

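I have a file (https://pastebin.com/STgtBRS8) in which I need to search for all occurrences of the word "silencedetect".
I then have to generate a JSON file containing the key-values for "silence_start", "silence_end" and "silence_duration".

The JSON file should look like this: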

[
{
"id": 1,
"silence_start": -0.012381,
"silence_end": 2.2059,
"silence_duration": 2.21828
},
{
"id": 2,
"silence_start": 5.79261,
"silence_end": 6.91955,
"silence_duration": 1.12694
}
]

This is what I have tried:

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read().replace('\n', '')

for line in data:
    if "silencedetect" in data:
        #read silence_start, silence_end, and silence_duration and put in json

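I cannot associate the 3 key-value pairs with each "silencedetect". How can I parse the key-values and get them in JSON format?

【Question comments】:

Doesn't look like a csv file
@RomanPerekhrest: Yes, but I assumed it was one. It could just as well be .txt. Ignore the extension for now.

Tags: python json key-value

【Solution 1】:

You can use a regular expression for this. It worked for me: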

import re

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

# Raw string so the regex escapes (\d, \w, \[ ...) reach the re module unchanged
d = re.findall(r'silence_start: (-?\d+\.\d+)\n.*?\n?\[silencedetect @ \w{14}\] silence_end: (-?\d+\.\d+) \| silence_duration: (-?\d+\.\d+)', data)
print(d)

You can then put them into JSON:

out = [{'id': i, 'start':a[0], 'end':a[1], 'duration':a[2]} for i, a in enumerate(d)]
import json
print(json.dumps(out)) # or write to file or... whatever

Output:

'[{"duration": "2.21828", "start": "-0.012381", "end": "2.2059", "id": 0}, {"duration": "1.12694", "start": "5.79261", "end": "6.91955", "id": 1}, {"duration": "0.59288", "start": "8.53256", "end": "9.12544", "id": 2}, {"duration": "1.0805", "start": "9.64712", "end": "10.7276", "id": 3}, {"duration": "1.03406", "start": "12.6657", "end": "13.6998", "id": 4}, {"duration": "0.871519", "start": "19.2602", "end": "20.1317", "id": 5}'

Edit: fixed a bug where some matches were missed because a frame=.. line fell between the start and the end of a match.
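As a small illustration of the frame=.. problem, here is a sketch that runs a more tolerant pattern over an inline sample. The sample text is made up for the example, modelled on the line format quoted in this thread, and SAMPLE, PATTERN and parse_silences are hypothetical names. The sketch also converts the captured strings to floats and uses the key names requested in the question.

```python
import json
import re

# Made-up sample in the format quoted in this thread (not the real file).
SAMPLE = (
    "[silencedetect @ 0x7fe7a4f00ac0] silence_start: -0.012381\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_end: 2.2059 | silence_duration: 2.21828\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_start: 5.79261\n"
    "frame=22123 fps=1001 q=-0.0 size=N/A time=00:12:18.17 bitrate=N/A speed=33.4x\n"
    "[silencedetect @ 0x7fe7a4f00ac0] silence_end: 6.91955 | silence_duration: 1.12694\n"
)

# A start value, then (lazily, across lines) the next end and duration values.
PATTERN = re.compile(
    r"silence_start: (-?\d+\.\d+)"
    r".*?silence_end: (-?\d+\.\d+)"
    r" \| silence_duration: (-?\d+\.\d+)",
    re.DOTALL,  # let "." cross newlines so intervening lines are skipped
)


def parse_silences(text):
    """Return a list of dicts with 1-based ids and float values."""
    return [
        {
            "id": i,
            "silence_start": float(start),
            "silence_end": float(end),
            "silence_duration": float(duration),
        }
        for i, (start, end, duration) in enumerate(PATTERN.findall(text), start=1)
    ]


entries = parse_silences(SAMPLE)
print(json.dumps(entries, indent=4))
```

Because the lazy .*? crosses lines, a silence_start with no end of its own would be paired with the next silence_end that follows, so this sketch assumes that starts and ends strictly alternate.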

【Comments】:

【Solution 2】:

A more elaborate solution using the re.findall and enumerate functions:

import re, json

with open('volume_data.txt', 'r') as f:
    result = []
    pat = re.compile(r'(silence_start: -?\d+\.\d+).+?(silence_end: -?\d+\.\d+).+?(silence_duration: -?\d+\.\d+)')
    silence_items = re.findall(pat, f.read().replace('\n', ''))
    for i,v in enumerate(silence_items):
        d = {'id': i+1}
        d.update({pair[:pair.find(':')]: float(pair[pair.find(':')+2:]) for pair in v})
        result.append(d)

    print(json.dumps(result, indent=4))

Output:

[
    {
        "id": 1,
        "silence_end": 2.2059,
        "silence_duration": 2.21828,
        "silence_start": -0.012381
    },
    {
        "id": 2,
        "silence_end": 6.91955,
        "silence_duration": 1.12694,
        "silence_start": 5.79261
    },
    {
        "id": 3,
        "silence_end": 9.12544,
        "silence_duration": 0.59288,
        "silence_start": 8.53256
    },
    {
        "id": 4,
        "silence_end": 10.7276,
        "silence_duration": 1.0805,
        "silence_start": 9.64712
    },
    {
        "id": 5,
        "silence_end": 13.6998,
        "silence_duration": 1.03406,
        "silence_start": 12.6657
    },
    {
        "id": 6,
        "silence_end": 20.1317,
        "silence_duration": 0.871519,
        "silence_start": 19.2602
    },
    {
        "id": 7,
        "silence_end": 22.4305,
        "silence_duration": 0.801859,
        "silence_start": 21.6286
    },
    ...
]

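【Comments】:

I didn't know about indent=4, that's cool. Out of interest, how many records did you find?
@Stael, the last id is "id": 189
I missed 4 because the frame=... lines fell in the middle
@RomanPerekhrest, thanks for such a concise solution. I was wondering whether JSON preserves key order. For example, I get each element in this order: { "silence_end": 596.869, "silence_duration": 0.825079, "id": 139, "silence_start": 596.044 }. Can I get it in this order instead: id, silence_start, silence_end, silence_duration?
@Mahesh You don't need to do that, see: stackoverflow.com/questions/4515676/…

【Solution 3】:

Assuming your data is ordered, you can simply parse it as a stream, line by line, with no need for regular expressions or for loading the whole file: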

import json

parsed = []  # a list to hold our parsed values
with open("entries.dat", "r") as f:  # open the file for reading
    current_id = 1  # holds our ID
    entry = None  # holds the current parsed entry
    for line in f:  # ... go through the file line by line
        if line[:14] == "[silencedetect":  # parse the lines starting with [silencedetect
            if entry:  # we already picked up silence_start
                index = line.find("silence_end:")  # find where silence_end starts
                value = line[index + 12:line.find("|", index)].strip()  # the number after it
                entry["silence_end"] = float(value)  # store the silence_end
                # the following step is optional, instead of parsing you can just calculate
                # the silence_duration yourself with:
                # entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
                index = line.find("silence_duration:")  # find where silence_duration starts
                value = line[index + 17:].strip()  # grab the number after it
                entry["silence_duration"] = float(value)  # store the silence_duration
                # and now that we have everything...
                parsed.append(entry)  # add the entry to our parsed list
                entry = None  # blank out the entry for the next step
            else:  # find silence_start first
                index = line.find("silence_start:")  # find where silence_start, well, starts
                value = line[index + 14:].strip()  # grab the number after it
                entry = {"id": current_id}  # store the current ID...
                entry["silence_start"] = float(value)  # ... and the silence_start
                current_id += 1  # increase our ID value for the next entry

# Now that we have our data, we can easily turn it into JSON and print it out if needed
your_json = json.dumps(parsed, indent=4)  # holds the JSON, pretty-printed
print(your_json)  # let's print it...

You will get:

[
    {
        "silence_end": 2.2059, 
        "silence_duration": 2.21828, 
        "id": 1, 
        "silence_start": -0.012381
    }, 
    {
        "silence_end": 6.91955, 
        "silence_duration": 1.12694, 
        "id": 2, 
        "silence_start": 5.79261
    }, 
    {
        "silence_end": 9.12544, 
        "silence_duration": 0.59288, 
        "id": 3, 
        "silence_start": 8.53256
    }, 
    {
        "silence_end": 10.7276, 
        "silence_duration": 1.0805, 
        "id": 4, 
        "silence_start": 9.64712
    }, 
    # 
    # etc.
    # 
    {
        "silence_end": 795.516, 
        "silence_duration": 0.68576, 
        "id": 189, 
        "silence_start": 794.83
    }
]

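Keep in mind that JSON does not prescribe any key order (and neither did Python dicts before 3.7; CPython 3.6 preserved insertion order only as an implementation detail), so id will not necessarily appear first, but the data is just as valid.

I deliberately kept the initial entry creation separate so that you can use collections.OrderedDict as a drop-in replacement (i.e. entry = collections.OrderedDict({"id": current_id})) to preserve the order you want.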
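A minimal sketch of the ordering idea, using a hypothetical helper make_entry that is not part of the answer above. json.dumps writes keys in the dict's iteration order unless sort_keys=True is passed, so building each entry as an OrderedDict from a list of pairs keeps id first on any Python version.

```python
import json
from collections import OrderedDict


def make_entry(entry_id, start, end, duration):
    """Build one entry with keys in the order requested in the question."""
    # A list of pairs (rather than a dict literal) fixes the order
    # regardless of how plain dicts behave on the running interpreter.
    return OrderedDict([
        ("id", entry_id),
        ("silence_start", start),
        ("silence_end", end),
        ("silence_duration", duration),
    ])


entry = make_entry(1, -0.012381, 2.2059, 2.21828)
text = json.dumps(entry)  # keys are emitted in insertion order
print(text)
```

On Python 3.7 and later a plain dict built in the same key order serializes the same way, so OrderedDict mainly matters for older interpreters.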

【Comments】:

【Solution 4】:

import re
import json

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

# The fractional digits are matched greedily (\d+); a lazy \d+? there would
# capture only the first digit after the decimal point.
matcher = re.compile(r'(?P<g1>\[silencedetect @ \w+?\])\s+?silence_start:\s+?(?P<g2>-?\d+?\.\d+).*?\n([^\[]+?\n)?(?P=g1)\s+?silence_end:\s+?(?P<g3>-?\d+?\.\d+).+?\|\s+?silence_duration:\s+?(?P<g4>-?\d+?\.\d+).*?\n')
# finditer yields match objects, so the named groups can be read with .group()
matchiter = matcher.finditer(data)
# (1) (2)
entries = []
for i, matched in enumerate(matchiter):
    entries.append({"id": i,
                    "silence_start": float(matched.group('g2')),
                    "silence_end": float(matched.group('g3')),
                    "silence_duration": float(matched.group('g4'))})

print(json.dumps(entries))

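(1) You may want to pass flags such as re.IGNORECASE to make your script robust against changes of that kind.

(2) I prefer non-greedy quantifiers; they can affect both what is matched and how fast. Using named groups is a matter of personal taste. They can be useful if you decide to use a matcher.sub operation to reformat the read() text in one go, instead of iterating to rebuild the file text. If you can't work it out, I can add the replacement string. Otherwise I prefer the match object's .group, which is designed for this and works with whatever names you choose instead of g1, g2, g3, g4.

Overall I prefer finditer, since it is essentially designed for this kind of operation. findall yields tuples of the captured groups, which is fine, but sometimes you want information about the full match, the pattern, position indices and so on in your analysis.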
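To make the difference concrete, here is a sketch on a single made-up line in the format quoted in this thread. LINE and PAT are hypothetical names, and the group names end and duration are chosen only for this example.

```python
import re

# Made-up single line in the format quoted in this thread.
LINE = "[silencedetect @ 0x7fe7a4f00ac0] silence_end: 738.673 | silence_duration: 5.74771\n"

PAT = re.compile(
    r"silence_end:\s+(?P<end>-?\d+\.\d+)\s+\|\s+"
    r"silence_duration:\s+(?P<duration>-?\d+\.\d+)"
)

# findall: a list of tuples, one string per capture group, nothing else.
tuples = PAT.findall(LINE)

# finditer: match objects, which also expose named groups and positions.
matches = list(PAT.finditer(LINE))
first = matches[0]

print(tuples)
print(first.group("end"), first.group("duration"), first.span())
```

The match object additionally offers groupdict(), which returns the named groups as a dict and is a convenient starting point for a JSON entry.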

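Edit: I made the regex robust to any string appended after the duration number, as well as to multiple spaces. I also allowed for intervening lines; you can capture them with a named group if needed. It captures 189 occurrences; there are 190 "silence_start" entries, but the last one has no end or duration information.

【Comments】:

You've done what I did at first. Note that there is a line that looks like [silencedetect @ 0x7fe7a4f00ac0] silence_start: 732.925 frame=22123 fps=1001 q=-0.0 size=N/A time=00:12:18.17 bitrate=N/A speed=33.4x [silencedetect @ 0x7fe7a4f00ac0] silence_end: 738.673 | silence_duration: 5.74771, which I don't think you match?
Also, even after fixing a few syntax errors (there is a ' out of place, and you forgot the \ on one of your [) I get bad character in backref group name '<g1>'
I didn't take your post as a model; it actually took me a long time to write mine properly, because I'm more used to asking questions here than answering them. You have a valid point about that line, which I fail to match. Still, for consistency, I prefer to start my regex with the silencedetect part. If I can, I'll edit my post again to account for the extra characters. As for the error, it seems I didn't copy the right expression from my clipboard. Silly me.
:) Don't worry, we all make mistakes. I actually meant that you had (independently) made the same mistake I made. It's good to hear about people moving from asking to answering. I mostly answer questions because it forces me to learn things, so I hope you enjoy it.
I do. Actually, I don't plan to give up asking; it's just that I know enough about regular expressions that I tend to jump in like this ;)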