Create multiple dictionaries from 4 lists using regex: answers

【Question title】: Create multiple dictionaries from 4 lists using regex
【Posted】: 2021-01-07 15:37:24
【Question description】:

I have the following txt file:

197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645
71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498
180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330

I want to write a function that converts these lines into multiple dictionaries, one dictionary per line:

example_dict = {"host":"146.204.224.152", "user_name":"feest6811", "time":"21/Jun/2019:15:45:24 -0700", "request":"POST /incentivize HTTP/1.1"}

So far I have managed to build 4 lists that cover all the entries, but I don't know how to create one dictionary per line:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        host = (re.findall('(.*?)\-',logdata))
        username = re.findall('\-(.*?)\[',logdata)
        time = re.findall('\[(.*?)\]', logdata)
        request = re.findall('\"(.*?)\"',logdata)
        #for line in range(len(logdata)):
            #dc = {'host':host[line], 'user_name':user_name[line], 'time':time[line], 'request':request[line]}

【Question comments】:

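So you want a list of dictionaries? What error does your commented-out code currently give?
It says syntax error. No, I want one dictionary for each line of the text file.
Could you copy and paste the exact syntax error into the question?
Try putting all the parts into a single regex. With each group in a fixed position, you cannot get the false positives that occur when you match one small part at a time. regex101.com can help you see the matched groups against your test data.
Sorry, it says: "list index out of range"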
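A note on the "list index out of range" message: the following is a minimal sketch, assuming the commented-out loop used consistent variable names. Because `logdata` holds the whole file as one string, `len(logdata)` counts characters rather than lines, so the loop index quickly runs past the end of the much shorter lists. The sample string is built from two lines of the question, and only `time` and `request` are used to keep it short.

```python
import re

# Two sample lines from the question, joined the way file.read() returns them
logdata = (
    '100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048\n'
    '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645\n'
)

time = re.findall(r'\[(.*?)\]', logdata)
request = re.findall(r'"(.*?)"', logdata)

# len(logdata) is the number of characters; the lists have one entry per line
print(len(logdata), len(time), len(request))

# zip pairs the lists element by element and stops at the shortest one,
# so no index can go out of range
pairs = [{'time': t, 'request': r} for t, r in zip(time, request)]
print(pairs)
```

Looping over `range(len(time))` instead of `range(len(logdata))` would avoid the error in the same way.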

Tags: python regex list dictionary

【Solution 1】:

Once you have sorted out the regex problems you are running into, the code below will work for you:

import re

result = []
with open('data.txt') as f:
    lines = [l.strip() for l in f.readlines()]
    for logdata in lines:
        # re.findall returns a list, so each value stored below is a list of matches
        host = re.findall(r'(.*?)\-', logdata)
        username = re.findall(r'\-(.*?)\[', logdata)
        _time = re.findall(r'\[(.*?)\]', logdata)
        request = re.findall(r'"(.*?)"', logdata)
        result.append({'host': host, 'user_name': username,
                       'time': _time, 'request': request})
print(result)
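Since `re.findall` returns lists, the dictionaries built per line end up holding lists rather than plain strings. A minimal sketch of one way to get strings instead, using `re.search` and taking the first group, is shown here. The helper name `first_group` and the tightened patterns are my own assumptions, not part of the answer above; the sample line is taken from the question.

```python
import re

# One sample line from the question
line = '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645'


def first_group(pattern, text):
    """Return group 1 of the first match, or None if nothing matches (hypothetical helper)."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


entry = {
    'host': first_group(r'^(\S+) -', line),         # anchored to the start of the line
    'user_name': first_group(r'- (\S+) \[', line),  # token between "- " and " ["
    'time': first_group(r'\[(.*?)\]', line),        # text inside the square brackets
    'request': first_group(r'"(.*?)"', line),       # text inside the double quotes
}
print(entry)
```

Returning `None` for a missing field keeps the loop running on malformed lines; whether that is the right behaviour depends on how clean the log file is.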

【Comments】:

【Solution 2】:

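Using str.split() and str.index() also works and removes the need for regular expressions. You can also iterate over the file handle directly; it yields one line at a time, so you don't have to load the whole file into memory: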

result = []

with open('logdata.txt') as f:
    for line in f:
        # Isolate host and user_name, discarding the dash in between
        host, _, user_name, remaining = line.split(maxsplit=3)

        # Find the end of the datetime and isolate it
        end_bracket = remaining.index(']')
        time_ = remaining[1:end_bracket]

        # Keep only the text between the double quotes, dropping the
        # numbers that trail the request on each line
        request = remaining[end_bracket + 1:].split('"')[1]

        # Create the dictionary
        result.append({
            'host': host,
            'user_name': user_name,
            'time': time_,
            'request': request
        })

from pprint import pprint
pprint(result)
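To make the slicing steps easier to follow, here is a small sketch that applies the same split-and-index idea to a single sample line from the question. The request is reduced to the text between the quotes, matching the format asked for in the question. The names `status` and `size` are assumptions for the two trailing numbers; the question does not say what they represent.

```python
# One sample line from the question, including the trailing newline a file would yield
line = '71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498\n'

# Split on whitespace at most 3 times: host, the dash, user name, and the rest
host, _, user_name, remaining = line.split(maxsplit=3)

# The timestamp sits between the leading "[" and the first "]"
end_bracket = remaining.index(']')
time_ = remaining[1:end_bracket]

# Split from the right to peel off the two trailing numbers
# (assumed here to be a status code and a response size)
quoted, status, size = remaining[end_bracket + 1:].rsplit(maxsplit=2)

# Remove surrounding whitespace first, then the double quotes
request = quoted.strip().strip('"')

print({'host': host, 'user_name': user_name, 'time': time_, 'request': request})
```

The hyphen inside `/cutting-edge` causes no trouble here because the line is never split on hyphens.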

【Comments】:

【Solution 3】:

The following code snippet produces a list of dictionaries, one for each line of the log file.

import re


def parse_log(log_file):
    # Groups: host, user name, time, request (without the surrounding quotes)
    regex = re.compile(r'^([0-9\.]+) - (.*) \[(.*)\] "(.*)"')
    
    def _extract_field(match_object, tag, index, result):
        if match_object[index]:
            result[tag] = match_object[index]

    result = []
    with open(log_file) as fh:
        for line in fh:
            match = re.search(regex, line)
            if match:
                fields = {}
                _extract_field(match, 'host'     , 1, fields)
                _extract_field(match, 'user_name', 2, fields)
                _extract_field(match, 'time'     , 3, fields)
                _extract_field(match, 'request'  , 4, fields)
                result.append(fields)

    return result


def main():
    result = parse_log('log.txt')

    for line in result:
        print(line)


if __name__ == '__main__':
    main()

【Comments】:

【Solution 4】:

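The following function, based on your original question, returns a list of dictionaries containing the required keys/values for each line of assets/logdata.txt.

Note that proper error handling should be added on top of this, since there are obvious edge cases that could make the code stop unexpectedly.

Pay attention to the change in your host pattern; it matters. The original pattern in your example matched more than just the host part of each line. Adding an anchor at the start of the pattern, together with re.MULTILINE, stops the false positives that the original pattern produced by also matching against the rest of each line.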

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    host = re.findall(r'^(.*?)\-', logdata, re.MULTILINE)
    username = re.findall(r'\-(.*?)\[', logdata)
    time = re.findall(r'\[(.*?)\]', logdata)
    request = re.findall(r'"(.*?)"', logdata)
    # Build one dictionary per line; strip the spaces the patterns capture around host and user name
    return [{"host": host[i].strip(), "user_name": username[i].strip(), "time": time[i], "request": request[i]}
            for i in range(len(host))]

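The above is a simple, minimal solution based on your original post. There are many cleaner and more efficient ways to solve this, but I think working from your existing code shows you how to correct it, rather than just handing you a better-optimized solution that might mean little to you.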
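The effect of the anchor described above can be checked on a small in-memory sample. This is only a sketch: the two lines are taken from the question, and the variable names `unanchored` and `anchored` are my own.

```python
import re

# Two sample lines from the question; the first contains extra hyphens
# ("-0700" and "/cutting-edge") that trip up the unanchored pattern
logdata = (
    '71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498\n'
    '180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330\n'
)

# Original pattern: matches every stretch of text that ends at a hyphen
unanchored = re.findall(r'(.*?)\-', logdata)

# Anchored pattern: with re.MULTILINE, ^ matches at the start of every line,
# so only the text before the first hyphen of each line is captured
anchored = re.findall(r'^(.*?)\-', logdata, re.MULTILINE)

print(len(unanchored), len(anchored))
print([h.strip() for h in anchored])
```

The unanchored list has more entries than there are lines, which is why indexing the four lists in parallel goes wrong with the original pattern.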

【Comments】:

【Solution 5】:

I am taking this course right now, and the answer I came up with is

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()

    # YOUR CODE HERE

    # Verbose pattern: unescaped whitespace is ignored, so literal spaces are escaped
    pattern = r'''
    (?P<host>[\w.]*)
    (\ -\ )
    (?P<user_name>([a-z\-]*[\d]*))
    (\ \[)
    (?P<time>\w.*?)
    (\]\ \")
    (?P<request>\w.*)
    (\")
    '''

    lst = []

    # groupdict() returns only the named groups: host, user_name, time, request
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        lst.append(item.groupdict())
    print(lst)
    return lst

【Comments】: