【Question title】: Reading logs using regular expression
【Posted】: 2019-07-14 03:24:43
【Question description】:

I have a .txt file that contains request logs in the following format:

time_namelookup: 0,121668 
time_connect: 0,460643 
time_pretransfer: 0,460755 
time_redirect: 0,000000 
time_starttransfer: 0,811697 
time_total: 0,811813 
-------------
time_namelookup: 0,121665 
time_connect: 0,460643 
time_pretransfer: 0,460355 
time_redirect: 0,000000 
time_starttransfer: 0,813697 
time_total: 0,811853 
-------------
time_namelookup: 0,121558 
time_connect: 0,463243 
time_pretransfer: 0,460755 
time_redirect: 0,000000 
time_starttransfer: 0,911697 
time_total: 0,811413 

I want to build a list of values for each category, so I thought regular expressions might be relevant here.

import re

'''
In this example, I save only the 'time_namelookup' parameter.
The same logic is adapted for the other parameters.
'''

namelookup = []
with open('shaghai_if_config_test.txt', 'r') as fh:
     for line in fh.readlines():
         number_match = re.match('([+-]?([0-9]*[,])?[0-9]+)',line)
         namelookup_match = re.match('^time_namelookup:', line)
         if namelookup_match and number_match:
             num = number_match.group(0)
             namelookup.append(num)
             continue

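I find this logic overly complicated because I have to run two regex matches. In addition, number_match does not match the line, whereas ^time_namelookup: ([+-]?([0-9]*[,])?[0-9]+) works fine.

I am looking for experienced advice on the procedure described above. Any suggestions are appreciated.

【Question discussion】:

    Tags: python regex logging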


    【Solution 1】:

    My guess is that you have already designed a good expression, which we might modify slightly:

    (time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)
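
On the question's point that number_match never matches: re.match only succeeds at the very start of the string, and every log line starts with the field name rather than a digit. The following is a minimal sketch of that behavior and of a single-pattern alternative with one capture group. The sample lines are copied from the question; the names number_only and combined are illustrative, not from the original post.

```python
import re

# A few sample lines copied from the question's log.
lines = [
    "time_namelookup: 0,121668 \n",
    "time_connect: 0,460643 \n",
    "-------------\n",
    "time_namelookup: 0,121665 \n",
]

number_only = re.compile(r"[+-]?(?:\d*,)?\d+")
combined = re.compile(r"^time_namelookup:\s*([+-]?(?:\d*,)?\d+)")

# re.match anchors at position 0, where the line has "t", not a digit,
# so the number-only pattern cannot match.
assert number_only.match(lines[0]) is None

# One pattern does both jobs: select the field and capture its value.
namelookup = []
for line in lines:
    m = combined.match(line)
    if m:
        namelookup.append(m.group(1))

print(namelookup)  # ['0,121668', '0,121665']
```

Using re.search instead of re.match would also let the number-only pattern find the value, but a single anchored pattern keeps the field check and the capture in one place.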
    

    Test with re.findall:

    import re
    
    regex = r"(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)"
    
    test_str = ("time_namelookup: 0,121668 \n"
        "time_connect: 0,460643 \n"
        "time_pretransfer: 0,460755 \n"
        "time_redirect: 0,000000 \n"
        "time_starttransfer: 0,811697 \n"
        "time_total: 0,811813 \n")
    
    print(re.findall(regex, test_str))
    

    Output

    [('time_namelookup', '0,121668'), ('time_connect', '0,460643'), ('time_pretransfer', '0,460755'), ('time_redirect', '0,000000'), ('time_starttransfer', '0,811697'), ('time_total', '0,811813')]
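
Since the question asks for one list of values per category, the (name, value) tuples returned by re.findall can be grouped afterwards. This is a sketch under the assumption that the values should become floats; the variable names log_text and values are illustrative, and the sample text is a shortened copy of the question's log.

```python
import re
from collections import defaultdict

regex = r"(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)"

# Shortened sample of the log from the question.
log_text = (
    "time_namelookup: 0,121668 \n"
    "time_total: 0,811813 \n"
    "-------------\n"
    "time_namelookup: 0,121665 \n"
    "time_total: 0,811853 \n"
)

values = defaultdict(list)
for name, number in re.findall(regex, log_text):
    # The log uses a decimal comma, which float() does not accept,
    # so swap it for a dot first.
    values[name].append(float(number.replace(",", ".")))

print(dict(values))
# {'time_namelookup': [0.121668, 0.121665], 'time_total': [0.811813, 0.811853]}
```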
    

    Test with re.finditer:

    import re
    
    regex = r"(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)"
    
    test_str = ("time_namelookup: 0,121668 \n"
        "time_connect: 0,460643 \n"
        "time_pretransfer: 0,460755 \n"
        "time_redirect: 0,000000 \n"
        "time_starttransfer: 0,811697 \n"
        "time_total: 0,811813 \n"
        "-------------\n"
        "time_namelookup: 0,121665 \n"
        "time_connect: 0,460643 \n"
        "time_pretransfer: 0,460355 \n"
        "time_redirect: 0,000000 \n"
        "time_starttransfer: 0,813697 \n"
        "time_total: 0,811853 \n"
        "-------------\n"
        "time_namelookup: 0,121558 \n"
        "time_connect: 0,463243 \n"
        "time_pretransfer: 0,460755 \n"
        "time_redirect: 0,000000 \n"
        "time_starttransfer: 0,911697 \n"
        "time_total: 0,811413 ")
    
    matches = re.finditer(regex, test_str, re.MULTILINE)
    
    for matchNum, match in enumerate(matches, start=1):
    
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
        for groupNum in range(0, len(match.groups())):
            groupNum = groupNum + 1
    
            print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
    

    The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it.

    RegEx circuit

    jex.im visualizes regular expressions.

    【Discussion】:

      【Solution 2】:

      You can make this easier by looping over a list of the left-hand field names you want to capture:

      import re
      
      lst = ['time_namelookup', 'time_connect', 'time_pretransfer', 'time_redirect', 'time_starttransfer', 'time_total']
      
      result = []
      for x in lst:
          result.append(re.findall(f'{x}: (.*)', s))
      
      print(result)
      

      Here s is the contents of your text file.
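
A variant of the same loop, sketched with a few assumptions: the result is keyed by field name so each list can be looked up directly, the pattern captures `\S+` so the trailing space is not included, and s is an inline sample rather than the real file (in practice it would come from something like open(path).read()).

```python
import re

# Inline sample standing in for the file contents.
s = (
    "time_namelookup: 0,121668 \n"
    "time_connect: 0,460643 \n"
    "-------------\n"
    "time_namelookup: 0,121665 \n"
    "time_connect: 0,460643 \n"
)

names = ['time_namelookup', 'time_connect']

# One findall per field; re.MULTILINE lets ^ match at the start of every line.
result = {
    name: re.findall(rf"^{re.escape(name)}:\s*(\S+)", s, flags=re.MULTILINE)
    for name in names
}

print(result)
# {'time_namelookup': ['0,121668', '0,121665'], 'time_connect': ['0,460643', '0,460643']}
```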

      【Discussion】:

        【Solution 3】:

        You can also combine itertools.groupby with str.split for a non-regex solution:

        from itertools import groupby
        data = [i.strip('\n') for i in open('filename.txt')]
        new_data = [[a, list(b)] for a, b in groupby(data, key=lambda x:x.startswith('time'))]
        results = [dict(i.split(': ') for i in b) for a, b in new_data if a]
        

        Output:

        [{'time_namelookup': '0,121668 ', 'time_connect': '0,460643 ', 'time_pretransfer': '0,460755 ', 'time_redirect': '0,000000 ', 'time_starttransfer': '0,811697 ', 'time_total': '0,811813 '}, 
         {'time_namelookup': '0,121665 ', 'time_connect': '0,460643 ', 'time_pretransfer': '0,460355 ', 'time_redirect': '0,000000 ', 'time_starttransfer': '0,813697 ', 'time_total': '0,811853 '}, 
         {'time_namelookup': '0,121558 ', 'time_connect': '0,463243 ', 'time_pretransfer': '0,460755 ', 'time_redirect': '0,000000 ', 'time_starttransfer': '0,911697 ', 'time_total': '0,811413 '}]
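
The result above is one dict per log block, while the question asks for one list per category. A possible follow-up step, sketched on a shortened copy of that output (the names records and columns are illustrative), transposes the dicts and converts the decimal-comma strings to floats:

```python
# Shortened copy of the list of dicts produced above.
records = [
    {'time_namelookup': '0,121668 ', 'time_total': '0,811813 '},
    {'time_namelookup': '0,121665 ', 'time_total': '0,811853 '},
]

columns = {}
for record in records:
    for key, raw in record.items():
        # Strip the trailing space and turn the decimal comma into a dot.
        columns.setdefault(key, []).append(float(raw.strip().replace(",", ".")))

print(columns)
# {'time_namelookup': [0.121668, 0.121665], 'time_total': [0.811813, 0.811853]}
```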
        

        【Discussion】:

          【Solution 4】:

          If the format is this simple, here is another idea: read the file with a CSV parser, using the colon as the delimiter. Example:

          import csv
          import itertools
          from pprint import pprint as print
          
          file = 'log.txt'
          with open(file) as fp:
              reader = csv.reader(fp, delimiter=':')
              # filter out delimiter lines
              rows = [r for r in reader if len(r) == 2]
              # group pairs by first element to a dict of lists
              grouped = {k: [x[1] for x in v] for k, v
                         in itertools.groupby(sorted(rows), key=lambda x: x[0])}
              print(grouped)
          

          This gives you:

          {'time_connect': [' 0,460643 ', ' 0,460643 ', ' 0,463243 '],
           'time_namelookup': [' 0,121558 ', ' 0,121665 ', ' 0,121668 '],
           'time_pretransfer': [' 0,460355 ', ' 0,460755 ', ' 0,460755 '],
           'time_redirect': [' 0,000000 ', ' 0,000000 ', ' 0,000000 '],
           'time_starttransfer': [' 0,811697 ', ' 0,813697 ', ' 0,911697 '],
           'time_total': [' 0,811413 ', ' 0,811813 ', ' 0,811853 ']}
          

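          If you need further processing, do it in the dict comprehension, for example to parse the numbers. The log uses a decimal comma, so it is replaced with a dot before calling float:

          grouped = {k: [float(x[1].strip().replace(',', '.')) for x in v] for k, v
                     in itertools.groupby(sorted(rows), key=lambda x: x[0])}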
          

          Output:

          {'time_connect': [0.460643, 0.460643, 0.463243],
           'time_namelookup': [0.121558, 0.121665, 0.121668],
           'time_pretransfer': [0.460355, 0.460755, 0.460755],
           'time_redirect': [0.0, 0.0, 0.0],
           'time_starttransfer': [0.811697, 0.813697, 0.911697],
           'time_total': [0.811413, 0.811813, 0.811853]}
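
Once the values are floats, simple per-category statistics need only the standard library. This is a sketch that assumes a grouped dict shaped like the output above (shortened to two categories here); the name summary is illustrative.

```python
import statistics

# Shortened copy of the parsed output above.
grouped = {
    'time_redirect': [0.0, 0.0, 0.0],
    'time_total': [0.811413, 0.811813, 0.811853],
}

# Mean, minimum, and maximum for every category.
summary = {
    name: {
        'mean': statistics.mean(vals),
        'min': min(vals),
        'max': max(vals),
    }
    for name, vals in grouped.items()
}

for name, stats in summary.items():
    print(name, stats)
```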
          

          pandas

          If you have pandas at hand, you can use it to read the log as CSV, which saves you the trouble of parsing and grouping the data. Example:

          import pandas as pd
          # decimal=',' because the log writes numbers with a decimal comma
          df = pd.read_csv('log.txt', delimiter=':', header=None, names=['Name', 'Num'], decimal=',').dropna().reset_index(drop=True)
          print(df)
          

          This prints the parsed data, ready to use:

                            Name       Num
          0      time_namelookup  0.121668
          1         time_connect  0.460643
          2     time_pretransfer  0.460755
          3        time_redirect  0.000000
          4   time_starttransfer  0.811697
          5           time_total  0.811813
          6      time_namelookup  0.121665
          7         time_connect  0.460643
          8     time_pretransfer  0.460355
          9        time_redirect  0.000000
          10  time_starttransfer  0.813697
          11          time_total  0.811853
          12     time_namelookup  0.121558
          13        time_connect  0.463243
          14    time_pretransfer  0.460755
          15       time_redirect  0.000000
          16  time_starttransfer  0.911697
          17          time_total  0.811413
          

          Now do whatever you want with the data, for example reshape the data frame to get a more structured view:

          df['chunk'] = df.index // df.Name.unique().size
          print(df.pivot(values='Num', columns='Name', index='chunk'))
          
          # Output:
          
          Name   time_connect  time_namelookup  time_pretransfer  time_redirect  time_starttransfer  time_total
          chunk                                                                                                
          0          0.460643         0.121668          0.460755            0.0            0.811697    0.811813
          1          0.460643         0.121665          0.460355            0.0            0.813697    0.811853
          2          0.463243         0.121558          0.460755            0.0            0.911697    0.811413
          

          Compute statistics for a selected timing:

          print(df[df.Name == 'time_total'].describe())
          
          # Output:
          
                      Num
          count  3.000000
          mean   0.811693
          std    0.000243
          min    0.811413
          25%    0.811613
          50%    0.811813
          75%    0.811833
          max    0.811853
          

          And so on.

          【Discussion】:
