【Question title】: How to parse a multiline string which looks like a table into a list of dictionaries?
【Posted】: 2021-12-27 09:40:56
【Question】:

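I have a sample multiline string below with a table-like structure. I need to parse that structure into key-value pairs, so that each key is a column header and each value is that row's value for the column. I tried a regular expression, but it does not work properly.

Please find the string below: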

Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 P0          PEM Vout        Normal          12   V DC      na
 P0          PEM Vin         Normal          242  V AC      na
 P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)
 P0          Temp: PEM Out   Normal          30   Celsius   (80 ,90 ,95 ,100)(Celsius)
 R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P0          Temp: FC FAN0   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P1          Temp: FC FAN1   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)

Expected output:

[{'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal', 'Reading': '5 A', 'Threshold': 'na'}, ...]

I used the following regex pattern:

r'^(?P<Slot>[^\s]+)[ \t]+(?P<Sensor>[a-zA-Z0-9:]+ [a-z0-9A-Z.:-]* [a-z0-9]*)[ \t]+(?P<State>[a-zA-Z]*)[ \t]+'

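【Comments】:

  • Take each line and split it on \s{3,}; see a demo on regex101.com
  • @Jan I believe that would also split the 5 A that belongs to a single column...
  • Are the columns always the same width?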
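
To illustrate the objection raised in the comments, here is a small sketch that applies the suggested `\s{3,}` split to the first data row. The variable names are only for illustration. Because the number and the unit in the Reading column are separated by four spaces, the split yields six fields instead of five.

```python
import re

line = " P0          PEM Iout        Normal          5    A         na"

# Splitting on runs of three or more whitespace characters also splits
# inside the Reading column, because "5" and "A" are four spaces apart.
fields = re.split(r"\s{3,}", line.strip())
print(fields)  # ['P0', 'PEM Iout', 'Normal', '5', 'A', 'na']
```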

Tags: python regex


【Solution 1】:

If the columns always have the same width:

pfb="""Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 P0          PEM Vout        Normal          12   V DC      na
 P0          PEM Vin         Normal          242  V AC      na
 P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)
 P0          Temp: PEM Out   Normal          30   Celsius   (80 ,90 ,95 ,100)(Celsius)
 R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P0          Temp: FC FAN0   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P1          Temp: FC FAN1   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)"""

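# Skip the three alarm lines, the blank line, the header, and the dashes line
for line in pfb.splitlines()[6:]:
  # Each slice starts where a run of dashes starts in the dashes line
  slot      = line[ 0:13].strip()
  sensor    = line[13:29].strip()
  current   = line[29:45].strip()
  reading   = line[45:60].strip()
  threshold = line[60:  ].strip()

  # Use the parts and process some fields further
  ...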
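
As a sketch of how the fixed-width slices could be turned into the list of dictionaries the question asks for: the names `FIELDS` and `parse_fixed_width` are hypothetical, the column boundaries are read off the dashes line of the sample, and the code assumes every row follows exactly that layout. Only the Reading value is whitespace-collapsed, so that `5    A` becomes `5 A` as in the expected output.

```python
# Column boundaries read off the dashes line of the sample table.
# Assumption: every data row follows exactly this fixed-width layout.
FIELDS = [
    ("Slot", 1, 13),
    ("Sensor", 13, 29),
    ("Current State", 29, 45),
    ("Reading", 45, 60),
    ("Threshold", 60, None),
]


def parse_fixed_width(text, skip=6):
    """Return one dict per data row; `skip` is the number of lines before the data."""
    rows = []
    for line in text.splitlines()[skip:]:
        if not line.strip():
            continue  # ignore blank lines
        row = {name: line[start:end].strip() for name, start, end in FIELDS}
        # Collapse the padding inside the reading: "5    A" -> "5 A"
        row["Reading"] = " ".join(row["Reading"].split())
        rows.append(row)
    return rows


sample = """Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)"""

rows = parse_fixed_width(sample)
print(rows)
```

Keeping the boundaries in one table makes it easier to adjust them if the device output changes, but the approach still breaks as soon as a column width differs from the sample.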

【Comments】:

    【Solution 2】:

    Apart from the dashes line (---), it is hard to find a usable pattern here. I would do some manual work:

    import re
    
    s = """Number of Critical alarms:  0
    Number of Major alarms:     0
    Number of Minor alarms:     0
    
     Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
     ----------  --------------  --------------- ------------   ---------------------------------------
     P0          PEM Iout        Normal          5    A         na
     P0          PEM Vout        Normal          12   V DC      na
     P0          PEM Vin         Normal          242  V AC      na
     P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)
     P0          Temp: PEM Out   Normal          30   Celsius   (80 ,90 ,95 ,100)(Celsius)
     R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
     P0          Temp: FC FAN0   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
     P1          Temp: FC FAN1   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)"""
    
    # "strip" the first lines
    lines = s.splitlines()[4:]
    
    # each run of dashes in the dashes line marks where a column starts.
    # the line is not stripped, so the offsets also match the data lines,
    # which keep their leading space; None lets the last slice run to the end
    indexes = [m.start() for m in re.finditer(r'-+', lines[1])] + [None]
    # zip the indexes into pairs of (start, finish)
    start_finish_indexes = list(zip(indexes, indexes[1:]))
    # extract the headers according to the indexes
    headers = [lines[0][start:finish].strip() for start, finish in start_finish_indexes]
    
    res = []
    for line in lines[2:]:
        # same as with the headers
        columns = [line[start:finish].strip() for start, finish in start_finish_indexes]
        # add a dict with keys as headers and the values are the values of the row
        res.append(dict(zip(headers, columns)))
    
    print(res)
    

    This gives:

    [{'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal', 'Reading': '5    A', 'Threshold(Minor,Major,Critical,Shutdown)': 'na'}, ...]
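
This output differs from the expected output in the question in two ways: the Reading value keeps its inner padding, and the last key carries the full header text. A minimal post-processing sketch follows. The helper `normalize_row` is hypothetical, and it assumes that everything from the first `(` in a header can be dropped.

```python
def normalize_row(row):
    """Hypothetical helper: tidy one row dict produced by the header-driven parser."""
    cleaned = {}
    for key, value in row.items():
        # "Threshold(Minor,Major,Critical,Shutdown)" -> "Threshold"
        short_key = key.split("(", 1)[0].strip()
        cleaned[short_key] = value
    # Collapse the padding inside the Reading value: "5    A" -> "5 A"
    if "Reading" in cleaned:
        cleaned["Reading"] = " ".join(cleaned["Reading"].split())
    return cleaned


row = {'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal',
       'Reading': '5    A', 'Threshold(Minor,Major,Critical,Shutdown)': 'na'}
print(normalize_row(row))
# {'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal', 'Reading': '5 A', 'Threshold': 'na'}
```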
    

    【Comments】:

    • I would scan for the line with ---- and use that as the cut-off instead of a fixed number
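
One way this suggestion could look, as a sketch rather than a definitive implementation: the function name `parse_table` is hypothetical, and the code assumes that the first line consisting only of dashes and spaces is the separator and that the header sits directly above it. If no such line exists, `next` raises `StopIteration`.

```python
import re


def parse_table(text):
    """Sketch: locate the dashes line, then slice every row at the dash-run starts."""
    lines = text.splitlines()
    # The separator is the first non-blank line made up only of dashes and spaces.
    sep = next(i for i, line in enumerate(lines)
               if line.strip() and set(line.strip()) <= {"-", " "})
    # Each run of dashes marks where a column begins.
    starts = [m.start() for m in re.finditer(r"-+", lines[sep])]
    spans = list(zip(starts, starts[1:] + [None]))
    # The header line is assumed to sit directly above the separator.
    headers = [lines[sep - 1][a:b].strip() for a, b in spans]
    return [
        dict(zip(headers, (line[a:b].strip() for a, b in spans)))
        for line in lines[sep + 1:]
        if line.strip()
    ]


sample = """Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)"""

table = parse_table(sample)
print(table)
```

With this, the number of preamble lines no longer matters, so extra alarm lines above the table would not shift the result.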