How to parse a multiline string which looks like a table into a list of dictionaries? (answers)

[Question title]: How to parse a multiline string which looks like a table into a list of dictionaries?
[Posted]: 2021-12-27 09:40:56
[Question description]:

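I have a sample multiline string below that has a table-like structure. I need to parse that structure and convert it into key-value pairs, so that each key is a column header and each value is that row's value for the column. I tried a regular expression, but it does not work properly.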

Please find the string below:

Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 P0          PEM Vout        Normal          12   V DC      na
 P0          PEM Vin         Normal          242  V AC      na
 P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)
 P0          Temp: PEM Out   Normal          30   Celsius   (80 ,90 ,95 ,100)(Celsius)
 R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P0          Temp: FC FAN0   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P1          Temp: FC FAN1   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)

Expected output:

[{'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal', 'Reading': '5 A', 'Threshold': 'na'}, ...]

I used the following regex pattern:

r'^(?P<Slot>[^\s]+)[ \t]+(?P<Sensor>[a-zA-Z0-9:]+ [a-z0-9A-Z.:-]* [a-z0-9]*)[ \t]+(?P<State>[a-zA-Z]*)[ \t]+'

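[Comments]:

Take each line and split it on \s{3,} - see a demo on regex101.com.
@Jan I believe that would also split values inside a single column, such as 5    A...
Are the columns always the same width?

Tags: python regex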
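
To make the objection in the comments concrete, here is a small sketch that applies the \s{3,} split to two rows from the question. The variable names are only for illustration. Both rows come back with six fields instead of five, because the padding inside the Reading column is itself three or more spaces wide.

```python
import re

rows = [
    " P0          PEM Iout        Normal          5    A         na",
    " P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)",
]

# Split each row on runs of three or more whitespace characters
split_rows = [re.split(r"\s{3,}", row.strip()) for row in rows]

for fields in split_rows:
    # The Reading value ends up in two separate fields
    print(len(fields), fields)
```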

[Solution 1]:

If the columns always have the same width:

pfb="""Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 P0          PEM Vout        Normal          12   V DC      na
 P0          PEM Vin         Normal          242  V AC      na
 P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)
 P0          Temp: PEM Out   Normal          30   Celsius   (80 ,90 ,95 ,100)(Celsius)
 R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P0          Temp: FC FAN0   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P1          Temp: FC FAN1   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)"""

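# Skip the three alarm counters, the blank line, the header and the dashed line
for line in pfb.splitlines()[6:]:
    # Slice boundaries follow the column starts in the dashed line (1, 13, 29, 45, 60)
    slot      = line[ 0:13].strip()
    sensor    = line[13:29].strip()
    current   = line[29:45].strip()
    reading   = line[45:60].strip()
    threshold = line[60:  ].strip()

    # Use the parts and process some fields further
    ...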
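
A minimal sketch of one way to turn the fixed-width slices into the list of dictionaries the question asks for, assuming the column boundaries 1, 13, 29, 45 and 60 read off the dashed line in the sample. The names COLUMNS and parse_fixed_width are hypothetical, and collapsing the padding inside Reading is only there to match the '5 A' form in the question's expected output.

```python
# Column name -> (start, end) slice; boundaries assumed from the dashed line in the sample
COLUMNS = {
    "Slot": (1, 13),
    "Sensor": (13, 29),
    "Current State": (29, 45),
    "Reading": (45, 60),
    "Threshold": (60, None),
}


def parse_fixed_width(text, skip=6):
    """Return one dict per data row; `skip` is the number of lines before the data."""
    result = []
    for line in text.splitlines()[skip:]:
        if not line.strip():
            continue  # ignore blank lines
        row = {name: line[start:end].strip() for name, (start, end) in COLUMNS.items()}
        # Collapse inner runs of spaces, e.g. '5    A' -> '5 A'
        row["Reading"] = " ".join(row["Reading"].split())
        result.append(row)
    return result


sample = "\n".join([
    "Number of Critical alarms:  0",
    "Number of Major alarms:     0",
    "Number of Minor alarms:     0",
    "",
    " Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)",
    " ----------  --------------  --------------- ------------   ---------------------------------------",
    " P0          PEM Iout        Normal          5    A         na",
    " R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)",
])

parsed = parse_fixed_width(sample)
print(parsed)
```

Hard-coded offsets are simple to read, but they have to be updated by hand if the device ever prints the table with different column widths.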

[Comments]:

[Solution 2]:

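Apart from the dashes line (---), it is hard to find a usable pattern here. I would do some manual work:

"Strip" the first lines before the table.
Check the size (enclosing indexes) of each column according to the dashes line.
Slice the lines according to the extracted indexes.
strip the surrounding whitespace.
Save the headers with the current row to a dict by zipping them.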

import re

s = """Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

 Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)
 ----------  --------------  --------------- ------------   ---------------------------------------
 P0          PEM Iout        Normal          5    A         na
 P0          PEM Vout        Normal          12   V DC      na
 P0          PEM Vin         Normal          242  V AC      na
 P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)
 P0          Temp: PEM Out   Normal          30   Celsius   (80 ,90 ,95 ,100)(Celsius)
 R0          Temp: FC FANS   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P0          Temp: FC FAN0   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)
 P1          Temp: FC FAN1   Fan Speed 60%   23   Celsius   (25 ,35 ,0  )(Celsius)"""

# "strip" the first lines
lines = s.splitlines()[4:]

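# extract the start index of each column from the dashes line.
# search the unstripped line so the indexes stay aligned with the data rows
# (stripping it first shifts every index by one and cuts e.g. 'Celsius' to 'Celsiu').
# add None to cover the right edge
indexes = [m.start() for m in re.finditer(r'-+', lines[1])] + [None]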
# zip the indexes into couples of start-finish
start_finish_indexes = list(zip(indexes, indexes[1:]))
# extract the headers according to the indexes
headers = [lines[0][start:finish].strip() for start, finish in start_finish_indexes]

res = []
for line in lines[2:]:
    # same as with the headers
    columns = [line[start:finish].strip() for start, finish in start_finish_indexes]
    # add a dict with keys as headers and the values are the values of the row
    res.append(dict(zip(headers, columns)))

print(res)

Gives:

[{'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal', 'Reading': '5    A', 'Threshold(Minor,Major,Critical,Shutdown)': 'na'}, ...]
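
The result above differs slightly from the expected output in the question: the Reading value keeps its inner padding, and the last key carries the parenthesised suffix. A small post-processing sketch follows, assuming that collapsing the whitespace and cutting each header at the first '(' is what is wanted. normalize_row is a hypothetical helper, not part of the answer.

```python
def normalize_row(row):
    """Trim each key at the first '(' and collapse whitespace runs in 'Reading'."""
    cleaned = {}
    for key, value in row.items():
        short_key = key.split("(", 1)[0].strip()
        if short_key == "Reading":
            # '5    A' -> '5 A'
            value = " ".join(value.split())
        cleaned[short_key] = value
    return cleaned


row = {
    "Slot": "P0",
    "Sensor": "PEM Iout",
    "Current State": "Normal",
    "Reading": "5    A",
    "Threshold(Minor,Major,Critical,Shutdown)": "na",
}

normalized = normalize_row(row)
print(normalized)
# {'Slot': 'P0', 'Sensor': 'PEM Iout', 'Current State': 'Normal', 'Reading': '5 A', 'Threshold': 'na'}
```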

[Comments]:

I would scan for the line with ---- and use it as the cut-off rather than a fixed number
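
That suggestion could look roughly like the sketch below, which searches for the first line made up only of dashes and spaces instead of hard-coding the number of lines to skip. The helper names find_rule_index and parse_table and the regular expression are assumptions, not part of the answer, and the sketch assumes the header sits directly above the dashed line.

```python
import re


def find_rule_index(lines):
    """Return the index of the first line consisting only of dash runs and spaces."""
    for i, line in enumerate(lines):
        if re.fullmatch(r"\s*-+(?:\s+-+)*\s*", line):
            return i
    raise ValueError("no dashed separator line found")


def parse_table(text):
    lines = text.splitlines()
    rule = find_rule_index(lines)
    # Each column starts where a run of dashes starts; None covers the right edge
    starts = [m.start() for m in re.finditer(r"-+", lines[rule])] + [None]
    spans = list(zip(starts, starts[1:]))
    # Assumption: the header is the line directly above the dashed line
    headers = [lines[rule - 1][a:b].strip() for a, b in spans]
    return [
        dict(zip(headers, (line[a:b].strip() for a, b in spans)))
        for line in lines[rule + 1:]
        if line.strip()
    ]


sample = "\n".join([
    "Number of Critical alarms:  0",
    "",
    " Slot        Sensor          Current State   Reading        Threshold(Minor,Major,Critical,Shutdown)",
    " ----------  --------------  --------------- ------------   ---------------------------------------",
    " P0          PEM Iout        Normal          5    A         na",
    " P0          Temp: PEM In    Normal          34   Celsius   (80 ,90 ,95 ,100)(Celsius)",
])

table = parse_table(sample)
print(table)
```

Because the cut-off is found by content, the same function keeps working if the number of alarm summary lines above the table changes.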