Extract the rows from the text + python regex

【Question title】: Extract the rows from the text + python regex
【Posted】: 2018-03-30 05:12:19
【Question description】:

I am trying to extract whole rows from a text file, but it does not work as expected.

Sample text file content:

data = """Add TTFF LEVERERGE 30 mp -5%
Some Text, Some Text
5882950 Abc Lahd
Pos Sequence Batch datax datay dataz dataa datab
1 00061680 904834 20.35 REV 177,650 5329,50
Bundled 2-rev 42al/xyz
Neon Classic Unit 1300 abc \ 1638\48
2 00012815 55244 815 FWD 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/xyz 20 abc/xyz
3 90072815 65944 212 KRT 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/bunt 20 bunt/bal
Some Valid Text
Some More Valid Text Some More Valid Text"""

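I want all three rows as a list, so that I can extract specific values from them.

The logic is:

1. Stop the extraction before the next row starts.
2. Each row starts with a sequence number (1, 2, 3, ..., 99, etc.).
3. Treat the line starting with "Some Valid Text" as the end of the last row.

(Since the first two steps do not work yet, step 3 is not reflected in the regex passed to re.findall.)

re.findall(r'(^\d{1,2}\s.*?\n^\d)', data, re.DOTALL | re.M)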

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048\n2',
 '3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n1']

The expected result is:

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048\n',
'2 00012815 55244 815 FWD 164,720 18448,64\n    UnBundled 2-pag\n    Mathrine Classic straight Tilt 2 xyz / 23,2x23gb\n    150st/xyz 20 abc/xyz',
'3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/bunt 20 bunt/bal']

Any guidance or help on extracting the rows from the text?

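【Question discussion】:

Post the final expected result.
@RomanPerekhrest - edited, thanks for the suggestion.
OK, what if the rows start with unordered numbers such as 2 text ...., 7 text ...., 10 text, 3 text ... instead of 1 .... 2 ... 3 ...?

Tags: python regex python-3.x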

【Solution 1】:

Use the re.findall() function with a specific regex pattern:

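import re

# A row starts with a 1-2 digit number plus a space; the lookahead ends it
# just before the next numbered row or before the "Some Valid Text" line.
rows = re.findall(r'(^\d{1,2} .+?)(?=\n(?:\d+ |Some Valid Text))', data, re.DOTALL | re.M)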
print(rows)

Output:

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048', '2 00012815 55244 815 FWD 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/xyz 20 abc/xyz', '3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/bunt 20 bunt/bal']
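
To see why the lookahead is what makes this work, here is a minimal sketch on a shortened, made-up sample (the `sample` text and the variable names are illustrative assumptions, not part of the original data). The pattern from the question consumes the first digit of the next row, so that row can no longer be matched; the lookahead only checks for it.

```python
import re

# Shortened, made-up sample in the same shape as the question's data.
sample = (
    "Pos Sequence Batch\n"
    "1 0001 aaa\n"
    "detail for row 1\n"
    "2 0002 bbb\n"
    "detail for row 2\n"
    "3 0003 ccc\n"
    "detail for row 3\n"
    "Some Valid Text\n"
    "more trailing text"
)

flags = re.DOTALL | re.M

# Pattern from the question: "\n^\d" consumes the next row's first digit,
# so the scan resumes after that digit and the next row is skipped.
consuming = re.findall(r'(^\d{1,2}\s.*?\n^\d)', sample, flags)

# Lookahead variant: the next row's number (or the stop text) is only
# checked, not consumed, so every row is found.
lookahead = re.findall(
    r'(^\d{1,2} .+?)(?=\n(?:\d+ |Some Valid Text))', sample, flags
)

print(consuming)
print(lookahead)
```

Because the lookahead only tests for "digits plus a space" at the start of a line, it does not depend on the row numbers being consecutive. The flip side is that a continuation line that happens to start with a number and a space would also end the current row.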

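【Discussion】:

【Solution 2】:

If you have a regex that has to "count" as part of the pattern, I would not use a regex; you should use a parser. Regexes are for regular patterns, not for counting (although some people here create regexes I would have thought impossible).

Here is a simple, straightforward non-regex approach. The last item has to be cleaned up, because you did not provide a distinct "STOP HERE" marker. I highly doubt that ' Some Valid Text Some More Valid Text Some More Valid Text' will be part of your real text, so it does not qualify as a "stop" criterion.

The output also does not contain the terminating '\n' characters, because I used them to split the text into, well, lines. If you really need them, you can add '\n' when you join() the parts: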

data = """Add TTFF LEVERERGE 30 mp -5%
Some Text, Some Text
5882950 Abc Lahd
Pos Sequence Batch datax datay dataz dataa datab
1 00061680 904834 20.35 REV 177,650 5329,50
Bundled 2-rev 42al/xyz
Neon Classic Unit 1300 abc \ 1638\48
2 00012815 55244 815 FWD 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/xyz 20 abc/xyz
3 90072815 65944 212 KRT 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/bunt 20 bunt/bal
Some Valid Text
Some More Valid Text Some More Valid Text"""

rdata = data.split('\n')
skip_rows = rdata.index('Pos Sequence Batch datax datay dataz dataa datab')
lines = rdata[skip_rows + 1:]           # everything after the header line

i = 1       # look for this number + space at line start to see when one row is done
part = []   # collects parts that belong to one line
result = [] # holds the joined lines from part
for li in lines:
    if li.startswith(f'{i} '):            # look for linenr + space
        if part:                          # do not add empty parts
            result.append(' '.join(part)) # add joined if something in it
        part = [li]                       # start with current li for next parts
        i += 1                            # increase so we look for next one
    else:
        part.append(li)

if part:                                  # add last part if not empty
    result.append(' '.join(part))

print(result)                             # print all

Output:

['1 00061680 904834 20.35 REV 177,650 5329,50 Bundled 2-rev 42al/xyz Neon Classic Unit 1300 abc \\ 1638\x048', 
 '2 00012815 55244 815 FWD 164,720 18448,64 UnBundled 2-pag Mathrine Classic straight Tilt 2 xyz / 23,2x23gb 150st/xyz 20 abc/xyz', 
 '3 90072815 65944 212 KRT 164,720 18448,64 UnBundled 2-pag Mathrine Classic straight Tilt 2 xyz / 23,2x23gb 150st/bunt 20 bunt/bal Some Valid Text Some More Valid Text Some More Valid Text']
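
The two loose ends mentioned in this answer (keeping the '\n' between parts, and cleaning up the last item) could be handled roughly as follows. This is only a sketch under assumptions: the function name `group_rows` and its `sep` and `stop_prefix` parameters are hypothetical, and using "Some Valid Text" as the stop marker comes from the question's third rule, which the answer itself doubts will hold for real data.

```python
def group_rows(lines, sep='\n', stop_prefix='Some Valid Text'):
    """Group lines into rows; a row starts with the next expected number.

    sep         -- string used to join the parts of one row (assumption)
    stop_prefix -- a line starting with this text ends parsing (assumption)
    """
    expected = 1   # row number we are waiting for
    part = []      # lines collected for the current row
    result = []    # finished rows
    for line in lines:
        if line.startswith(stop_prefix):
            break                          # stop marker reached
        if line.startswith(f'{expected} '):
            if part:
                result.append(sep.join(part))
            part = [line]
            expected += 1
        elif part:                         # unlike the code above, lines
            part.append(line)              # before row 1 are ignored
    if part:
        result.append(sep.join(part))
    return result


# Made-up sample lines in the same shape as the question's data.
sample_lines = [
    'Pos Sequence Batch',
    '1 0001 aaa',
    'detail for row 1',
    '2 0002 bbb',
    'detail for row 2',
    'Some Valid Text',
    'Some More Valid Text',
]
print(group_rows(sample_lines))
print(group_rows(sample_lines, sep=' '))
```

Passing `sep=' '` reproduces the space-joined style of the output above, while the default keeps the line breaks inside each row.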

Warning: if your lines happen to look like this:

1 Some thing to eat
and some more data of it, containing
2 packs each
2 Some other thing to eat to get more muscles
and even more text containing 
3 things that make you BIGGGER
3 Last text ....

the parsing becomes unreliable and you will not get the data you want.
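
To make the failure concrete, the counting loop from this answer can be wrapped in a function and run on those lines. The function name `group_by_counter` is a hypothetical wrapper introduced only for this illustration; the logic is the same "next expected number plus space" check.

```python
def group_by_counter(lines):
    """Counting approach: a row starts with the next expected number."""
    expected, part, result = 1, [], []
    for line in lines:
        if line.startswith(f'{expected} '):
            if part:
                result.append(' '.join(part))
            part = [line]
            expected += 1
        else:
            part.append(line)
    if part:
        result.append(' '.join(part))
    return result


tricky = [
    '1 Some thing to eat',
    'and some more data of it, containing',
    '2 packs each',
    '2 Some other thing to eat to get more muscles',
    'and even more text containing',
    '3 things that make you BIGGGER',
    '3 Last text ....',
]

rows = group_by_counter(tricky)
for row in rows:
    print(repr(row))
```

The continuation line "2 packs each" is taken as the start of row 2, so the real row 2 is glued onto it, and the same happens with "3 things that make you BIGGGER". Three rows come back, but with the wrong boundaries.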

【Discussion】: