【Question Title】: Extract the rows from the text + python regex
【Posted】: 2018-03-30 05:12:19
【Question】:

I am trying to extract whole rows from a text file, but it does not work as expected.

Sample text file content:

data = """Add TTFF LEVERERGE 30 mp -5%
Some Text, Some Text
5882950 Abc Lahd
Pos Sequence Batch datax datay dataz dataa datab
1 00061680 904834 20.35 REV 177,650 5329,50
Bundled 2-rev 42al/xyz
Neon Classic Unit 1300 abc \ 1638\48
2 00012815 55244 815 FWD 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/xyz 20 abc/xyz
3 90072815 65944 212 KRT 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/bunt 20 bunt/bal
Some Valid Text
Some More Valid Text Some More Valid Text"""

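I want all three rows as a list, so that I can then extract specific values from each of them.

The logic is:

  1. Stop extracting right before a new row begins
  2. Every row starts with a sequence number (1, 2, 3, ..., 99, etc.)
  3. Treat "Some Valid Text" as the end of the last row

(Rule 3 is not yet reflected in the regex passed to re.findall below, because the first two rules do not work yet.)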

re.findall(r'(^\d{1,2}\s.*?\n^\d)', data, re.DOTALL|re.M)

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048\n2',
 '3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n1']
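
Why this attempt returns only two items can be seen on a much smaller input. The sketch below uses a hypothetical five-row string (`text` is not the real data) and compares the pattern above with a variant that moves the terminator into a lookahead. In the first pattern, `\n^\d` is part of the match, so the first digit of the next row is consumed and that row can no longer start a match of its own. The lookahead variant only checks for the next row, but it still needs an end condition (rule 3) to pick up the last row.

```python
import re

# Hypothetical miniature input: three rows, each with one continuation line.
text = "1 a\nx\n2 b\ny\n3 c\nz\nEnd"

# Terminator inside the match: "\n" plus the next row's first digit are consumed.
consuming = re.findall(r'^\d{1,2}\s.*?\n^\d', text, re.DOTALL | re.M)

# Terminator inside a lookahead: it is tested but not consumed.
lookahead = re.findall(r'^\d{1,2}\s.*?(?=\n\d)', text, re.DOTALL | re.M)

print(consuming)  # ['1 a\nx\n2']
print(lookahead)  # ['1 a\nx', '2 b\ny']
```

In both cases row 3 is missing, because nothing after it looks like the start of another row.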

The expected result is:

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048\n',
'2 00012815 55244 815 FWD 164,720 18448,64\n    UnBundled 2-pag\n    Mathrine Classic straight Tilt 2 xyz / 23,2x23gb\n    150st/xyz 20 abc/xyz',
'3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/bunt 20 bunt/bal']

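Any guidance/help on extracting the rows from the text?

【Comments】:

  • Post the final expected result
  • @RomanPerekhrest - edited, thanks for the suggestion
  • OK, what if the rows start with unordered numbers such as 2 text ...., 7 text ...., 10 text, 3 text ... instead of 1 .... 2 ... 3 ...?

Tags: python regex python-3.x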


【Solution 1】:

Use the re.findall() function with a lookahead-based regex pattern:

import re

rows = re.findall(r'(^\d{1,2} .+?)(?=\n(?:\d+ |Some Valid Text))', data, re.DOTALL | re.M)
print(rows)

Output:

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048', '2 00012815 55244 815 FWD 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/xyz 20 abc/xyz', '3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/bunt 20 bunt/bal']
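
The same pattern can be spelled out with re.VERBOSE to make each part readable. This is a minimal sketch and not part of the original answer: `sample` is a shortened, hypothetical excerpt in the layout of the question's data, and `ROW` and `parsed` are illustrative names. The last step shows one way to get at the individual values of each row's first line.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's data.
sample = """Pos Sequence Batch datax datay dataz dataa datab
1 00061680 904834 20.35 REV 177,650 5329,50
Bundled 2-rev 42al/xyz
2 00012815 55244 815 FWD 164,720 18448,64
UnBundled 2-pag
150st/xyz 20 abc/xyz
Some Valid Text
Some More Valid Text"""

ROW = re.compile(
    r"""
    (^\d{1,2}[ ].+?)          # a row: 1-2 digits and a space at line start, then lazily anything
    (?=\n                     # stop, without consuming, before a line break followed by
       (?:\d+[ ]              #   the next row number and a space
         |Some[ ]Valid[ ]Text #   or the end marker
       )
    )
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)

rows = ROW.findall(sample)

# Split the first line of every row into its whitespace-separated values.
parsed = [row.split("\n")[0].split() for row in rows]

print(len(rows))     # 2
print(parsed[0][1])  # 00061680
```

Note that a continuation line that itself starts with digits followed by a space would end a row early, so the pattern relies on the detail lines not looking like row starts.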

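【Comments】:

    【Solution 2】:

    If your regex has to "count" as part of the pattern, I would not use a regex; you should use a parser. Regexes are meant for regular patterns, not for counting (although some people here create regexes I would not have thought possible).

    Here is a simple, straightforward non-regex approach. The last item has to be cleaned up afterwards, because you did not provide a distinct "STOP HERE" marker. I very much doubt that ' Some Valid Text Some More Valid Text Some More Valid Text' will be part of your real text, so it does not qualify as a "stop" condition.

    The output also does not contain the terminating '\n' characters; I use them to split the text into, well, lines. If you really need them, you can join() the parts with '\n' instead of ' '.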

    data = """Add TTFF LEVERERGE 30 mp -5%
    Some Text, Some Text
    5882950 Abc Lahd
    Pos Sequence Batch datax datay dataz dataa datab
    1 00061680 904834 20.35 REV 177,650 5329,50
    Bundled 2-rev 42al/xyz
    Neon Classic Unit 1300 abc \ 1638\48
    2 00012815 55244 815 FWD 164,720 18448,64
    UnBundled 2-pag
    Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
    150st/xyz 20 abc/xyz
    3 90072815 65944 212 KRT 164,720 18448,64
    UnBundled 2-pag
    Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
    150st/bunt 20 bunt/bal
    Some Valid Text
    Some More Valid Text Some More Valid Text"""
    
    rdata = data.split('\n')
    skipprows = rdata.index('Pos Sequence Batch datax datay dataz dataa datab')
    lines = rdata[skipprows + 1:]
    
    i = 1       # look for this number + space at the start of a line to see when one row is done
    part = []   # collects parts that belong to one line
    result = [] # holds the joined lines from part
    for li in lines:
        if li.startswith(f'{i} '):            # look for linenr + space
            if part:                          # do not add empty parts
                result.append(' '.join(part)) # add joined if something in it
            part = [li]                       # start with current li for next parts
            i += 1                            # increase so we look for next one
        else:
            part.append(li)
    
    if part:                                  # add last part if not empty
        result.append(' '.join(part))
    
    print(result)                             # print all
    

    Output:

    ['1 00061680 904834 20.35 REV 177,650 5329,50 Bundled 2-rev 42al/xyz Neon Classic Unit 1300 abc \\ 1638\x048', 
     '2 00012815 55244 815 FWD 164,720 18448,64 UnBundled 2-pag Mathrine Classic straight Tilt 2 xyz / 23,2x23gb 150st/xyz 20 abc/xyz', 
     '3 90072815 65944 212 KRT 164,720 18448,64 UnBundled 2-pag Mathrine Classic straight Tilt 2 xyz / 23,2x23gb 150st/bunt 20 bunt/bal Some Valid Text Some More Valid Text Some More Valid Text']
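
If a stop marker does exist, the cleanup of the last item can be folded into the loop. The following is a sketch under that assumption, not the answer's original code: `collect_rows`, `stop_marker` and `sep` are hypothetical names, and the marker text "Some Valid Text" is taken from rule 3 of the question. Passing `sep="\n"` keeps the line breaks inside each row.

```python
def collect_rows(lines, stop_marker="Some Valid Text", sep="\n"):
    """Group lines into rows numbered 1, 2, 3, ... and stop at stop_marker."""
    rows = []      # finished rows
    part = []      # lines belonging to the row currently being collected
    expected = 1   # row number we are waiting for next
    for line in lines:
        if line.startswith(stop_marker):   # assumed end-of-data marker
            break
        if line.startswith(f"{expected} "):
            if part:                       # close the previous row, if any
                rows.append(sep.join(part))
            part = [line]
            expected += 1
        elif part:                         # ignore lines before the first row
            part.append(line)
    if part:                               # close the last row
        rows.append(sep.join(part))
    return rows


lines = ["1 a", "x", "2 b", "y", "Some Valid Text", "More text"]
print(collect_rows(lines))           # ['1 a\nx', '2 b\ny']
print(collect_rows(lines, sep=" "))  # ['1 a x', '2 b y']
```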
    

    Warning: if your rows happen to look like this:

    1 Some thing to eat
    and some more data of it, containing
    2 packs each
    2 Some other thing to eat to get more muscles
    and even more text containing 
    3 things that make you BIGGGER
    3 Last text ....
    

    parsing becomes unreliable and you will not get the data you want.
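
One way to reduce that risk is to test for more than the leading number. The sketch below assumes, based only on the sample data in the question, that a genuine row header looks like "<pos> <8-digit sequence> <numeric batch> ..."; `HEADER` and `split_rows` are hypothetical names. Because it does not count, it would also tolerate row numbers that are out of order. It does not handle an end marker.

```python
import re

# Assumption: a real row header is "<pos> <8-digit sequence> <numeric batch> ...".
HEADER = re.compile(r"^\d{1,2} \d{8} \d+ ")


def split_rows(lines):
    """Start a new row at every header line; attach other lines to the current row."""
    rows = []
    for line in lines:
        if HEADER.match(line):
            rows.append([line])
        elif rows:                 # continuation line of the current row
            rows[-1].append(line)
    return rows


lines = [
    "1 00061680 904834 20.35 REV",
    "2 packs each",                # starts with "2 " but is not a header
    "2 00012815 55244 815 FWD",
    "3 things",                    # starts with "3 " but is not a header
]
print(len(split_rows(lines)))      # 2
```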

    【Comments】:
