How to extract lines after specific words? Answers

【Question title】: How to extract lines after specific words?
【Posted】: 2019-10-10 15:13:27
【Question】:

I want to use a regular expression in Python 3 to get the date and specific items from a text. Here is an example:

text = '''
190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line2 fail
               line2 this is the 2nd fail
               line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line1 this is the 1st fail
               line2 success 
               line2 this is the 1st success process
               line2 this process need 4sec
               line3 success 
               line3 this is the 2st success process
               line3 this process need 2sec

'''

In the example above, I want to get all the lines that follow a "success" line. This is the output I need:

[('190219','7:05:30','line3 this is the 1st success process', 'line3 this process need 3sec'),
('200219', '9:10:10', 'line2 this is the 1st success process', 'line2 this process need 4sec', 'line3 this is the 2st success process','line3 this process need 2sec')]

Here is what I tried:

>>> newLine = re.sub(r'\t|\n|\r|\s{2,}',' ', text)
>>> newLine
' 190219 7:05:30 line1 fail  line1 this is the 1st fail  line2 fail  line2 this is the 2nd fail  line3 success line3 this is the 1st success process  line3 this process need 3sec 200219 9:10:10 line1 fail  line1 this is the 1st fail  line2 success line2 this is the 1st success process  line2 this process need 4sec  line3 success line3 this is the 2st success process  line3 this process need 2sec  '

I don't know the right way to get this result. I tried the following to match the line:

(\b\d{6}\b \d{1,}:\d{2}:\d{2})...

How can I solve this?

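【Comments】:

Does everything have to be done strictly with regex? Could part of the solution be non-regex?
No, it doesn't. I only mentioned what I had tried. I don't know any other way @kosayoda
How about this pattern: ^(\d[ \d:]+\d)(?:.*\n\B)*?.*success.*\n((?:\B.*\n?)+)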
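
The pattern in the last comment is given without any usage, so here is a minimal sketch of how it might be applied. It assumes the `re.MULTILINE` flag (so that `^` matches at every line start) and the indented layout from the question, where `\B` holds at the start of an indented line but not at the start of a date line. The `sample` text and the `rows` name are hypothetical and shortened for illustration.

```python
import re

# Pattern quoted from the comment; re.MULTILINE is an assumption so that
# ^ can match at the start of every date line.
PATTERN = re.compile(
    r"^(\d[ \d:]+\d)(?:.*\n\B)*?.*success.*\n((?:\B.*\n?)+)",
    re.MULTILINE,
)

# Shortened, hypothetical sample in the same layout as the question's text.
sample = (
    "190219 7:05:30 line1 fail\n"
    "               line1 this is the 1st fail\n"
    "               line3 success\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line1 success\n"
    "               line1 this process need 4sec\n"
)

rows = []
for m in PATTERN.finditer(sample):
    # Group 1 holds "date time"; group 2 holds the indented lines after
    # the first "success" line of that date block.
    date, time = m.group(1).split()
    details = [line.strip() for line in m.group(2).splitlines() if line.strip()]
    rows.append((date, time, *details))

print(rows)
```

Group 2 runs from the first "success" line to the end of the date block. On the question's full text, the later `line3 success` marker in the second block would therefore remain among the captured lines and would still need to be filtered out.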

Tags: python regex python-3.x string findall

【Solution 1】:

Here is a solution that uses a regular expression to get the date and time, and plain Python for everything else.

Preparing the input:

text = '''
190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line2 fail
               line2 this is the 2nd fail
               line3 success
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line1 this is the 1st fail
               line2 success
               line2 this is the 1st success process
               line2 this process need 4sec
               line3 success
               line3 this is the 2st success process
               line3 this process need 2sec
'''

# Strip the multiline string, split into lines, then strip each line
lines = [line.strip() for line in text.strip().splitlines()]
result = parse(lines)  # parse() is defined in the solution below

Solution:

import re

def parse(lines):
    result = []
    buffer = []

    success = False
    for line in lines:
        date = re.match(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})", line)
        if date:
            # Store previous match and reset buffer
            if buffer:
                result.append(tuple(buffer))
                buffer.clear()
            # Split the date and time and add to buffer
            buffer.extend(date.groups())
        # A status line ends with "success" or "fail" and switches collection on or off
        if line.endswith(("success", "fail")):
            success = line.endswith("success")
        # Any other line is kept only while it belongs to a succeeded process
        elif success:
            buffer.append(line)
    # Store the last match, if there is one
    if buffer:
        result.append(tuple(buffer))
    return result

Output:

result = [('190219', '7:05:30', 'line3 this is the 1st success process', 'line3 this process need 3sec'), ('200219', '9:10:10', 'line2 this is the 1st success process', 'line2 this process need 4sec', 'line3 this is the 2st success process', 'line3 this process need 2sec')]

【Comments】:

【Solution 2】:

A similar solution using groupby from itertools:

import re
from itertools import groupby

def parse(lines):
    result = []
    buffer, success_block = [], False
    for date, block in groupby(lines, key=lambda l: re.match(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})", l)):
        if date:
            buffer = list(date.groups())
            success_block = next(block).endswith('success')
            continue
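        for success, b in groupby(block, key=lambda l: re.match(r".*line\d\ssuccess$", l)):
            if success:
                success_block = True
                continue
            # Note: success_block is only switched on inside a date block; a later
            # "fail" status line in the same block does not switch it off again.
            if success_block:
                buffer.extend(b)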

        result.append(tuple(buffer))
        buffer = []
    return result
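
The outer groupby call relies on a detail that is easy to miss: re.match returns a fresh Match object for each date line and None for every other line. Match objects never compare equal to one another, while None equals None, so each date line forms its own group and each run of non-date lines collapses into one group. A small sketch of that behavior, using a hypothetical shortened `lines` list and a precompiled `DATE` pattern (both names are assumptions for illustration):

```python
import re
from itertools import groupby

# Same date/time pattern as in the solution, precompiled for reuse.
DATE = re.compile(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})")

# Hypothetical, already-stripped input lines.
lines = [
    "190219 7:05:30 line1 fail",
    "line1 this is the 1st fail",
    "line3 success",
    "200219 9:10:10 line1 success",
    "line1 done",
]

# The key is a Match object (truthy) for date lines and None otherwise.
# bool(key) is recorded only to make the grouping easy to inspect.
groups = [(bool(key), list(block)) for key, block in groupby(lines, key=DATE.match)]

for is_date, block in groups:
    print(is_date, block)
```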

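【Comments】:

【Solution 3】:

If you prefer more functional, more elegant code, the code below should work. I used a functional library for Python called toolz, which you can install with pip install toolz. The code uses no regular expressions at all, only partitioning and filtering. Point input_file at a file containing your text and try it.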


from toolz import partitionby, partition
from itertools import dropwhile

input_file = r'input_file.txt'


def line_starts_empty(line):
    return line.startswith(' ')


def clean(line):
    return line.strip()


def contains_no_success(line):
    return 'success' not in line.lower()


def parse(args):
    head_line, tail_lines = args
    result_head = head_line[0].split()[:2]
    result_tail = list(map(clean, dropwhile(contains_no_success, tail_lines)))
    return result_head + result_tail


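# Note: the "success" marker lines themselves are kept in each result.
with open(input_file) as f:  # close the file when done
    for item in map(parse, partition(2, partitionby(line_starts_empty, f))):
        print(item)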
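
If installing toolz is not an option, the same partition-and-filter idea can be sketched with the standard library alone. This is a rough equivalent under stated assumptions, not the answer author's code: `parse_blocks`, `is_indented`, `has_no_success`, and the shortened `sample` are hypothetical names, and itertools.groupby stands in for partitionby. As in the toolz version, the "success" marker line itself is kept in the output.

```python
from itertools import dropwhile, groupby


def is_indented(line):
    return line.startswith(" ")


def has_no_success(line):
    return "success" not in line.lower()


def parse_blocks(lines):
    """Pair each header line with the indented lines that follow it."""
    rows, head = [], None
    for indented, group in groupby(lines, key=is_indented):
        group = list(group)
        if not indented:
            # Header run: the last line carries the date and time.
            head = group[-1].split()[:2]
        elif head is not None:
            # Keep everything from the first line mentioning "success" onward.
            tail = [line.strip() for line in dropwhile(has_no_success, group)]
            rows.append(head + tail)
            head = None
    return rows


# Hypothetical, shortened input in the question's layout.
sample = [
    "190219 7:05:30 line1 fail",
    "               line1 this is the 1st fail",
    "               line3 success",
    "               line3 this process need 3sec",
]
rows = parse_blocks(sample)
print(rows)
```

Passing an open file object instead of `sample` should work the same way, since groupby only needs an iterable of lines.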

【Comments】:

【Solution 4】:

Here is my solution using regular expressions:

text = '''
190219 7:05:30 line1 fail
               line1 this is the 1st fail
               line2 fail
               line2 this is the 2nd fail
               line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line1 fail
               line1 this is the 1st fail
               line2 success 
               line2 this is the 1st success process
               line2 this process need 4sec
               line3 success 
               line3 this is the 2st success process
               line3 this process need 2sec
'''

import re

# find desired lines
count = 0
data = []
for item in text.splitlines():
    # find date and time at the start of a block
    match_date = re.search(r'\d+\s\d+:\d\d:\d\d', item)
    if match_date is not None:
        count = 1
        # store date and time as two separate entries
        data.extend(match_date.group().split(' '))
    # find line with success
    match = re.search(r'\w+\d\ssuccess', item)
    # handle collecting next lines
    if match is not None:
        count = 2

    if count > 2:
        data.append(item.strip())

    if count == 2:
        count += 1

# split the flat list into one sublist per date
# flag the entries that consist only of digits (the dates)
numbers = [item.isdigit() for item in data]

# get positions of the dates, plus the end of the list as a final boundary
indexes = [i for i, is_date in enumerate(numbers) if is_date]
indexes.append(len(data))

# create list of lists
result = [data[start:end] for start, end in zip(indexes, indexes[1:])]

Result:

[['190219', '7:05:30', 'line3 this is the 1st success process', 'line3 this process need 3sec'], ['200219', '9:10:10', 'line2 this is the 1st success process', 'line2 this process need 4sec', 'line3 this is the 2st success process', 'line3 this process need 2sec']]

【Comments】: