Python read lines from file by regex

【Question title】: Python read lines from file by regex
【Posted】: 2018-04-06 04:10:39
【Question】:

I have a text file that I want to read into a list in a particular format.

When I write:

with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n') for line in f]

I get:

27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text

I want to get:

27/08/15, 15:45 - text continue text continue text 2
27/08/15, 16:10 - new text new text 2 new text 3
27/08/15, 19:55 - more text

I only want to split where the format \nDD/MM/YY, HH:MM - occurs. Unfortunately, I'm no expert in regular expressions. I tried:

with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n'r'[\d\d/\d\d/\d\d - ]') for line in f]

This gives the same result. On second thought, it makes sense that it doesn't work. I would still appreciate some help.
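As background for why the second attempt changes nothing, here is a minimal sketch. It relies on two facts about Python: adjacent string literals are concatenated into one string, and str.rstrip treats its argument as a set of characters to strip rather than as a pattern. The helper name strip_attempt and the sample lines are only illustrative.

```python
# Adjacent string literals are concatenated, so rstrip receives one string.
# Its characters form the set to strip: newline, '[', backslash, 'd', '/',
# space, '-' and ']'.
chars = '\n' r'[\d\d/\d\d/\d\d - ]'


def strip_attempt(line):
    # Same call as in the second attempt; no regex matching takes place.
    return line.rstrip(chars)


print(repr(strip_attempt('27/08/15, 15:45 - text\n')))
print(repr(strip_attempt('continue text\n')))
```

Both sample lines end in "t", which is not in the set, so only the trailing newline is removed and the result matches the first attempt.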

【Question comments】:

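Why not just test the current line and, if it matches, output a newline first?
What does the file look like?
The file looks like 27/08/15, 15:45 - text continue text continue text 2, but when I read the lines I get 27/08/15, 15:45 - text\ncontinue text\ncontinue text 2
@IgnacioVazquez-Abrams There is no unwanted data. I'm using everything; I just want it in the right format.
@sheldonzy, have you tried open('chat_history.txt', encoding='utf8', newline='')?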

Tags: python regex file split strip

【Solution 1】:

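import re

with open('chat_history.txt', encoding='utf8') as f:
    # Replace only the newlines that are not followed by a new "DD/MM/YY, HH:MM - " stamp.
    text = re.sub(r'\n(?!\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - )', ' ', f.read().rstrip('\n'))
    l = text.split('\n')

print(l)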

【Comments】:

【Solution 2】:

My solution uses a simpler regular expression than Jan's. The code around the regex is a bit more verbose, though.

First, the input file:

$ cat -e chat_history.txt
27/08/15, 15:45 - text$
continue text$
continue text 2$
27/08/15, 16:10 - new text$
new text 2$
new text 3$
27/08/15, 19:55 - more text$

The code:

import re

date_time_regex = re.compile(r'^\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - .*')

with open('chat_history.txt', encoding='utf8') as f:
    first_date = True
    for line in f:
        line = line.rstrip('\n')

        if date_time_regex.match(line):
            if not first_date:
                # Print a newline character before printing a date
                # if it is not the first date.
                print()
            else:
                first_date = False
        else:
            # Print a separator, without a newline character.
            print(' ', end='')

        # Print the original line, without a newline character.
        print(line, end='')

# Print the last newline character.
print()

Running the code (cat -e shows that there are no trailing spaces):

$ python3 chat.py | cat -e
27/08/15, 15:45 - text continue text continue text 2$
27/08/15, 16:10 - new text new text 2 new text 3$
27/08/15, 19:55 - more text$
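Since the question asks for a list rather than printed output, the same line-by-line test could also collect the entries. The following is only a sketch: merge_lines is a hypothetical helper name, the regex is shortened to the date prefix, and io.StringIO stands in for the opened chat_history.txt.

```python
import io
import re

# A line that starts with "DD/MM/YY, HH:MM - " begins a new entry.
date_time_regex = re.compile(r'\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - ')


def merge_lines(lines):
    """Join continuation lines onto the preceding dated line (hypothetical helper)."""
    entries = []
    for line in lines:
        line = line.rstrip('\n')
        if date_time_regex.match(line) or not entries:
            # A dated line (or the very first line) starts a new entry.
            entries.append(line)
        else:
            # Anything else continues the current entry.
            entries[-1] += ' ' + line
    return entries


# io.StringIO stands in for the opened chat_history.txt file.
sample = io.StringIO(
    '27/08/15, 15:45 - text\n'
    'continue text\n'
    'continue text 2\n'
    '27/08/15, 16:10 - new text\n'
    'new text 2\n'
    'new text 3\n'
    '27/08/15, 19:55 - more text\n'
)
mylist = merge_lines(sample)
print(mylist)
```

With a real file, passing the file object from open('chat_history.txt', encoding='utf8') to merge_lines should work the same way, because the helper only iterates over lines.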

【Comments】:

【Solution 3】:

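Admittedly, this may be way over the top, and I'm sure there are other ways to achieve the same goal. I want to present my solution here using the newer regex module with (?(DEFINE)...). Code first, explanation afterwards: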

import regex as re

string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

rx = re.compile(r'''
    (?(DEFINE)
        (?P<date>\d{2}/\d{2}/\d{2},\ \d{2}:\d{2}) # the date format
    )
    ^                    # anchor, start of the line
    (?&date)             # the previously defined format
    (?:(?!^(?&date)).)+  # "not date" as long as possible
''', re.M | re.X | re.S)


entries = (m.group(0).replace('\n', ' ') for m in rx.finditer(string))
for entry in entries:
    print(entry)

This yields:

27/08/15, 15:45 - text continue text continue text 2 
27/08/15, 16:10 - new text new text 2 new text 3 
27/08/15, 19:55 - more text

Basically, this approach looks for blocks of dates with text in between:

date
text1
text2
date
text3
date
text

... and puts them together like

date text1 text2
date text3
date text

The "date format" is defined in the date group; the structure after it is as follows

date "match as long as there's no date in the next line"

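This is achieved with a negative lookahead. Afterwards, all newlines in each match are replaced with spaces (this happens in the generator expression).
Obviously, we could get the same result without the regex module and the (?(DEFINE)...) block, but then we would have to repeat the date pattern in both the match and the lookahead.
Finally, for the expression, see a demo on regex101.com.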
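A sketch of that alternative with the standard library re module, which has no (?(DEFINE)...): the date pattern is kept in a Python variable (DATE, a name chosen here for illustration) and interpolated twice, so the repetition ends up in the compiled pattern but not in the source. The added .rstrip() is a small deviation that drops the trailing space left by each entry's final newline.

```python
import re

string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

# The date format, written once and inserted twice below.
# The escaped space is needed because the pattern is compiled in verbose mode.
DATE = r'\d{2}/\d{2}/\d{2},\ \d{2}:\d{2}'

rx = re.compile(rf'''
    ^{DATE}                # a date at the start of a line
    (?:(?!^{DATE}).)+      # everything up to the next line starting with a date
''', re.M | re.X | re.S)

# Join each block into a single line and drop the trailing space.
entries = [m.group(0).replace('\n', ' ').rstrip() for m in rx.finditer(string)]
for entry in entries:
    print(entry)
```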

【Comments】:

Couldn't you just split on \n(?=\d{2}/\d{2}/\d{2},\ \d{2}:\d{2})?
@SebastianProske: Yes.
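A minimal sketch of the split suggested in the comment above, using the standard library re.split. The sample text is inlined in place of the file contents, and the escaped space from the comment is written as a plain space because verbose mode is not used here.

```python
import re

text = (
    '27/08/15, 15:45 - text\n'
    'continue text\n'
    'continue text 2\n'
    '27/08/15, 16:10 - new text\n'
    'new text 2\n'
    'new text 3\n'
    '27/08/15, 19:55 - more text\n'
)

# Split only at newlines directly followed by a date stamp. The lookahead
# does not consume the stamp, so it stays at the start of the next entry.
parts = re.split(r'\n(?=\d{2}/\d{2}/\d{2}, \d{2}:\d{2})', text.strip('\n'))

# The remaining newlines sit inside one entry; turn them into spaces.
entries = [part.replace('\n', ' ') for part in parts]
for entry in entries:
    print(entry)
```

This reads the whole text into memory first, which is probably fine for a chat export but differs from the line-by-line approach in Solution 2.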