Python regex: Split a large text file into smaller parts

【Question title】: Python regex: Split a large text file into smaller parts
【Posted】: 2021-08-11 21:01:13
【Question description】:

Consider the following text file.

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

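How can I extract the second, third, and fourth blocks and save each one under the date given above it? For example, I need to extract all the lines in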

     ~~~~~~~~~~~~~~~~~~~~~~~
    |                       |
    | Second Block of text  |
    |                       |
     ~~~~~~~~~~~~~~~~~~~~~~~

and then save them to a file or variable named Monday 8 August 2021.

With the following regular expression I can find the lines that contain the dates: https://regex101.com/r/nKW1W4/1

-(?P<date>.*?)-

【Question comments】:

Are there always 23 - characters in the section headers?
@fsimonjetz Yes, usually.

Tags: python regex

【Solution 1】:

You can use the following expression to match your blocks and use the first group as the file name:

^
-+([^-]+)-+$
(.+?(?=^--|\Z))

See a demo on regex101.com (mind the modifiers).
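As a rough sketch of how this expression might be applied from Python: the answer does not list the modifiers, so the combination `re.MULTILINE | re.DOTALL | re.VERBOSE` used here is an assumption, and `BLOCK_RE`, `RULE`, `sample`, and `blocks` are illustrative names. The sample is a shortened stand-in for the real file.

```python
import re

# Solution 1's expression, laid out for re.VERBOSE so that the line breaks
# inside the pattern are ignored. The flag combination is an assumption.
BLOCK_RE = re.compile(
    r"""
    ^-+([^-]+)-+$        # header line, group 1 holds the date
    (.+?(?=^--|\Z))      # group 2: text up to the next header or the end
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

RULE = "-" * 23  # the run of dashes on each side of a date

# Shortened stand-in for the file contents.
sample = (
    "First Block of text\n\n"
    f"{RULE} Monday 8 August 2021 {RULE}\n\n"
    "Second Block of text\n\n"
    f"{RULE} Friday 12 August 2021 {RULE}\n\n"
    "3rd Block of text\n"
)

# Map each date to the block that follows it.
blocks = {
    m.group(1).strip(): m.group(2).strip()
    for m in BLOCK_RE.finditer(sample)
}

for date, body in blocks.items():
    print(date, "->", body)
```

The text before the first header is not captured, which matches the request to keep only the dated blocks.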

【Discussion】:

Thanks for your answer. My question is how to extract each block according to the corresponding date that was found: regex101.com/r/CqYyZk/1

【Solution 2】:

You can use:

import re

input_text = """
 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
"""

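# Split on the header lines; the capture group keeps each date in the result
a = re.split(r'-+(.*?)-+', input_text)

# Strip surrounding whitespace from every piece (blocks and dates alike)
for k, v in enumerate(a):
    a[k] = v.strip()

print(a)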

A more concise list comprehension, suggested by @fsimonjetz:

import re

input_text = """
 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
"""

result = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]
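The split result alternates between blocks and dates, so the two can be paired up afterwards. The following is a minimal sketch of that pairing, assuming the text always starts with a block that has no date; `RULE`, `sample`, `parts`, and `blocks_by_date` are illustrative names and the sample is a shortened stand-in for the real file.

```python
import re

RULE = "-" * 23  # the run of dashes on each side of a date

# Shortened stand-in for the file contents.
sample = (
    "First Block of text\n\n"
    f"{RULE} Monday 8 August 2021 {RULE}\n\n"
    "Second Block of text\n\n"
    f"{RULE} Friday 12 August 2021 {RULE}\n\n"
    "3rd Block of text\n"
)

parts = [x.strip() for x in re.split(r"-+(.*?)-+", sample)]

# parts[0] is the undated first block; after it, dates and blocks alternate.
dates, bodies = parts[1::2], parts[2::2]
blocks_by_date = dict(zip(dates, bodies))

print(blocks_by_date)
```

Note that this pattern splits on any two runs of `-` on the same line, so a block that itself contains hyphens (for example `a-b-c`) would also be split.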

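【Discussion】:

I was about to post a solution with re.split(), but you beat me to it :) Just one suggestion: if you move the parentheses as in r'-+(.*?)-+', the result will exclude the leading and trailing '-'. We can then .strip() the items in the resulting list to get very clean data to work with.
@yo.go Thanks; a very simple answer.
@fsimonjetz Could you post an answer on how to use .strip()?
@fsimonjetz I didn't read the question properly, my bad (: Thanks.
@sci9 yo.go has just updated their post; I would go with the more concise list comprehension: result = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]. Take your pick!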

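【Solution 3】:

In your pattern you match only a single - on the left and on the right, while .*? non-greedily matches 0+ characters other than a newline.

That gives you many partial matches instead of one match for the whole line.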
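A small sketch of the difference, run on a single header line. The variable names are illustrative, and the second pattern is just the header part of the expression discussed in this answer.

```python
import re

header = "-" * 23 + " Monday 8 August 2021 " + "-" * 23

# The pattern from the question: one "-" on each side of a lazy group.
# Neighbouring dashes pair up, so most matches capture an empty string.
partial = re.findall(r"-(?P<date>.*?)-", header)
print(len(partial))
print([p for p in partial if p])

# Matching runs of "-" on both sides gives one match for the whole line.
whole = re.findall(r"^-+([^-]+)-+$", header)
print(whole)
```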

You can also use a match instead, with capture group 1 as the file name and capture group 2 as the data.

^-+([^-]+)-+((?:\n(?!--).*)*)

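Explanation

^ Start of the line (the code below passes re.M)
-+ Match - 1+ times
([^-]+) Capture group 1 for the date part, matching any character except -
-+ Match - 1+ times
( Capture group 2 for the data part
(?:\n(?!--).*)* Match all following lines that do not start with --
) Close group 2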

Regex demo

For example

import re

pattern = r"^-+([^-]+)-+((?:\n(?!--).*)*)"

s = (" ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| First Block of text   |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| Second Block of text  |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| 3rd Block of text     |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    " \n"
    "----------------------- Friday 19 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| 4th Block of text     |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n")

matches = re.findall(pattern, s, re.M)
if matches:
    filename = matches[0][0].strip()
    data = matches[0][1].strip()

    print(filename)
    print(data)

Output

Monday 8 August 2021
~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
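The example above prints only the first match. A possible way to go through every match and save each block under its date is sketched below. The helper `save_blocks`, the `.txt` extension, and the temporary output directory are assumptions made for illustration and are not part of the answer.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"^-+([^-]+)-+((?:\n(?!--).*)*)", re.M)


def save_blocks(text, out_dir):
    """Write each dated block to <out_dir>/<date>.txt and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for date, data in PATTERN.findall(text):
        path = out_dir / f"{date.strip()}.txt"  # ".txt" is an assumed extension
        path.write_text(data.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


RULE = "-" * 23  # the run of dashes on each side of a date

# Shortened stand-in for the file contents.
sample = (
    "First Block of text\n\n"
    f"{RULE} Monday 8 August 2021 {RULE}\n\n"
    "Second Block of text\n\n"
    f"{RULE} Friday 12 August 2021 {RULE}\n\n"
    "3rd Block of text\n"
)

# Write into a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    written = save_blocks(sample, tmp)
    names = [p.name for p in written]
    first_body = written[0].read_text(encoding="utf-8")

print(names)
```

To keep the blocks in a variable instead of files, `dict(PATTERN.findall(text))` gives a date-to-data mapping, although its keys and values would still need `.strip()`.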

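【Discussion】:

Thanks for your answer; another question that comes up here is which regex we can use to find and extract the First block of text. For example: regex101.com/r/XAQQ5x/1
@sci9 Like this? ^(?P<message>.*(?:\n(?!--).*)*) regex101.com/r/8iFmCd/1 You can use re.match to do that.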
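A minimal sketch of that suggestion with `re.match`, which anchors at the start of the string and therefore picks up everything before the first header line. `FIRST_RE`, `RULE`, `sample`, and `first_block` are illustrative names, and the sample is a shortened stand-in for the real file.

```python
import re

# The pattern from the comment above, compiled once.
FIRST_RE = re.compile(r"^(?P<message>.*(?:\n(?!--).*)*)")

RULE = "-" * 23  # the run of dashes on each side of a date

# Shortened stand-in for the file contents.
sample = (
    "First Block of text\n\n"
    f"{RULE} Monday 8 August 2021 {RULE}\n\n"
    "Second Block of text\n"
)

# re.match only tries the start of the string, so the match stops right
# before the first line that begins with "--".
m = FIRST_RE.match(sample)
first_block = m.group("message").strip() if m else ""

print(first_block)
```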