Python: How to get string between matches?

【Question title】: Python: How to get string between matches?
【Posted】: 2014-08-14 23:17:06
【Question description】:

I have

import re

with open("file.txt", "r") as FILE:  # long text file
    TEXT = FILE.read()

# long identification code with dots (.) and hyphens (-)
regex = r"process \d\d\d\d\d\d\d\-\d\d\.\d\d\d\d\.\d+\.\d\d\.\d\d\d\d"
SRC = re.findall(regex, TEXT, flags=re.IGNORECASE | re.MULTILINE)

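How can I get the text between the first character of one occurrence, SRC[i], and the first character of the next occurrence, SRC[i+1], and so on? I couldn't find any straightforward, satisfying answer...

Edit with more information: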

pattern = r'process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}'

sample_input = "Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"

sample_output[0] = "Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern "
sample_output[1] = "Process 2234567-89.1234.12431242.12.1234 : chars and more text "
sample_output[2] = "Process 3234567-89.1234.12431242.12.1234 - more text "
sample_output[3] = "process 3234567-89.1234.12431242.12.1234    "

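【Question comments】:

Please provide some sample input and the expected output.
You can shorten your regex to: \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}
What exactly are you asking? Show some of your input; I think split might be useful.
Added sample input and output.

Tags: python regex python-2.7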

【Solution 1】:

You can use this regex:

(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)

Working demo

Match information

MATCH 1
1.  [0-105] `Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern `
MATCH 2
1.  [105-168]   `Process 2234567-89.1234.12431242.12.1234 : chars and more text `
MATCH 3
1.  [168-221]   `Process 3234567-89.1234.12431242.12.1234 - more text `
MATCH 4
2.  [221-267]   `Process 3234567-89.1234.12431242.12.1234 (...)`

You can use this code:

import re

sample_input = "Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"
m = re.match(r"(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)", sample_input)
m.group(1)       # The first parenthesized subgroup.
m.groups()       # Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern
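
Note that re.match only tries the pattern at the very start of the string, so it returns at most one match. To collect every segment, something like re.finditer is needed. The following is a minimal sketch, not the answerer's own code: the names ID and SEGMENT are made up for illustration, and it assumes that a segment may span line breaks (hence re.DOTALL) and that the "process" prefix should match case-insensitively, as in the question.

```python
import re

# Hypothetical names; the ID pattern itself is the one from the question.
ID = r"process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}"

# One segment: an ID, then as little text as possible,
# stopping right before the next ID or at the end of the input.
SEGMENT = re.compile(r"%s.*?(?=%s|\Z)" % (ID, ID), re.IGNORECASE | re.DOTALL)

sample_input = "Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"

# finditer walks through the whole string instead of stopping at the first match.
segments = [m.group(0) for m in SEGMENT.finditer(sample_input)]

for seg in segments:
    print(repr(seg))
```

On the sample input this gives four segments, the last one being the entry that starts with the lowercase "process".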

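【Comments】:

The regex looks good, but I don't know which re function to use in Python or how to parse the matches it finds.
@Sarchophagi It's a simple regex; you just need to grab the content from the capturing groups. If you open the link I provided in the answer, you can go to the code generator section to see how to use it.
Nice. I got it to work with the sample, but not with the real, large file. Nothing matched. Maybe the file is too big (~4 MB .txt)? Or maybe it's because it contains special characters like á ã à ó é?
@Sarchophagi That would be a different question from this one. I think you should post a new question about it, and you could consider closing this one since I have already answered it.
In fact, m.groups() only returns a tuple with ("first occurrence", None). Weird...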

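【Solution 2】:

Suppose you have a string some_str = 'abcARelevant_SubstringAcba' and you want the string between the first A and the second A; that is, the desired output is 'Relevant_Substring'.

You can find the indices at which A occurs in some_str with the following line:
inds = [a.start() for a in re.finditer('A', some_str)]

So now inds = [3, 22], and some_str[inds[0]+1:inds[1]] will contain 'Relevant_Substring'.

This should extend to your problem.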
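
One way this could extend to the question is sketched below. It assumes that each segment should run from the start of one match to the start of the next, with the last segment running to the end of the text. The helper name split_at_match_starts is hypothetical.

```python
import re

def split_at_match_starts(pattern, text, flags=0):
    """Cut text into pieces that each begin at the start of a match."""
    starts = [m.start() for m in re.finditer(pattern, text, flags)]
    # Each piece ends where the next one starts; the last ends with the text.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

pattern = r"process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}"
sample_input = "Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"

segments = split_at_match_starts(pattern, sample_input, re.IGNORECASE)
```

Because the pieces are plain slices between match positions, any text before the first match is simply left out.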

Edit: here is a concrete example.

Suppose you have a file "file.txt" containing the following text:

Stuff I don't want.
0
Stuff I do want.
1
More stuff I don't want.

You want to use every digit (0-9) as a delimiter, so both the 0 and the 1 above act as delimiters. Try the following code:

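import re

with open("file.txt", "r") as f:
    data = f.read()

patt = re.compile('[0-9]')
# start index of every digit; each digit acts as a delimiter
inds = [a.start() for a in patt.finditer(data)]
# text between the first and the second delimiter
print(data[inds[0]+1:inds[1]])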

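This should print Stuff I do want. (together with the line breaks that surround it in the file).

【Comments】:

【Solution 3】:

You don't need re to find the string between two characters: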

some_str = 'abcARelevant_SubstringAcba'
print(some_str.split("A", 2)[1])
# Relevant_Substring
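
str.split needs a fixed delimiter string. For a delimiter described by a pattern, as in the question, re.split with a lookahead may do a similar job. This is only a sketch: it assumes Python 3.7 or later, where re.split also splits on empty matches, and split_before is a hypothetical helper name.

```python
import re

def split_before(pattern, text, flags=0):
    """Split text just before every match of pattern, keeping the matched text."""
    # The lookahead matches the empty string in front of each occurrence.
    pieces = re.split(r"(?=%s)" % pattern, text, flags=flags)
    # Drop the empty piece produced when the text starts with a match.
    return [p for p in pieces if p]

pattern = r"process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}"
text = ("Process 1234567-89.1234.12431242.12.1234 - first "
        "process 2234567-89.1234.12431242.12.1234 : second")

parts = split_before(pattern, text, flags=re.IGNORECASE)
```

Unlike slicing between match positions, this keeps any text that precedes the first match as a piece of its own.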

【Comments】: