Finding correct regex for a bolded/underlined strings (Python) [duplicate]: answers

【Question title】: Finding correct regex for a bolded/underlined strings (Python) [duplicate]
【Posted】: 2020-08-01 14:42:24
【Question description】:

I want to find two kinds of delimited text in a single string. For example:

import re
bold_pattern = re.compile() #pattern for finding all words in between ** **
underline_pattern = re.compile() # pattern for finding all words in between __ __
a = "__Hello__ **This** __is__ **Lego**"

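How would I do this with a regex?

【Question comments】:

Start by learning about capture groups.

Tags: python regex

【Solution 1】:

Use a capture group to capture the words between each pair of delimiters: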

bold_pattern = re.compile(r'\*\*(.*?)\*\*')   # pattern for finding all words in between ** **
underline_pattern = re.compile(r'__(.*?)__')  # pattern for finding all words in between __ __

Then use them with re.findall:

bolds = re.findall(bold_pattern, a)
# or: bold_pattern.findall(a)
underlines = re.findall(underline_pattern, a)
# or: underline_pattern.findall(a)
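
Putting the pieces together, a minimal self-contained sketch using the sample string from the question might look like this:

```python
import re

# Non-greedy capture groups grab the text between each pair of delimiters.
bold_pattern = re.compile(r'\*\*(.*?)\*\*')
underline_pattern = re.compile(r'__(.*?)__')

a = "__Hello__ **This** __is__ **Lego**"

bolds = bold_pattern.findall(a)
underlines = underline_pattern.findall(a)

print(bolds)       # ['This', 'Lego']
print(underlines)  # ['Hello', 'is']
```

Because each pattern contains exactly one capture group, findall returns only the captured text, without the surrounding delimiters.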

【Comments】:

Thanks! Side note: since it is already compiled, I would call bold_pattern.findall(a), wouldn't I?

【Solution 2】:

Using re.findall, we can try:

import re

a = "__Hello__ **This** __is__ **Lego**"
terms = re.findall(r'\*\*(.*?)\*\*', a)
print(terms)

This prints:

['This', 'Lego']

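【Comments】:

【Solution 3】:

Hope this helps :) You first need to define the pattern with re.compile, and then use the findall function to extract the strings. As @Tim Biegeleisen suggested, you can also pass the pattern directly to the findall function and do it all in one line.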

import re
bold_pattern = re.compile(r'\*\*(.*?)\*\*') 
underline_pattern = re.compile(r'\_\_(.*?)\_\_')
a = "__Hello__ **This** __is__ **Lego**"
print(bold_pattern.findall(a))
print(underline_pattern.findall(a))

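【Comments】:

【Solution 4】:

Suggestion:

If you are working with multi-line text (i.e. text containing \n), you need to pass the argument flags=re.DOTALL to re.findall(). Without it, "." does not match a newline, so a bold span that crosses a line break is missed.

Case: multi-line text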

import re

# string to be searched
a = """
__Hello__ **This 
is a multiline test** __it is__ **Lego
**
"""

# pattern variations
bold_pattern = r'\*\*(.*?)\*\*'

# call re functions
match = re.findall(pattern=bold_pattern, string=a)
flag_match = re.findall(pattern=bold_pattern, string=a, flags=re.DOTALL)

# print results for observation
print(match)
print(flag_match) # using the flag

Returns:

[' __it is__ ']
['This \nis a multiline test', 'Lego\n']

From the Python 3.8.2 documentation:
"The expression's behaviour can be modified by specifying a flags value."
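
Note that the findall method of a compiled pattern does not accept a flags argument; with a compiled pattern the flag is given to re.compile() instead. A short sketch, using a variant of the multi-line string above written with explicit \n escapes (the names dotall_bold and inline_bold are illustrative only):

```python
import re

a = "__Hello__ **This\nis a multiline test** __it is__ **Lego\n**"

# The flag is supplied at compile time, so '.' also matches '\n'.
dotall_bold = re.compile(r'\*\*(.*?)\*\*', flags=re.DOTALL)

# The inline form (?s) switches on the same behaviour inside the pattern.
inline_bold = re.compile(r'(?s)\*\*(.*?)\*\*')

print(dotall_bold.findall(a))  # ['This\nis a multiline test', 'Lego\n']
print(inline_bold.findall(a))  # same result as above
```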

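Handling (\n)

Depending on your needs, you can handle \n in several different ways. If they are not needed, I would run re.sub() over the whole body of text to remove them all before doing anything else.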
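
As a rough sketch of that approach, assuming the line breaks carry no meaning and can each be replaced by a single space (the choice of replacement is an assumption made here, not part of the original answer):

```python
import re

a = "__Hello__ **This\nis a multiline test** __it is__ **Lego\n**"

# Collapse each newline, plus any whitespace around it, into one space.
flattened = re.sub(r'\s*\n\s*', ' ', a)

# With the newlines gone, the plain pattern works without re.DOTALL.
bolds = re.findall(r'\*\*(.*?)\*\*', flattened)
print(bolds)  # ['This is a multiline test', 'Lego ']
```

If the trailing space in the last item is unwanted, calling .strip() on each result is one way to tidy it up.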

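To compile or not to compile?

From the Python 3.8.2 documentation:
"Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form...
...but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."

and

"The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions."

So, unless you are using a whole bunch of patterns, you should not see a noticeable improvement from compiling.

You can also use the %%time magic command to test both options and see whether you find an advantage locally!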
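
%%time is only available in IPython/Jupyter. Outside a notebook, the standard-library timeit module gives a comparable measurement. The following is only a sketch: the repetition count is arbitrary and the timings will differ from machine to machine.

```python
import re
import timeit

a = "__Hello__ **This** __is__ **Lego**"
raw_pattern = r'\*\*(.*?)\*\*'
compiled_pattern = re.compile(raw_pattern)

n = 100_000  # arbitrary number of repetitions

# Module-level function: the pattern is looked up in re's cache on each call.
t_module = timeit.timeit(lambda: re.findall(raw_pattern, a), number=n)

# Precompiled pattern object: no per-call cache lookup.
t_compiled = timeit.timeit(lambda: compiled_pattern.findall(a), number=n)

print(f"re.findall:       {t_module:.3f} s")
print(f"compiled.findall: {t_compiled:.3f} s")
```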

Good luck!

【Comments】: