How to use re.sub with capturing groups correctly?

【Question title】: How to use re.sub with capturing groups correctly?
【Posted】: 2020-03-30 10:00:05
【Question】:

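I have a URL that looks like this:

url = 'https://www.sx.com/found/text.html'

I want to use a capturing group to replace the text between the third and fourth slashes, i.e. replace "found" with a new string ("news"), like this: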

replace = re.sub(r'(?:/.*/)(.*)/', r'/news/\1', url)

Desired result:

replace = https://www.sx.com/news/text.html

But I get this result instead:

https:/news/text.html

What am I doing wrong here?

【Question comments】:

Use re.sub(r'^(https?://[^/]*/)[^/]+/', r'\1news/', url)
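To see why the pattern in the question misbehaves, it can help to inspect what it actually matches before substituting. The following is a minimal sketch, not part of the original thread; the variable names `m` and `fixed` are illustrative.

```python
import re

url = "https://www.sx.com/found/text.html"

# The pattern from the question: the search starts at the first "/" it finds,
# which is the first slash of "//", so the match swallows the host as well.
m = re.search(r"(?:/.*/)(.*)/", url)
print(m.group(0))  # //www.sx.com/found/
print(m.group(1))  # found

# Anchored variant from the comment: group 1 keeps scheme and host,
# and [^/]+ can never cross a slash.
fixed = re.sub(r"^(https?://[^/]*/)[^/]+/", r"\1news/", url)
print(fixed)  # https://www.sx.com/news/text.html
```

re.sub replaces the entire matched span, so anything matched inside the non-capturing group (here the host) is lost unless the replacement string puts it back.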

Tags: python regex

【Solution 1】:

Although you should really use urllib for this kind of task, you can try the pattern

(//.*/).*/

with the replacement

\1news/

See the demo:

https://regex101.com/r/cuNe0j/1
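Applied through re.sub, the pattern and replacement above could be used roughly as follows. This is a sketch under the assumption that the URL has the same shape as in the question; the second URL is a made-up example added to show how the greedy group behaves on a deeper path.

```python
import re

url = "https://www.sx.com/found/text.html"

# Group 1 greedily captures "//", the host and the slash after it;
# the unparenthesized ".*/" then consumes the segment to be replaced.
result = re.sub(r"(//.*/).*/", r"\1news/", url)
print(result)  # https://www.sx.com/news/text.html

# Hypothetical deeper path: the greedy group expands as far as it can,
# so the last directory segment is the one that gets replaced.
deeper = re.sub(r"(//.*/).*/", r"\1news/", "https://www.sx.com/a/b/c.html")
print(deeper)  # https://www.sx.com/a/news/c.html
```

If the segment to replace must always be the first one after the host, a pattern built on [^/] instead of .* is the safer choice.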

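Or you can try this. This way you don't have to handle the URL parsing yourself.

from urllib.parse import urlsplit, urlunsplit  # Python 3 (Python 2: from urlparse import ...)

# Split the URL into scheme, netloc, path, query and fragment
x = urlsplit("https://www.sx.com/found/text.html")
# Rewrite only the path component
y = x.path.replace("found", "news")
print(urlunsplit([x.scheme, x.netloc, y, x.query, x.fragment]))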
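Note that str.replace swaps every occurrence of "found" in the path, including one inside a file name. A possible alternative is to replace a path segment by position. The helper below, `replace_path_segment`, is a hypothetical name introduced for illustration and assumes Python 3.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_path_segment(url, index, new_segment):
    """Return url with the path segment at 0-based position `index` replaced."""
    parts = urlsplit(url)
    # "/found/text.html" -> ["", "found", "text.html"]
    segments = parts.path.split("/")
    # +1 skips the empty string that precedes the leading "/"
    segments[index + 1] = new_segment
    return urlunsplit(parts._replace(path="/".join(segments)))


print(replace_path_segment("https://www.sx.com/found/text.html", 0, "news"))
# https://www.sx.com/news/text.html
```

Query string and fragment, if present, pass through unchanged because only the path component is rebuilt.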

【Comments】:

【Solution 2】:

You can use:

>>> import re
>>> url = 'https://www.sx.com/found/text.html'
>>> print(re.sub(r'(.+/)[^/]+(/[^/]*/?)$', r'\1news\2', url))
https://www.sx.com/news/text.html

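Regex details:

(.+/): greedily matches one or more of any character, followed by a /. Capturing group #1
[^/]+: matches one or more characters that are not /
(/[^/]*/?): matches the next /, followed by zero or more non-/ characters and an optional trailing /. Capturing group #2
$: end of the string

【Comments】: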
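The breakdown above can be checked by printing what each group captures for the URL from the question. A small sketch; the names `pattern` and `m` are illustrative.

```python
import re

url = "https://www.sx.com/found/text.html"
pattern = re.compile(r"(.+/)[^/]+(/[^/]*/?)$")

m = pattern.match(url)
print(m.group(1))  # https://www.sx.com/
print(m.group(2))  # /text.html

# The uncaptured [^/]+ in the middle ("found") is the only part that is dropped.
replaced = pattern.sub(r"\1news\2", url)
print(replaced)  # https://www.sx.com/news/text.html
```

Because the pattern is anchored with $, it always targets the segment just before the final path component, however deep the path is.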