Using Python regex to extract clean URLs

【Question title】: using python regex to extract clean URLs
【Posted】: 2014-11-19 17:34:53
【Question】:

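Thanks! I used @nu11p01n73R's answer from an earlier post, and I now get mostly URLs, but there is still some extra "noise" at the beginning and end. Ideally it would print only the URL, e.g. http://something.some, so the regex should strip the <a href=" at the start of the URL and the " data-metrics='{"action" : "Click Story 2"}'> at its end. I tried modifying the expression to do that, but I ran into trouble because the URL begins and ends with a " character, which I think is messing up my regex. Any suggestions?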

The URLs are embedded in a .txt file like this:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

I would like the output to be:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war

The code I most recently used is:

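import re

with open("/Users/shannonmcgregor/Desktop/npr.txt", 'r') as file:
    for line in file:
        # Matches lines whose <a href=...> tag mentions a keyword, then prints the whole line
        if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
            print(line)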

But this returns, for example:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

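【Comments】:

@AvinashRaj - Nothing wrong with Beautiful Soup (it is beautiful); I'm just trying to use regex because I need to get more comfortable with it, and this problem is good practice.
OK, can you post an example along with the expected output?
Regex is not the right tool for parsing HTML.

Tags: python regex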

【Solution 1】:

Regex is not the right tool for parsing HTML files. Since you intend to use it anyway, here is a solution.

>>> import re
>>> file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
>>> for i in file:
...     if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
...         # Replace the whole line with the captured href value
...         i = re.sub(r'^.*?<a href="([^"]*)".*', r'\1', i)
...         print(i)

Or

>>> for i in file:  # reopen the file first if an earlier loop already consumed it
...     if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
...         print(re.search(r'^.*?<a href="([^"]*)".*', i).group(1))
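The capture-group idea above can be sketched as a self-contained function that works on any iterable of lines, so it can be tried without the file. The helper name extract_urls, the KEYWORDS tuple, and the second sample line are assumptions made for illustration; unlike the loops above, this sketch applies the keyword filter to the captured URL rather than to the raw line.

```python
import re

# Keywords taken from the question; storing them in a tuple is an assumption.
KEYWORDS = ("islamic", "praying", "marines", "comets", "dyslexics")

# Group 1 captures everything between the double quotes of href="...".
HREF_RE = re.compile(r'<a href="([^"]*)"')


def extract_urls(lines, keywords=KEYWORDS):
    """Return href values that contain one of the keywords (hypothetical helper)."""
    keyword_re = re.compile("|".join(map(re.escape, keywords)))
    urls = []
    for line in lines:
        match = HREF_RE.search(line)
        if match and keyword_re.search(match.group(1)):
            urls.append(match.group(1))
    return urls


sample = [
    # Line copied from the question.
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    'data-metrics=\'{"action":"Click Story 1"}\' >\n',
    # Made-up line without any keyword, only to show the filter.
    '<a href="http://www.npr.org/sections/unrelated-story" >\n',
]

for url in extract_urls(sample):
    print(url)
```

Reading group(1) directly avoids rewriting the whole line with re.sub, and it also leaves the trailing newline of the line out of the result.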

【Comments】:

Why not just use group instead of sub?
Yes. That's what I meant.

【Solution 2】:

You can use the re.findall function to extract the content:

import re

with open("/Users/shannonmcgregor/Desktop/npr.txt", 'r') as file:
    for line in file:
        if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
            # The first text found between two double quotes is the href value
            print(re.findall(r'(?<=")[^"]*(?=")', line)[0])

which produces the output:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
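A small sketch of why the [0] index matters, run on the sample line from the question instead of the file. The variable names are illustrative, and the anchored href pattern at the end is an alternative offered here as an assumption, not part of the answer above.

```python
import re

# Sample line copied from the question.
line = (
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    'data-metrics=\'{"action":"Click Story 1"}\' >'
)

# Lookaround version: every run of non-quote characters that sits between two
# double quotes matches, so the list holds more than just the URL.
between_quotes = re.findall(r'(?<=")[^"]*(?=")', line)
first_match = between_quotes[0]

# Anchored version (alternative sketch): only the value of href="..." is captured.
href_values = re.findall(r'href="([^"]*)"', line)

print(first_match)
print(len(between_quotes), "quoted segments in total")
print(href_values)
```

The lookaround pattern relies on the href value being the first quoted text on the line; anchoring on href= drops that assumption.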

【Comments】: