Cleaning scraped URLs in Python

【Question title】: Cleaning scraped url in python
【Posted】: 2016-07-03 09:46:41
【Question description】:

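I am writing a web scraper to collect links from a website. It works, but the output is not clean: it prints relative links that are not complete URLs, and it prints the same link more than once. Here is the code: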

links = re.findall('<a class=.*?href="?\'?([^"\'>]*)', sourceCode)
for link in links:
    print link

This is what the output looks like:

/preferences?hl=en&someting
/preferences?hl=en&someting
/history/something
/history/something
/support?pr=something
/support?pr=something
http://www.web1.com/parameters
http://www.web1.com/parameters
http://www.web2.com/parameters
http://www.web2.com/parameters

I tried to filter out the links that are not absolute http URLs by applying this regex to each link inside the loop:

link = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
print link

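It cleans the URLs, but it wraps each one in square brackets. How can I get the URLs without the brackets? And how do I prevent the same URL from being printed two or more times?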

/preferences?hl=en&someting -> []
http://www.web1.com/parameters -> [http://www.web1.com/parameters]

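【Comments】:

Not a solution, but a tip: since you are using Python anyway, you may want to try Scrapy, which handles all of these requirements (preventing duplicates, building proper URLs, and so on).

Tags: python regex python-2.7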

【Solution 1】:

You get [] around the matches because re.findall returns a list of items, not a single string.

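url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# iterate over set(links), not links, so that each link is handled only once
for link in set(links):
    # re.findall returns a list, so print its items rather than the list itself
    for url in re.findall(url_pattern, link):
        print(url)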

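Note that I wrapped links in set() in the for loop so that only unique links are processed, which prevents the same URL from being printed more than once. A set is unordered, so the links may come out in a different order than they appear on the page.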
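The two points above (findall returns a list, and a set removes duplicates) can be combined in a small helper that also keeps the links in first-seen order. This is only a sketch: the sample `links` list, the simplified `URL_PATTERN`, and the function name `unique_urls` are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical sample input standing in for the scraped links.
links = [
    "/preferences?hl=en&someting",
    "/preferences?hl=en&someting",
    "http://www.web1.com/parameters",
    "http://www.web1.com/parameters",
    "http://www.web2.com/parameters",
]

# Simplified pattern for illustration only (not the regex from the question).
URL_PATTERN = re.compile(r"https?://\S+")


def unique_urls(candidates):
    """Return the absolute URLs found in candidates, without duplicates,
    in the order they were first seen."""
    seen = set()
    result = []
    for candidate in candidates:
        # findall returns a list (possibly empty), so loop over its items
        # instead of printing the list, which is what produced the brackets.
        for url in URL_PATTERN.findall(candidate):
            if url not in seen:
                seen.add(url)
                result.append(url)
    return result


for url in unique_urls(links):
    print(url)
```

Relative links such as /preferences?hl=en&someting produce an empty list from findall and are skipped.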

【Comments】:

【Solution 2】:

Try using

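import re

# sourceCode is the page HTML from the question
links = re.findall('href="(http.*?)"', sourceCode)
# set() drops duplicates; sorted() returns them as an ordered list
links = sorted(set(links))

for link in links:
    print(link)  # print each link, not the whole list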

This gets only the links that start with http, removes duplicates, and sorts them.
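If the relative links (for example /history/something) should be kept rather than dropped, one option is to resolve them against the page address with the standard library's urljoin. The sketch below assumes a hypothetical base URL, http://www.example.com/, and a hypothetical helper name, absolute_links; neither comes from the original thread.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, drop duplicates, and sort."""
    # urljoin leaves absolute URLs unchanged and completes relative ones.
    return sorted(set(urljoin(base_url, href) for href in hrefs))


# Hypothetical input: the base URL is an assumption for illustration.
hrefs = [
    "/history/something",
    "/history/something",
    "http://www.web1.com/parameters",
]

for link in absolute_links("http://www.example.com/", hrefs):
    print(link)
```

In practice the base URL would be the address of the page that was scraped.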

【Comments】: