What are the best ways to sort out source code while scraping using beautifulsoup4: answers

【Question title】: What are the best ways to sort out source code while scraping using beautifulsoup4
【Posted】: 2016-02-17 11:44:45
【Question】:

I am currently learning bs4 and have not found good resources online. I can scrape the entire source of a page, but how do I use tags to extract specific URLs or titles?

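【Comments】:

Could you be more specific about which page you want to scrape and what exactly you want to do?
xossip.com/showthread.php?t=1384077 . I am trying to scrape the source links of the images posted in the forum thread, excluding icons and the like. All the links I need contain "pzy.be".
This question is very similar to this one, which I just answered.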

Tags: python-3.x web-scraping beautifulsoup web-crawler

【Solution 1】:

This should work for you:

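import re
import urllib.request  # Python 3 replacement for the Python 2 urllib2 module
from bs4 import BeautifulSoup

# Download the raw HTML of the forum thread.
page = urllib.request.urlopen('http://www.xossip.com/showthread.php?t=1384077').read()
# The "html5lib" parser is a separate package and must be installed alongside bs4.
soup = BeautifulSoup(page, "html5lib")

# Keep only <img> tags whose src starts with the pzy.be host (the dot is escaped).
img = soup.find_all("img", {"src": re.compile(r'^http://pzy\.be')})

for srcAttr in img:
    print(srcAttr['src'])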
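
The same filtering can be tried without any network access. The following is only a minimal sketch: the HTML string, the image paths in it, and the helper names `extract_image_links` and `extract_title` are made up for illustration and do not come from the real thread. It uses the built-in "html.parser" backend so that nothing beyond bs4 needs to be installed, and it also shows one way to read the page title, which the question asks about.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded forum page.
SAMPLE_HTML = """
<html>
  <head><title>Example thread</title></head>
  <body>
    <img src="http://pzy.be/i/1/photo1.jpg">
    <img src="http://example.com/icons/smile.gif">
    <img src="http://pzy.be/i/2/photo2.jpg">
    <img alt="no src here">
  </body>
</html>
"""


def extract_image_links(html, prefix_pattern):
    """Return the src of every <img> whose src matches prefix_pattern."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(prefix_pattern)
    # Passing a compiled regex as an attribute filter keeps only tags
    # whose attribute value matches; tags without src are skipped.
    return [img["src"] for img in soup.find_all("img", src=pattern)]


def extract_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


links = extract_image_links(SAMPLE_HTML, r"^http://pzy\.be/")
page_title = extract_title(SAMPLE_HTML)
print(links)
print(page_title)
```

Anchoring the pattern with `^` and escaping the dot means that only URLs on the pzy.be host itself match, so icons served from other hosts are left out.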

【Comments】: