Extract links for a certain section only from Blogspot using BeautifulSoup

【Question title】: Extract links for certain section only from blogspot using BeautifulSoup
【Posted】: 2015-09-08 15:05:55
【Question】:

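I am trying to extract links from only a certain section of a Blogspot page. However, the output shows that the code extracts every link on the page.

Here is the code: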

import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"

urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]

    soup = BeautifulSoup(htmltext)

    urls.pop(0)
    print len(urls)

    for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])

print visited

This is the HTML of the section I want to extract:

<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>

Thanks.

【Question discussion】:

Tags: python beautifulsoup web-crawler

【Solution 1】:

If you don't strictly need BeautifulSoup, I think this is easier:

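import feedparser

# Parse the blog's RSS feed; each entry corresponds to one post.
feed = feedparser.parse('http://ellywonderland.blogspot.com/feeds/posts/default?alt=rss')
for entry in feed.entries:
    # entry.link is the URL of the post.
    print(entry.link)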

Output:

http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
http://ellywonderland.blogspot.com/2010/12/kawin.html
http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html
http://ellywonderland.blogspot.com/2010/11/port-dickson.html
http://ellywonderland.blogspot.com/2010/11/ellys-world.html

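feedparser parses the Blogspot page's RSS feed and returns the data you want, in this case the href of each post title.

【Discussion】:

By the way, why does it only extract 25 links?
That is the maximum number of links the RSS feed returns.
Does that mean I can't extract more?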
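
On the last question, one possible way past the 25-entry limit is to request the feed in pages. The sketch below only builds the page URLs; it assumes that Blogger feeds accept the `start-index` (1-based) and `max-results` query parameters, which this thread does not confirm. The helper name `feed_page_urls` and the `total` value are made up for illustration, and the code is written for Python 3.

```python
from urllib.parse import urlencode

# Feed address taken from the answer above, without the query string.
FEED_BASE = "http://ellywonderland.blogspot.com/feeds/posts/default"


def feed_page_urls(total, page_size=25, base=FEED_BASE):
    """Build feed URLs that step through `total` posts, `page_size` at a time.

    Hypothetical helper: it assumes the feed honors `start-index` and
    `max-results`, which is not confirmed by the thread.
    """
    urls = []
    # Start at 1 because the assumed start-index parameter is 1-based.
    for start in range(1, total + 1, page_size):
        query = urlencode({
            "alt": "rss",
            "start-index": start,
            "max-results": page_size,
        })
        urls.append(base + "?" + query)
    return urls


for page_url in feed_page_urls(total=60):
    print(page_url)
```

Each URL could then be passed to `feedparser.parse` as in the answer, stopping once a page comes back with no entries.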

【Solution 2】:

You need to call .get on the object:

print Objecta.get('href')

Example from http://www.crummy.com/software/BeautifulSoup/bs4/doc/:

for link in soup.find_all('a'):
    print(link.get('href'))
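
The loop above walks every `<a>` in the document. In the question's code the inner loop likewise calls `soup.findAll`, so it searches the whole page once per heading. A minimal sketch of restricting the search to the post-title headings follows, written for Python 3. It runs against an inline HTML sample rather than the live blog; the sidebar link in the sample and the function name `post_title_links` are invented for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample modeled on the markup in the question. The sidebar link is
# made up to show what gets filtered out.
SAMPLE_HTML = """
<div class="sidebar"><a href="/p/about.html">About</a></div>
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
</h3>
"""


def post_title_links(html, base_url):
    """Return absolute hrefs of links found inside post-title headings."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # class_="post-title" matches any <h3> whose class list contains it.
    for heading in soup.find_all("h3", class_="post-title"):
        # Search within the heading, not the whole soup.
        for a in heading.find_all("a", href=True):
            links.append(urljoin(base_url, a["href"]))
    return links


print(post_title_links(SAMPLE_HTML, "http://ellywonderland.blogspot.com/"))
```

The key difference from the question's code is that the inner `find_all` is called on `heading` instead of `soup`, so links outside the matched headings are never visited.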

【Discussion】:

Thanks, but it still extracts every link on the Blogspot page. I only need the href from the post-title entry-title headings.