BeautifulSoup find_all does not find all

【Question title】: Beautifulsoup find_all does not find all
【Posted】: 2015-02-05 10:59:45
【Question description】:

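I am currently building a web crawler. I want my code to get the text from every URL I scrape. The function getLinks() finds the links I want data from and puts them in a list. That list currently holds 12 links that look like this: 'http://www.computerstore.nl/product/142504/category-100852/wd-green-wd30ezrx-3-tb.html'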

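Below is the code of my function, which loops over the list of URLs I got from getLinks() and fetches data from each one. The problem is that it sometimes returns the text 6 times, and sometimes 8 or 10 times, but never the 12 times it should.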

def getSpecs(): 
    i = 0 
    while (i < len(clinks)):
        r = (requests.get(clinks[i]))
        s = (BeautifulSoup(r.content))
        for item in s.find_all("div", {"class" :"productSpecs roundedcorners"}):
            print item.find('h3')
        i = i + 1 

getLinks()
getSpecs()

How can I fix this? Please help.

Thanks in advance!

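【Question discussion】:

Why use i or j and a while loop instead of for url in curl:?
Also, why assign item and then overwrite that initial assignment by reusing the name as the loop variable?
Also, if the problem is in getSpecs rather than getLinks, you could give us just getSpecs plus a sample of the URLs that getLinks returns, and your question would be smaller and more focused.
By the way, it would be better for getLinks to return a new list of URLs than to modify a global variable.
...Personally, my guess is that in one of the cases you process, the class you get is not exactly the string productSpecs roundedcorners; maybe it is roundedcorners productSpecs, or productSpecs roundedcorners somethingElse.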
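
The guess in the last comment can be checked offline. The following is a minimal sketch, assuming Python 3 and bs4 with the built-in html.parser; the HTML snippet is made up for illustration and is not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three spec blocks carrying the same two classes,
# once exactly, once reordered, once with an extra class.
HTML = """
<div class="productSpecs roundedcorners"><h3>exact</h3></div>
<div class="roundedcorners productSpecs"><h3>reordered</h3></div>
<div class="productSpecs roundedcorners extra"><h3>extra class</h3></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A multi-word class string is compared against the whole attribute value,
# so only the div whose class attribute is exactly this string matches.
exact = soup.find_all("div", {"class": "productSpecs roundedcorners"})

# A CSS selector tests each class separately, so order and extra classes
# do not matter.
by_selector = soup.select("div.productSpecs.roundedcorners")

exact_titles = [div.h3.get_text() for div in exact]
selector_titles = [div.h3.get_text() for div in by_selector]
print(exact_titles)
print(selector_titles)
```

If the real product pages vary their class attribute in this way, the exact-string lookup would skip some of them, which would be consistent with getting fewer than 12 results.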

Tags: python beautifulsoup web-crawler findall

【Solution 1】:

Here is the improved code, with several fixes:

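Uses a requests.Session that is kept for the whole lifetime of the script
Uses urlparse.urljoin() to join the URL parts
Uses CSS selectors instead of find_all()
Improves the way products are located on the page
Turns the index-based loops into pythonic loops over the list items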
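
Several of these points can be tried without touching the network. The following is a rough sketch, assuming Python 3 (where urljoin lives in urllib.parse) and bs4 with html.parser; the listing markup and the helper name get_links are hypothetical, shaped only to fit the selector used in the answer.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.computerstore.nl"

# Hypothetical listing markup, not copied from the real site.
LISTING_HTML = """
<div class="product-list">
  <a class="product-list-item--image-link" href="/product/1/a.html"></a>
  <a class="product-list-item--image-link" href="http://www.computerstore.nl/product/2/b.html"></a>
  <a class="other" href="/ignored.html"></a>
</div>
"""


def get_links(html, base_url=BASE_URL):
    """Return absolute product URLs from a listing page (no globals)."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select("div.product-list a.product-list-item--image-link")
    # urljoin resolves relative hrefs against base_url and leaves
    # already-absolute hrefs unchanged.
    return [urljoin(base_url, a["href"]) for a in anchors]


links = get_links(LISTING_HTML)
for link in links:  # iterate over the items instead of indexing with a counter
    print(link)
```

Returning the list from the function, rather than filling a global, also lets the link extraction be checked on its own.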

Code:

from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.computerstore.nl'
curl = ["http://www.computerstore.nl/category/100852/interne-harde-schijven.html?6437=19598"]

session = requests.Session()
for url in curl:
    soup = BeautifulSoup(session.get(url).content)
    links = [urljoin(base_url, item['href']) for item in soup.select("div.product-list a.product-list-item--image-link")]

    for link in links:
        soup = BeautifulSoup(session.get(link).content)
        print soup.find('span', itemprop='name').get_text(strip=True)

It grabs each product link, follows it, and prints the product title (12 products):

WD Red WD20EFRX 2 TB
WD Red WD40EFRX 4 TB
WD Red WD30EFRX 3 TB
Seagate Barracuda ST1000DM003 1 TB
WD Red WD10EFRX 1 TB
Seagate Barracuda ST2000DM001 2 TB
Seagate Barracuda ST3000DM001 3 TB
WD Green WD20EZRX 2 TB
WD Red WD60EFRX 6 TB
WD Green WD40EZRX 4 TB
Seagate NAS HDD ST3000VN000 3 TB
WD Green WD30EZRX 3 TB
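
The answer's code is written for Python 2 (urlparse module, print statement). A possible Python 3 adaptation is sketched below; it is not the answerer's code. The function name product_titles, the explicit html.parser argument, and the choice to skip pages that lack the expected span are assumptions of this sketch, and whether the 2015 selectors still match the live site is not verified here.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.computerstore.nl"
CATEGORY_URL = (
    "http://www.computerstore.nl/category/100852/"
    "interne-harde-schijven.html?6437=19598"
)


def product_titles(session, category_url, base_url=BASE_URL):
    """Yield the title of each product linked from a category page."""
    listing = BeautifulSoup(session.get(category_url).content, "html.parser")
    for a in listing.select("div.product-list a.product-list-item--image-link"):
        link = urljoin(base_url, a["href"])
        page = BeautifulSoup(session.get(link).content, "html.parser")
        name = page.find("span", itemprop="name")
        # Assumption of this sketch: skip pages without the expected
        # element instead of raising AttributeError on None.
        if name is not None:
            yield name.get_text(strip=True)


def main():
    # One Session is reused for every request, as in the answer.
    with requests.Session() as session:
        for title in product_titles(session, CATEGORY_URL):
            print(title)
```

Calling main() performs the actual crawl. Passing the session in as a parameter keeps product_titles usable with any object that offers a compatible get() method.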

【Discussion】: