【Question Title】: How to use BeautifulSoup to scrape
【Posted】: 2019-04-24 08:32:29
【Question】:

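The goal of the script is to visit a website and then use selenium's get_attribute to build a list of links for all of the products.

Using requests, I then visit each of the generated links to reach each product page, and I try to scrape it with BeautifulSoup, storing each characteristic in its own variable.

My problem is that I believe some of the products I am trying to scrape do not have every class I am looking for, although I believe most of them do. For products that lack one of the characteristics I am scraping, is there a way to return something like "N/A"?

Here is my code: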

import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup

all_product = []

url = "https://www.vatainc.com/infusion.html?limit=all"
service = service.Service('/Users/Jonathan/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]

for link in links:
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll("html")

    for product in products:
        name = product.find("div", {"class": "product-name"}).text.strip('\n\r\t": ')
        manufacturing_SKU = product.find("span", {"class": "i-sku"}).text.strip('\n\r\t": ')
        manufacturer = product.find("p", {"class": "manufacturer"}).text.strip('\n\r\t": ')
        description = product.find("div", {"class": "std description"}).text.strip('\n\r\t": ')
        included_products = product.find("div", {"class": "included_parts"}).text.strip('\n\r\t": ')
        price = product.find("span", {"class": "price"}).text.strip('\n\r\t": ')
        all_product.append([name, manufacturing_SKU, manufacturer, description, included_products, price])
print(all_product)

This is the error I get:

 AttributeError                            Traceback (most recent call last)
<ipython-input-25-36feec64789d> in <module>()
     34         manufacturer = product.find("p", {"class": "manufacturer"}).text.strip('\n\r\t": ')
     35         description = product.find("div", {"class": "std description"}).text.strip('\n\r\t": ')
---> 36         included_products = product.find("div", {"class": "included_parts"}).text.strip('\n\r\t": ')
     37         price = product.find("span", {"class": "price"}).text.strip('\n\r\t": ')
     38         all_product.append([name, manufacturing_SKU, manufacturer, description, included_products, label, price])

AttributeError: 'NoneType' object has no attribute 'text'

【Comments】:

    Tags: python selenium beautifulsoup python-requests


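    【Solution 1】:

    When the find() method on your BeautifulSoup object cannot find a DOM element matching your query, it returns None. Specifically, on the included_products line, it cannot find a div with the class included_parts, so calling .text on the None result raises the AttributeError.

    In that case, you can do the following to get a None value for included_products instead: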

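    def find_with_class(soup, tag_type, class_name):
        """Return the stripped text of the first matching tag, or None if there is no match."""
        element = soup.find(tag_type, {'class': class_name})
        # find() returns None when nothing matches, so check before touching .text
        if element is not None:
            return element.text.strip('\n\r\t": ')
        return None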
    
    included_products = find_with_class(product, 'div', 'included_parts')
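
Since the question asks for something like "N/A" rather than None, the same check can take a default value. The following is a minimal sketch, not part of the original answer: the helper name text_or_default and the STRIP_CHARS constant are made up for illustration, and a SimpleNamespace with a text attribute stands in for the tag that find() would return, so the snippet runs without fetching any page.

```python
from types import SimpleNamespace

# Same characters the question's code strips from each field.
STRIP_CHARS = '\n\r\t": '


def text_or_default(element, default="N/A"):
    """Return the stripped text of a find() result, or `default` when it is None."""
    if element is None:
        return default
    return element.text.strip(STRIP_CHARS)


# Stand-ins for what product.find(...) gives back:
# a tag-like object with a .text attribute, or None when nothing matched.
found = SimpleNamespace(text='\n\t"Sample Product": \r\n')
missing = None

row = [text_or_default(found), text_or_default(missing)]
print(row)
```

In the loop from the question, each field would then be read along the lines of text_or_default(product.find("div", {"class": "included_parts"})), so a product without that div contributes "N/A" to its row instead of raising the AttributeError.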
    

    【Comments】:
