Beautiful Soup Python findAll returning empty list

【Question title】: Beautiful Soup Python findAll returning empty list
【Posted】: 2020-10-30 04:25:07
【Question description】:

I'm trying to scrape an Amazon Alexa skill: https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1

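For now, I'm just trying to get the name of the skill (Paypal), but for some reason this returns an empty list. I've looked at the page with Inspect Element and I know it should give me the name, so I'm not sure what is going wrong. My code is below: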

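from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# skill_url and request_headers are defined elsewhere (not shown in the question)
request = Request(skill_url, headers=request_headers)
response = urlopen(request)
html = response.read().decode()
soup = BeautifulSoup(html, 'html.parser')

# Returns [] when no matching <h1> exists in the fetched HTML
name = soup.find_all("h1", {"class" : "a2s-title-content"})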

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

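The page content is loaded with JavaScript, so you can't scrape it with BeautifulSoup alone. You have to use another module, such as selenium, to simulate JavaScript execution.

Here is an example: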

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url='YOUR URL'

# Start a real browser so the page's JavaScript runs
driver = webdriver.Firefox()
driver.get(url)

# Grab the rendered HTML, then close the browser
page = driver.page_source
driver.quit()
page_soup = soup(page,'html.parser')

containers = page_soup.find_all("h1", {"class" : "a2s-title-content"})
print(containers)
print(len(containers))

You can also use chrome-driver or edge-driver instead of Firefox.
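
Before starting a browser, it may be worth checking whether the markup is already in the raw response, since that tells you whether JavaScript rendering is really the cause. The following is only a rough diagnostic sketch: `describe_response` is a hypothetical helper, and the idea that a block page contains the word "captcha" is a heuristic assumption, not documented Amazon behaviour.

```python
def describe_response(html, class_name="a2s-title-content"):
    """Give a rough hint about why find_all() came back empty.

    Hypothetical helper: it only does substring checks on the raw HTML.
    """
    if class_name in html:
        # The markup is in the raw response, so a parser should find it.
        return "present"
    if "captcha" in html.lower():
        # Assumption: a block page mentions the word "captcha" somewhere.
        return "captcha"
    # Neither found: the element may be rendered by JavaScript,
    # or the server returned a different page altogether.
    return "absent"
```

If this returns "present" for the HTML you downloaded, selenium is probably not needed and the selector is the thing to look at.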

【Comments】:

【Solution 2】:

Try setting the User-Agent and Accept-Language HTTP headers to prevent the server from sending you a CAPTCHA page:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
name = soup.find("h1", {"class" : "a2s-title-content"})
print(name.get_text(strip=True))

Prints:

PayPal
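
Note that `soup.find(...)` returns `None` when nothing matches, so `name.get_text()` would raise an `AttributeError` if the server sent a CAPTCHA page anyway. A minimal sketch of a guard, assuming the same `a2s-title-content` class from the question; `extract_title` is a hypothetical helper name and it uses the built-in `html.parser` so that lxml is not required:

```python
from bs4 import BeautifulSoup


def extract_title(html, class_name="a2s-title-content"):
    """Return the stripped text of the first matching <h1>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1", class_=class_name)
    # find() gives None when no element matches, so check before get_text()
    if tag is None:
        return None
    return tag.get_text(strip=True)
```

You could call it as `extract_title(requests.get(url, headers=headers).text)` and treat a `None` result as a sign that the response was not the page you expected.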

【Comments】:

Thank you so much, this solved all my problems!