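Can't find text from page using Python BS4

【Question title】: Can't find text from page using Python BS4
【Posted】: 2019-08-15 00:06:05
【Question】:

I'm trying to learn how to use BS4, but I ran into this problem. I'm trying to find the text on a Google search results page that shows the number of search results, but I can't find the text "results" in either html_page or the soup built with the HTML parser. Here is the code: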

from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

print(b'results' in html_page)
print('results' in soup)

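Both prints return False. What am I doing wrong, and how can I fix it?

Edit:

It turned out the language of the page was the problem; adding &hl=en to the URL almost solved it.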

url = 'https://www.google.com/search?q=stack&hl=en'

The first print is now True, but the second is still False.
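
A note on why the second print can stay False even when the text is on the page: `in` applied to a BeautifulSoup object is a membership test against the soup's direct child nodes, not a substring search over the document. The sketch below uses a made-up HTML snippet in place of the real Google response, so the markup is an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup differs.
html_page = b"<html><body><div id='stats'>About 10 results</div></body></html>"
soup = BeautifulSoup(html_page, "html.parser")

in_bytes = b"results" in html_page      # substring search over the raw bytes
in_soup = "results" in soup             # membership test against soup.contents
in_text = "results" in soup.get_text()  # substring search over the extracted text

print(in_bytes, in_soup, in_text)  # True False True
```

Searching `soup.get_text()` (or locating a specific element and reading its text) is the closer equivalent of the bytes check on `html_page`.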

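【Question discussion】:

The first one works for me (the second line will normally print False). Have you tried printing html_page? That would tell you. You may be getting a CAPTCHA.
Google is not a good example for learning to parse HTML. They make heavy use of AJAX to build the page and have several anti-scraping measures.
@Selcuk Yes, I tried printing the page, and it looks like HTML code.
Good luck. Note that they change the page, sometimes several times a day, to make this as hard as possible. They want you to use their API (and put in some coins).
@GustavoMaia It will always look like HTML code. The question is whether it is the HTML code you expect.

Tags: python beautifulsoup python-requests

【Solution 1】:

The requests library returns the response in raw form (bytes) when you read it as response.content. So, to answer your second question, replace res.content with res.text.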

from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')

# `in` on a soup only checks its direct children, so search the extracted text
print('results' in soup.get_text())

Output: True
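
To make the bytes/str distinction concrete without touching the network: res.content is a bytes object, and res.text is the same body decoded to str. The sketch below fakes the body with a literal; the sample string and the UTF-8 encoding are assumptions for illustration.

```python
# Stand-in for res.content: the raw body as bytes (sample text is made up).
content = "About 10 results".encode("utf-8")
# Stand-in for res.text: the same body decoded to str.
text = content.decode("utf-8")

print(b"results" in content)  # bytes needle in a bytes haystack
print("results" in text)      # str needle in a str haystack

# Mixing the two types raises an error rather than quietly returning False.
try:
    "results" in content
except TypeError as exc:
    mixed_error = type(exc).__name__
    print(mixed_error)
```

This is why the question's first check, which pairs b'results' with res.content, is a valid test as written.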

Note that Google is usually very aggressive toward scrapers. To avoid being blocked or served a CAPTCHA, you can add a user agent to imitate a browser:

# This is a standard user-agent of Chrome browser running on Windows 10 
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }

Example:

from bs4 import BeautifulSoup
import requests 
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text 
soup = BeautifulSoup(resp, 'html.parser') 
# ... <your code here>

In addition, you can send a fuller set of headers to pass as a legitimate browser. Add more headers like this:

headers = { 
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip', 
'DNT' : '1', # Do Not Track Request Header 
'Connection' : 'close'
}

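【Discussion】:

【Solution 2】:

It's not because res.content should be changed to res.text, as 0xInfection mentioned; it would still return the result.

However, in some cases it will only return bytes content if it's not gzip or deflate transfer-encodings, which are automatically decoded by requests to a readable format (correct me in the comments or edit this answer if I'm wrong).

It's because no user-agent is specified, so Google will eventually block the request: the default requests user-agent is python-requests, and Google understands that it's a bot/script. Learn more about request headers.

Pass a user-agent in the request headers: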

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
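
To check what requests would send without actually contacting Google, one option is to build the request and inspect it before sending. This is a minimal sketch: the query values are placeholders and the User-Agent string is simply copied from the snippet above.

```python
import requests

# The User-Agent that requests uses when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # begins with "python-requests/"

# Browser-like value copied from the answer; any current browser string would do.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Build the request without sending it, so nothing goes over the network.
session = requests.Session()
request = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "stack", "hl": "en"},  # placeholder query and language
    headers=headers,
)
prepared = session.prepare_request(request)

print(prepared.url)                    # query string built from params
print(prepared.headers["User-Agent"])  # the override, not the default
```

Calling `session.send(prepared)` would then perform the actual request with exactly these headers.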

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query 
  "gl": "us",                    # country to make request from
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params).content
soup = BeautifulSoup(response, 'lxml')

number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 114,000 results
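
Since Google changes its markup often and may return a block page instead, `select_one` can return None, in which case `.previous_sibling` raises AttributeError. A defensive variant might look like the sketch below. The HTML fragment is invented to match the shape the selector above expects, and `extract_result_count` is a hypothetical helper name, not part of bs4.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment shaped like the markup the selector above expects.
sample_html = (
    '<div id="result-stats">About 114,000 results'
    "<nobr> (0.52 seconds)</nobr></div>"
)


def extract_result_count(soup):
    """Return the result count as an int, or None if the element is absent."""
    nobr = soup.select_one("#result-stats nobr")
    if nobr is None or nobr.previous_sibling is None:
        return None
    # Pull the first run of digits and commas out of the sibling text.
    match = re.search(r"\d[\d,]*", str(nobr.previous_sibling))
    return int(match.group().replace(",", "")) if match else None


count = extract_result_count(BeautifulSoup(sample_html, "html.parser"))
missing = extract_result_count(BeautifulSoup("<p>blocked</p>", "html.parser"))
print(count, missing)  # 114000 None
```

Returning None lets the caller tell "element not found" apart from a real count and decide whether to retry or inspect the response.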

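Alternatively, you can achieve the same thing with the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to extract the data you want, without having to think about how to extract it or how to bypass blocks from Google or other search engines, since that is already done for the end user.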

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 112000

Disclaimer: I work for SerpApi.

【Discussion】: