Unable to scrape Google results

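【Question title】: Unable to scrape Google results
【Posted】: 2020-08-31 11:32:46
【Question】:

I'm new to Python and am learning from Automate the Boring Stuff with Python; I'm currently on the book's web scraping chapter. I only want to scrape the titles of the search results. Here is my code: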

import requests
from bs4 import BeautifulSoup
import webbrowser

term = 'python'
req = requests.get('https://www.google.com/search?q=' + term)
req.raise_for_status()

soup = BeautifulSoup(req.text, 'lxml')
title = soup.find('div', class_ = 'r')

print(title)

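The problem is that it always returns None. I even attached a screenshot of the Inspect Element tool so you can see the div and class name I'm using.

Any help is appreciated, thanks.

【Comments】:

Tags: python web-scraping beautifulsoup python-requests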

【Solution 1】:

To get the correct response from the server, specify the User-Agent HTTP header:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

term = 'python'
req = requests.get('https://www.google.com/search?q=' + term, headers=headers)
req.raise_for_status()

soup = BeautifulSoup(req.content, 'lxml')
title = soup.find('div', class_ = 'r')

print(title.get_text(strip=True, separator=' '))

Prints:

Welcome to Python.org www.python.org www.python.org ...
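
Since soup.find returns None when nothing matches (the symptom in the question), calling get_text on the result directly can raise AttributeError. The following is a minimal sketch of a guarded lookup, not part of the original answer. The helper name first_title and the SAMPLE_HTML snippet are hypothetical and only for illustration; real Google markup differs and changes over time, so the class name 'r' may no longer match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; it is not real Google HTML.
SAMPLE_HTML = (
    '<div class="r"><a href="https://www.python.org/">'
    '<h3>Welcome to Python.org</h3></a></div>'
)


def first_title(html, class_name='r'):
    """Return the text of the first <div> with the given class, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.find('div', class_=class_name)
    if node is None:
        # Nothing matched: report that instead of failing on None.get_text().
        return None
    return node.get_text(strip=True, separator=' ')


print(first_title(SAMPLE_HTML))
print(first_title('<p>no results here</p>'))
```

Checking for None before calling get_text makes it easy to tell "the selector did not match" apart from a crash.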

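【Discussion】:

Thank you very much, it worked. With a bit of tweaking I got what I wanted. But could you explain what this headers is? How does it work?
@default-303 headers holds the HTTP headers: developer.mozilla.org/en-US/docs/Web/HTTP/Headers. Google blocks requests that use a non-browser user agent.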
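
To make the headers explanation concrete, here is a small sketch that builds two requests without sending them and compares the User-Agent each would carry. It assumes the requests library from the answer; the variable names plain and custom are illustrative, and passing the search term through params instead of string concatenation is a substitution made here so the query is URL-encoded automatically.

```python
import requests

# What requests identifies itself as when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# Browser-like value copied from the answer above.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) '
                  'Gecko/20100101 Firefox/79.0'
}

session = requests.Session()
url = 'https://www.google.com/search'

# prepare_request builds the final request without sending anything,
# so the headers can be inspected offline.
plain = session.prepare_request(
    requests.Request('GET', url, params={'q': 'python'})
)
custom = session.prepare_request(
    requests.Request('GET', url, params={'q': 'python'}, headers=headers)
)

print(plain.headers['User-Agent'])   # the library default
print(custom.headers['User-Agent'])  # the browser-like override
print(custom.url)                    # query string built from params
```

The first printed value is the library's own identifier, which is what the server sees when no header is set; the second is the override passed through headers.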