【Question Title】: Can't parse a Google search result page using BeautifulSoup
【Posted】: 2020-06-19 20:04:51
【Question】:

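I am parsing web pages in Python with BeautifulSoup from bs4. When I inspected the elements of a Google search page, this was the div containing the first result:

Since it has class = 'r', I wrote this code: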

import requests
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5')
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)

But the command prompt only printed [].

What could be wrong, and how can I fix it?

Also, here's the webpage.

Edit 1: I edited my code accordingly by adding a headers dictionary, but the result is the same: []. Here is the new code:

import requests
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5', headers=headers)
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)

Note: When I tell it to print the whole page, there is no problem, and list(page.children) also works fine.

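【Comments】:

  • You need to pass the User-Agent header to requests.get through the optional headers argument, where headers is a dictionary of HTTP request headers.
  • So I should change the second line to: site = requests.get('[page link]', headers = headers) ?
  • Yes, headers is a dictionary of headers.
  • I don't really understand what you mean by a dictionary. A link to an explanation, perhaps?
  • Like {'User-Agent': '[Stuff]'}. Also, you can get Firefox/Chrome user agents from the Mozilla website.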

Tags: python parsing beautifulsoup google-search


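【Solution 1】:

Some websites require the User-Agent header to be set, in order to block bogus requests from non-browser clients. Fortunately, there is a way to pass headers to the request: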

import requests

# Define a dictionary of HTTP request headers
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# Pass the headers as a keyword argument.
# url is assumed to hold the search URL from the question.
site = requests.get(url, headers=headers)

Note: see here for a list of user agents.
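To see whether a header change made any difference, it can help to count how many matching elements the returned HTML actually contains. The following is a minimal sketch; the helper name count_divs and the sample markup are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup


def count_divs(html: str, class_name: str) -> int:
    """Return how many <div> elements in html carry the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=class_name))


# Hypothetical markup standing in for a fetched page
sample = (
    '<div class="r"><a href="https://example.com">Example</a></div>'
    '<div class="other"></div>'
)

print(count_divs(sample, "r"))        # one div has class "r"
print(count_divs(sample, "missing"))  # no div has this class
```

With a real response, the same helper could be called as count_divs(site.text, 'r'). A count of 0 suggests that the markup served to the script does not use that class at all, rather than that the parser failed.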

【Comments】:

  • @saumayr That's odd. Mine works fine. Try printing site.content and check whether the divs are in there.
【Solution 2】:
>>> give_me_everything = soup.find_all('div', class_='yuRUbf')
Prints a bunch of stuff.
>>> give_me_everything_v2 = soup.select('.yuRUbf')
Prints a bunch of stuff.

Note that you can't do this, because find_all returns a list of elements rather than a single element:

>>> give_me_everything = soup.find_all('div', class_='yuRUbf').text
AttributeError: You're probably treating a list of elements like a single element.
Iterate over the results instead:

>>> for result in soup.find_all('div', class_='yuRUbf'):
...     print(result.text)
Prints a bunch of stuff.
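Building on the loop above, here is a hedged sketch that pulls a title and a link out of each container. It assumes every yuRUbf div wraps an a element that contains an h3, which matches the sample markup below but may not match what Google serves at any given time. The function name extract_results is hypothetical.

```python
from typing import Dict, List

from bs4 import BeautifulSoup


def extract_results(html: str, container_class: str = "yuRUbf") -> List[Dict[str, str]]:
    """Collect title/link pairs from result containers, skipping ones without a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for container in soup.select(f"div.{container_class}"):
        link = container.find("a", href=True)
        if link is None:
            continue  # container without an anchor: nothing to extract
        heading = container.find("h3")
        title = heading.get_text(strip=True) if heading else link.get_text(strip=True)
        results.append({"title": title, "link": link["href"]})
    return results


# Hypothetical markup imitating the assumed structure
sample = """
<div class="yuRUbf"><a href="https://example.com/a"><h3>First result</h3></a></div>
<div class="yuRUbf"><a href="https://example.com/b"><h3>Second result</h3></a></div>
<div class="yuRUbf"><span>no link here</span></div>
"""

for item in extract_results(sample):
    print(item["title"], item["link"])
```

Because the class name is a parameter, the same function can be reused if the class changes, as it did between 'r' in the question and 'yuRUbf' in this answer.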

Code:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    # Adjacent string literals are concatenated, so the first one ends with a space
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q="narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

give_me_everything = soup.find_all('div', class_='yuRUbf')
print(give_me_everything)
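The query above contains spaces and double quotes. As a sketch of one way to avoid hand-encoding such a query, the standard library's urlencode can build the query string. The helper name build_search_url is an assumption introduced for illustration.

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Return a Google search URL with the query percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


url = build_search_url('"narendra modi" "scams" -lalit -nirav')
print(url)
```

Alternatively, requests can do the encoding itself when the query is passed as requests.get('https://www.google.com/search', params={'q': query}, headers=headers).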

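Alternatively, you can use the Google Search Engine Results API from SerpApi to do the same thing. It is a paid API with a free trial of 5,000 searches.

The main difference is that you don't have to come up with a different solution when something stops working, so you don't have to maintain the parser.

Code to integrate: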

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": '"narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav',
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    print(f'{title}\n{link}\n{displayed_link}\n')

----------
Opposition Corners Modi Govt On Jay Shah Issue, Rafael ...
https://www.outlookindia.com/website/story/no-confidence-vote-opposition-corners-modi-govt-on-jay-shah-issue-rafael-deals-c/313790
https://www.outlookindia.com

Modi, Rahul and Kejriwal describe one another as frauds ...
https://www.business-standard.com/article/politics/modi-rahul-and-kejriwal-describe-one-another-as-frauds-114022400019_1.html
https://www.business-standard.com
...
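The loop above indexes results['organic_results'] directly, which raises KeyError if that key is absent. Below is a defensive sketch of the same formatting step. It assumes only that the response is a plain dict shaped like the output above; the function name format_organic_results and the sample data are hypothetical.

```python
from typing import List


def format_organic_results(results: dict) -> List[str]:
    """Format each organic result as title, link and displayed link on separate lines."""
    formatted = []
    for result in results.get("organic_results", []):
        title = result.get("title", "")
        link = result.get("link", "")
        displayed_link = result.get("displayed_link", "")
        formatted.append(f"{title}\n{link}\n{displayed_link}")
    return formatted


# Hypothetical response data with the same keys as the code above
sample = {
    "organic_results": [
        {
            "title": "Example",
            "link": "https://example.com/page",
            "displayed_link": "https://example.com",
        }
    ]
}

print("\n\n".join(format_organic_results(sample)))
print(format_organic_results({}))  # empty list when the key is missing
```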

Disclaimer: I work for SerpApi.
