How to print the number of Google search results (BeautifulSoup): answers

【Question title】: How to print the number of google search results (Beautifulsoup)
【Posted】: 2020-04-06 16:25:55
【Question description】:

This is what I have done so far:

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=programming"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib')

table = soup.find('div', attrs = {'id':'result-stats'}) 

print(table)

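I want it to return the number of results as an integer, i.e. the number 1350000000.

【Discussion】:

Why don't you use this: pypi.org/project/google-search?
Does this answer your question? perform a google search and return the number of results
No, it gives me an error: the following arguments are required: word
What output are you currently getting?

Tags: python python-3.x beautifulsoup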

【Solution 1】:

You are missing the User-Agent header, a string that tells the server what kind of client or device is requesting the page.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
URL     = "https://www.google.com/search?q=programming"
result = requests.get(URL, headers=headers)    

soup = BeautifulSoup(result.content, 'html.parser')

# Outer text of the div only, e.g. 'About 1,410,000,000 results'
total_results_text = soup.find("div", {"id": "result-stats"}).find(string=True, recursive=False)
# Keep only the digit characters and convert to an integer, as the question asks
results_num = int(''.join(ch for ch in total_results_text if ch.isdigit()))
print(results_num)
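Joining every digit in the string works as long as the text contains only the count. If the string also carries other numbers, such as the "(0.38 seconds)" timing, those digits get merged in as well. A minimal sketch of a stricter parser follows. The helper name `parse_result_count` and the assumption that the first number in the text is the count are mine, not part of the answer:

```python
import re


def parse_result_count(stats_text: str) -> int:
    """Extract the first grouped number from text like 'About 1,410,000,000 results'.

    Assumption: the count is the first number in the string and uses
    ',', '.' or whitespace as its thousands separator.
    """
    match = re.search(r"\d+(?:[.,\s]\d{3})*", stats_text)
    if match is None:
        raise ValueError(f"no number found in {stats_text!r}")
    # Drop the separators and convert what is left to an integer
    return int(re.sub(r"\D", "", match.group(0)))


print(parse_result_count("About 1,410,000,000 results"))
print(parse_result_count("About 107,000 results (0.38 seconds)"))
```

The function takes plain text, so it can be fed the string obtained from the `result-stats` div above.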

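【Discussion】:

Thanks, this is exactly what I wanted!
I put the code in a for loop, and after a while I got this error: AttributeError: 'NoneType' object has no attribute 'find'. Do you know why?
@Daniel To get better help, I suggest you post this as a new question.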
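On the AttributeError raised in the comments: `soup.find(...)` returns `None` when no matching element exists, so chaining `.find` onto it fails. One plausible cause (an assumption, the thread does not confirm it) is that after many requests Google serves a different page, such as a CAPTCHA, that has no `result-stats` div. A minimal sketch of a guard, using a hypothetical helper `extract_stats_text` and made-up sample HTML:

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_stats_text(html: str) -> Optional[str]:
    """Return the outer text of the #result-stats div, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    stats = soup.find("div", {"id": "result-stats"})
    if stats is None:
        # Element missing: the page is not a normal results page
        return None
    text = stats.find(string=True, recursive=False)
    return str(text).strip() if text is not None else None


# Hypothetical markup, only to show both outcomes without a network call
print(extract_stats_text('<div id="result-stats">About 10 results<nobr> (0.3 seconds)</nobr></div>'))
print(extract_stats_text("<html><body>unusual traffic detected</body></html>"))
```

Inside a loop, a `None` return can be used to skip the query or stop and wait instead of crashing.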

【Solution 2】:

This code will do the trick:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
result = requests.get("https://www.google.com/search?q=programming", headers=headers)

src = result.content
soup = BeautifulSoup(src, 'lxml')

print(soup.find("div", {"id": "result-stats"}))

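【Discussion】:

Adding a header doesn't prevent you from being detected as a bot.
@Ahmed Soliman Then what is it for? I'll edit my answer accordingly.
User-Agent is a string that tells the server what kind of device you are using to access the page. If you overload the server with too many requests, you will be blocked even though you are sending a User-Agent header.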
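Since the comment above points to request volume as the reason for getting blocked, spacing the requests out is one way to reduce load. The sketch below is only an illustration: `fetch_politely`, its parameters, and the delay values are assumptions, and a delay does not guarantee that you won't be blocked.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_politely(
    queries: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 2.0,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(query) for each query, pausing a random time between calls."""
    results = []
    for index, query in enumerate(queries):
        if index > 0:
            # Random pause so the requests are not sent in a tight loop
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(query))
    return results
```

Here `fetch` would be a function that wraps `requests.get` with the headers shown in the answers. Passing `sleep` as a parameter makes the helper easy to try out without actually waiting.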

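【Solution 3】:

If you only need to extract one element, use the bs4 select_one() method. It is a bit more readable than find(), and arguably faster. See the CSS selectors reference.

If you need to extract data very quickly, try selectolax, a wrapper around the lexbor HTML Renderer library, which is written in pure C with no dependencies, and it's fast.

Code and example in the online IDE: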

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query
  "gl": "us",                    # country 
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params)
soup = BeautifulSoup(response.text, 'lxml')

# .previous_sibling will go to, well, previous sibling removing unwanted part: "(0.38 seconds)"
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 107,000 results
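To see why `.previous_sibling` drops the timing part, it helps to look at the markup shape the selector relies on. The HTML below is a hand-written approximation (an assumption, the real page may differ), and `stats_without_timing` is a hypothetical helper:

```python
from typing import Optional

from bs4 import BeautifulSoup

# Assumed shape of the stats element: count text followed by a <nobr> with the timing
SAMPLE_HTML = '<div id="result-stats">About 107,000 results<nobr> (0.38 seconds)&nbsp;</nobr></div>'


def stats_without_timing(html: str) -> Optional[str]:
    """Return the text node just before the <nobr> timing element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    nobr = soup.select_one("#result-stats nobr")
    if nobr is None or nobr.previous_sibling is None:
        return None
    # previous_sibling is the text node that sits right before <nobr>
    return str(nobr.previous_sibling).strip()


print(stats_without_timing(SAMPLE_HTML))
```

The `None` check matters because `select_one()` returns `None` when nothing matches the selector.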

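Alternatively, you can use the Google Organic Results API from SerpApi to achieve the same thing. It's a paid API with a free plan.

The difference in your case is that you only need to pick the data you want out of structured JSON, rather than figuring out how to extract certain elements or how to get around Google's blocking.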

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 107000
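The lookup above raises a KeyError if the response has no `search_information` entry. A minimal sketch of a more defensive read follows. It relies only on the two keys used in the answer, and the helper name `total_results` and the sample dictionaries are made up for illustration:

```python
from typing import Any, Mapping, Optional


def total_results(results: Mapping[str, Any]) -> Optional[int]:
    """Return search_information.total_results as an int, or None if missing."""
    info = results.get("search_information") or {}
    value = info.get("total_results")
    return int(value) if value is not None else None


# Hypothetical response fragments, not real API output
print(total_results({"search_information": {"total_results": 107000}}))
print(total_results({}))
```

With this, `total_results(search.get_dict())` would return `None` instead of raising when the field is absent.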

P.S. I wrote a blog post about how to scrape Google Organic Results.

Disclaimer: I work for SerpApi.

【Discussion】: