Using BeautifulSoup to scrape URLs from a Google Search

【Question title】: Using BeautifulSoup to scrape urls from a Google Search
【Posted】: 2019-06-09 19:58:37
【Question】:

My code is:

import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'LastName, FirstName'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

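I want to get the URL of the first search result. How can I do that?

【Comments】:

Your first search result may differ from someone else's; please post the search result you expect as your target output.

Tags: python-3.x beautifulsoup

【Solution 1】:

You may want to consider Selenium for this task: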

import urllib.parse
from bs4 import BeautifulSoup
from selenium import webdriver


text = 'LastName, FirstName'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

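# Drive a real browser so the page is rendered as a normal visitor would see it
driver = webdriver.Firefox()
driver.get(url)

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# 'srg' is the results container class this answer relied on;
# Google's markup changes, so the selector may need updating
results_links = soup.find('div',{'class':'srg'}).find_all("a")
print(results_links[0].get('href'))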

Output:

https://www.quora.com/What-is-meant-by-first-name-and-last-name

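【Comments】:

【Solution 2】:

One likely reason is that no user-agent is specified. By default, requests identifies itself as python-requests, so Google blocks the request and you receive different HTML with different selectors. Read more about the HTTP request header.

Pass a user-agent in the request headers: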

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)

To get only the first link, use the bs4 select_one() method (CSS selectors reference):

first_link = soup.select_one(".yuRUbf a")['href']
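
Note that select_one() returns None when nothing matches, which is what happens if Google serves a blocked or differently structured page, and indexing None with ['href'] raises a TypeError. The following is a minimal sketch of a guarded lookup, not part of the original answer. The SAMPLE_HTML string and the first_result_link helper are hypothetical stand-ins so the example runs offline; real Google markup differs and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a results page; not real Google markup.
SAMPLE_HTML = """
<div class="yuRUbf"><a href="https://example.com/first">First</a></div>
<div class="yuRUbf"><a href="https://example.com/second">Second</a></div>
"""


def first_result_link(html, selector=".yuRUbf a"):
    """Return the href of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    if tag is None:
        # No match: the page was probably blocked or uses other class names.
        return None
    return tag.get("href")


print(first_result_link(SAMPLE_HTML))       # https://example.com/first
print(first_result_link("<p>blocked</p>"))  # None
```

Returning None instead of raising lets the caller decide whether to retry, change the selector, or report the failure.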

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "what is the best minecraft skin in 2021",  # query
  "gl": "uk"                                       # country to search from
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# gets the very first link from the search results
first_link = soup.select_one(".yuRUbf a")['href']
print(first_link)

# https://codakid.com/minecraft-skins/
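
To see what the params dictionary above turns into, the query string can be built by hand with the standard library. This is only an illustrative sketch: requests performs a comparable encoding internally when given params=, and the build_search_url helper is a hypothetical name introduced here.

```python
from urllib.parse import urlencode


def build_search_url(params, base="https://www.google.com/search"):
    """Append URL-encoded query parameters to the base search URL."""
    return base + "?" + urlencode(params)


params = {
    "q": "what is the best minecraft skin in 2021",  # query
    "gl": "uk",                                      # country to search from
}

print(build_search_url(params))
# https://www.google.com/search?q=what+is+the+best+minecraft+skin+in+2021&gl=uk

# The query from the question: the comma is percent-encoded, the space becomes '+'
print(build_search_url({"q": "LastName, FirstName"}))
# https://www.google.com/search?q=LastName%2C+FirstName
```

Passing a dictionary avoids concatenating the query onto the URL by hand and keeps special characters correctly escaped.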

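Alternatively, you can use the Google Organic Results API from SerpApi to achieve the same thing. It's a paid API with a free plan.

The difference in your case is that you don't need to figure out why you aren't getting the right HTML with the data you need, because the extraction is already done for the end user. All that really needs to be done is to iterate over the structured JSON and pick out the data you want.

Code to integrate: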

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "best minecraft skin in 2021",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - first search result
first_link = results['organic_results'][0]['link']
print(first_link)

# https://www.minecraftskins.com/search/skin/2021/1/
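
Iterating over the structured results, rather than taking only index 0, could look like the sketch below. It assumes only the shape used above (an 'organic_results' list whose items carry a 'link' key). The sample_results dictionary, its 'title' key, and the collect_links helper are hypothetical and exist only so the example runs without an API key.

```python
# Hypothetical data shaped like the dictionary used above;
# a real response contains many more fields.
sample_results = {
    "organic_results": [
        {"link": "https://example.com/a"},
        {"link": "https://example.com/b"},
        {"title": "an entry without a link"},
    ]
}


def collect_links(results):
    """Return every 'link' found under 'organic_results', skipping items without one."""
    return [
        item["link"]
        for item in results.get("organic_results", [])
        if "link" in item
    ]


links = collect_links(sample_results)
print(links)     # ['https://example.com/a', 'https://example.com/b']
print(links[0])  # https://example.com/a
```

Using .get() with an empty list as the default means a response without organic results yields an empty list instead of a KeyError.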

Disclaimer: I work for SerpApi.

【Comments】: