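One likely reason is that no user-agent was specified, so Google blocks your request and you receive different HTML with different selectors. Learn more about HTTP request headers.
Pass a user-agent in the request headers: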
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
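To make it easier to see what actually gets sent, here is a minimal sketch that builds the same kind of request with the standard library's `urllib` instead of `requests`, without touching the network. The helper name `build_search_request` and the `country` default are assumptions made for illustration; they are not part of the original answer.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Same user-agent string as in the answer.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)


def build_search_request(query, country="uk"):
    """Hypothetical helper: build (but do not send) a search request."""
    # urlencode escapes the query string, turning spaces into '+'.
    url = "https://www.google.com/search?" + urlencode({"q": query, "gl": country})
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_search_request("best minecraft skin")
print(req.full_url)
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Inspecting the request object like this is a quick way to confirm the header is attached before blaming the selectors.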
To get only the first link, use the bs4 select_one() method (CSS selectors reference):
first_link = soup.select_one(".yuRUbf a")['href']
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "what is the best minecraft skin in 2021",  # query
    "gl": "uk"  # country to search from
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# gets the very first link from the search results
first_link = soup.select_one(".yuRUbf a")['href']
print(first_link)
# https://codakid.com/minecraft-skins/
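Note that `select_one()` returns `None` when nothing matches, which is what happens if Google serves different HTML, and indexing `None` with `['href']` raises a `TypeError`. A minimal defensive sketch follows, run against a made-up HTML snippet rather than a live page. The function name `first_result_link` and the sample markup are assumptions for illustration, and the `.yuRUbf` class may no longer match Google's current markup.

```python
from bs4 import BeautifulSoup


def first_result_link(html, selector=".yuRUbf a"):
    """Hypothetical helper: return the first matching href, or None."""
    # "html.parser" is the built-in parser, so lxml is not required here.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    if tag is None:
        # No match: blocked request or changed markup.
        return None
    return tag.get("href")


# Made-up markup that mimics the structure the selector expects.
sample_html = '<div class="yuRUbf"><a href="https://example.com/">Example</a></div>'
print(first_result_link(sample_html))
print(first_result_link("<p>blocked</p>"))
```

Returning `None` instead of raising lets the caller decide whether to retry with different headers or a different selector.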
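Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out why you aren't getting the right HTML with the data you need, because the extraction part is already done for the end user. All that really needs to be done is to iterate over structured JSON and pick out the data you want.
Code to integrate: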
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "best minecraft skin in 2021",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first search result
first_link = results['organic_results'][0]['link']
print(first_link)
# https://www.minecraftskins.com/search/skin/2021/1/
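The "iterate over structured JSON" step can be sketched as below. The `sample_results` dict is made-up data that only mirrors the `organic_results` / `link` keys used above; it is not real API output, and `organic_links` is a hypothetical helper name.

```python
def organic_links(results):
    """Hypothetical helper: collect every 'link' from the organic results."""
    # .get() with a default avoids a KeyError when the key is missing,
    # e.g. when the search returned no organic results.
    return [
        item["link"]
        for item in results.get("organic_results", [])
        if "link" in item
    ]


# Made-up response shaped like the dict used in the code above.
sample_results = {
    "organic_results": [
        {"position": 1, "link": "https://example.com/first"},
        {"position": 2, "link": "https://example.com/second"},
    ]
}

links = organic_links(sample_results)
print(links[0] if links else None)
print(organic_links({}))
```

Collecting all links first and then indexing keeps the "first result" case and the "no results" case on the same code path.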
Disclaimer: I work for SerpApi.