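Since you're trying to scrape just one element from the whole HTML (if that's the case), there's no need to use the find_all()/findAll() method.
Instead, you can use the find() or select_one() methods that bs4 provides: find() grabs one specific element by tag name and attributes, and select_one() selects it with a CSS selector. You can find CSS selectors with SelectorGadget.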
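To make the difference concrete, here is a minimal sketch on a small hard-coded HTML snippet. The snippet is made up for illustration (it only borrows the ID names used further down) and is parsed with the built-in html.parser, so no network access or lxml is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only.
html = """
<div id="wob_wc">
  <span id="wob_tm" class="temp">47</span>
  <span id="wob_hm" class="stat">49%</span>
  <span id="wob_ws" class="stat">9 mph</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
temperature = soup.find("span", id="wob_tm").text

# select_one() also returns the first match, but takes a CSS selector.
humidity = soup.select_one("#wob_hm").text

# find_all() returns a list of every match, which is more than
# you need when you are after a single element.
stats = [tag.text for tag in soup.find_all("span", class_="stat")]

print(temperature, humidity, stats)
```

With find_all() you would then have to index into the list, which is why find() or select_one() is the better fit for a single element.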
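For example, say you want to scrape weather data from the Google Search answer box.
You can do it by:
- Using a custom script. I scraped a bit more, just to show that it's a straightforward process.
- Using the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the playground to test it.
Code and full example in the online IDE (it works for other weather searches as well):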
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; importing it confirms the 'lxml' parser is installed

# Send a browser-like User-Agent instead of the default python-requests one
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Pass the query via params so requests URL-encodes the space for us
response = requests.get('https://www.google.com/search', params={'q': 'london weather'}, headers=headers).text
soup = BeautifulSoup(response, 'lxml')

# select_one() returns the first element that matches each ID selector
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text

print(f'Weather condition: {weather_condition}\nTemperature: {temperature}°F\nPrecipitation: {precipitation}\nHumidity: {humidity}\nWind speed: {wind}\nCurrent time: {current_time}')

# output:
'''
Weather condition: Mostly cloudy
Temperature: 47°F
Precipitation: 79%
Humidity: 49%
Wind speed: 9 mph
Current time: Thursday 10:00 AM
'''
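One thing to keep in mind with the custom script: select_one() returns None when nothing matches, so calling .text on the result raises an AttributeError if an element is missing from the page you got back. A minimal sketch of a guard, assuming you'd rather get a default value than an exception; select_text is a hypothetical helper name, and the HTML is a made-up stand-in for a response without the precipitation element:

```python
from bs4 import BeautifulSoup


def select_text(soup, css_selector, default=None):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    element = soup.select_one(css_selector)
    return element.get_text(strip=True) if element is not None else default


# Hypothetical stand-in for a response that lacks the precipitation element.
html = '<span id="wob_dc">Mostly cloudy</span><span id="wob_tm">47</span>'
soup = BeautifulSoup(html, "html.parser")

weather_condition = select_text(soup, "#wob_dc")
precipitation = select_text(soup, "#wob_pp", default="n/a")

print(weather_condition, precipitation)
```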
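Essentially, the main difference is that with the Google Direct Answer Box API everything is already done for the end user and comes back as JSON output. You don't have to figure out the page structure and tinker with HTML elements to get the desired output, or guess why the output differs from what you expected.
Code to get the weather answer box: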
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "london weather",
    "api_key": os.getenv("API_KEY"),  # API key is read from the API_KEY environment variable
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

# Look up the answer box once and read every field from it
answer_box = results['answer_box']
loc = answer_box['location']
weather_date = answer_box['date']
weather = answer_box['weather']
temp = answer_box['temperature']
unit = answer_box['unit']
precipitation = answer_box['precipitation']
humidity = answer_box['humidity']
wind = answer_box['wind']
forecast = answer_box['forecast']

print(f'{loc}\n{weather_date}\n{weather}\n{temp}\n{unit}\n{precipitation}\n{humidity}\n{wind}\n\n{forecast}')
# output:
'''
London, UK
Thursday 7:00 AM
Mostly sunny
53
Fahrenheit
2%
89%
1 mph
[{'day': 'Thursday', 'weather': 'Mostly cloudy', 'temperature': {'high': '70', 'low': '53'}},
...
'''
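Indexing with results['answer_box'] raises a KeyError if that key is missing from the response. I'm assuming here that this can happen for queries that don't trigger an answer box. A minimal sketch of a more defensive lookup with dict.get(); the results dict below is a hard-coded, hypothetical stand-in for search.get_dict(), shaped like the output above:

```python
# Hypothetical stand-in for search.get_dict(), shaped like the output above.
results = {
    "answer_box": {
        "location": "London, UK",
        "temperature": "53",
        "unit": "Fahrenheit",
    }
}

# Fall back to an empty dict if there is no answer box at all.
answer_box = results.get("answer_box", {})

location = answer_box.get("location", "unknown")
temperature = answer_box.get("temperature")
unit = answer_box.get("unit", "")
wind = answer_box.get("wind", "n/a")  # not in the sample, so the default is used

if temperature is not None:
    print(f"{location}: {temperature} {unit}, wind: {wind}")
else:
    print("No weather answer box in the response")
```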
Disclaimer: I work for SerpApi.