Scraping a table from a website [timeanddate.com]

【Question title】: Scraping table from website [timeanddate.com]
【Posted】: 2019-01-16 07:32:40
【Question】:

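I want to get historical hourly weather data from https://www.timeanddate.com/

This is the link to the page: https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016. Here I selected February as the month and 2016 as the year; the results appear at the bottom of the page.

I followed the steps in this answer: https://stackoverflow.com/a/47280970/9341589

It works well for the first day of each month, but I want to parse the whole month and, if possible, the whole year.

Below is the code I am using (it parses March 1, 2016):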

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.timeanddate.com/weather/usa/dayton/historic?month=3&year=2016"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

data = []
table = soup.find('table', attrs={'id': 'wt-his'})
for tr in table.find('tbody').find_all('tr'):
    # Named `row` rather than `dict` so the built-in dict is not shadowed.
    row = {}
    row['time'] = tr.find('th').text.strip()
    all_td = tr.find_all('td')
    row['temp'] = all_td[1].text
    row['weather'] = all_td[2].text
    row['wind'] = all_td[3].text
    arrow = all_td[4].text  # read but not stored in the row

    row['humidity'] = all_td[5].text
    row['barometer'] = all_td[6].text
    row['visibility'] = all_td[7].text

    data.append(row)

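This is the result for March 1:

This happens because the site URL contains only the month and the year. To change the day, for example from February 1 to February 3, you have to use the selector on the page, as shown in the attached screenshot.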

【Comments】:

Tags: python python-3.x web-scraping beautifulsoup html-parsing

【Solution 1】:

You can iterate over the table elements (tr, th, and td) of a single page:

import contextlib
import re
import typing

import requests
from bs4 import BeautifulSoup as soup

def _remove(d: list) -> list:
    # Strip non-breaking spaces, then drop cells that are left empty.
    return list(filter(None, [re.sub('\xa0', '', b) for b in d]))

@contextlib.contextmanager
def get_weather_data(url: str, by_url=True) -> typing.Generator[dict, None, None]:
    # `url` is fetched when by_url is True; otherwise it is treated as raw HTML.
    d = soup(requests.get(url).text if by_url else url, 'html.parser')
    _table = d.find('table', {'id': 'wt-his'})
    # One entry per <tr>: the list of <th> texts, followed by each <td> text.
    _data = [[[i.text for i in c.find_all('th')], *[i.text for i in c.find_all('td')]] for c in _table.find_all('tr')]
    # Two header rows, then the data rows; the final row is discarded.
    [h1], [h2], *data, _ = _data
    _h2 = _remove(h2)
    yield {tuple(_remove(h1)): [dict(zip(_h2, _remove([a, *i]))) for [[a], *i] in data]}


with get_weather_data('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016') as weather:
    print(weather)

Output:

{('Conditions', 'Comfort'): [{'Time': '12:58 amMon, Feb 1', 'Temp': '50°F', 'Weather': 'Light rain. Mostly cloudy.', 'Wind': '13 mph', 'Humidity': '↑', 'Barometer': '88%', 'Visibility': '29.79 "Hg'}, {'Time': '1:58 am', 'Temp': '46°F', 'Weather': 'Mostly cloudy.', 'Wind': '12 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.82 "Hg'}, {'Time': '2:58 am', 'Temp': '43°F', 'Weather': 'Mostly cloudy.', 'Wind': '14 mph', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '29.87 "Hg'}, {'Time': '3:58 am', 'Temp': '42°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.89 "Hg'}, {'Time': '4:58 am', 'Temp': '41°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.91 "Hg'}, {'Time': '5:58 am', 'Temp': '39°F', 'Weather': 'Mostly cloudy.', 'Wind': '8 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.93 "Hg'}, {'Time': '6:58 am', 'Temp': '38°F', 'Weather': 'Partly cloudy.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.96 "Hg'}, {'Time': '7:58 am', 'Temp': '38°F', 'Weather': 'Partly sunny.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '80%', 'Visibility': '29.99 "Hg'}, {'Time': '8:58 am', 'Temp': '38°F', 'Weather': 'Overcast.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '78%', 'Visibility': '30.01 "Hg'}, {'Time': '9:58 am', 'Temp': '40°F', 'Weather': 'Broken clouds.', 'Wind': '7 mph', 'Humidity': '↑', 'Barometer': 'N/A', 'Visibility': '30.01 "Hg'}, {'Time': '10:58 am', 'Temp': '41°F', 'Weather': 'Broken clouds.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '72%', 'Visibility': '30.02 "Hg'}, {'Time': '11:58 am', 'Temp': '41°F', 'Weather': 'Partly sunny.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '70%', 'Visibility': '30.04 "Hg'}, {'Time': '12:58 pm', 'Temp': '42°F', 'Weather': 'Scattered clouds.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '69%', 'Visibility': '30.04 "Hg'}, {'Time': '1:58 pm', 'Temp': 
'43°F', 'Weather': 'Partly sunny.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.03 "Hg'}, {'Time': '2:58 pm', 'Temp': '44°F', 'Weather': 'Partly sunny.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '62%', 'Visibility': '30.02 "Hg'}, {'Time': '3:58 pm', 'Temp': '46°F', 'Weather': 'Passing clouds.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '58%', 'Visibility': '30.03 "Hg'}, {'Time': '4:58 pm', 'Temp': '46°F', 'Weather': 'Sunny.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '57%', 'Visibility': '30.04 "Hg'}, {'Time': '5:58 pm', 'Temp': '43°F', 'Weather': 'Clear.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.06 "Hg'}, {'Time': '6:58 pm', 'Temp': '39°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '71%', 'Visibility': '30.09 "Hg'}, {'Time': '7:58 pm', 'Temp': '35°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '79%', 'Visibility': '30.11 "Hg'}, {'Time': '8:58 pm', 'Temp': '32°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '30.13 "Hg'}, {'Time': '9:58 pm', 'Temp': '30°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '91%', 'Visibility': '30.14 "Hg'}, {'Time': '10:58 pm', 'Temp': '28°F', 'Weather': 'Clear.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '93%', 'Visibility': '30.14 "Hg'}, {'Time': '11:58 pm', 'Temp': '29°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '90%', 'Visibility': '30.13 "Hg'}]}
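
Note that in the output above the values after 'Wind' are shifted by one column: 'Humidity' holds the wind-direction arrow, 'Barometer' holds the humidity percentage, and 'Visibility' holds the pressure reading. That pattern suggests the arrow cell has no header text of its own, so zip pairs the headers with the wrong cells. Below is a minimal sketch of one way to realign a row. It assumes the non-empty cells arrive in the order time, temp, weather, wind, arrow, humidity, barometer, visibility, which is inferred from the output and not checked against the live page. FIELDS and align_row are hypothetical names, and the sample row reuses the first row of the output (its visibility value does not appear there, so it is left out).

```python
# Hypothetical field order, inferred from the shifted output above.
FIELDS = ("Time", "Temp", "Weather", "Wind", "Direction",
          "Humidity", "Barometer", "Visibility")

def align_row(cells):
    """Map one row's non-empty cell texts onto FIELDS."""
    # Replace non-breaking spaces with plain spaces and trim each cell.
    cleaned = [c.replace("\xa0", " ").strip() for c in cells]
    # Pad with None so a short row still yields every key.
    cleaned += [None] * (len(FIELDS) - len(cleaned))
    return dict(zip(FIELDS, cleaned))

row = align_row(["12:58 amMon, Feb 1", "50\xa0°F", "Light rain. Mostly cloudy.",
                 "13 mph", "↑", "88%", '29.79 "Hg'])
print(row["Humidity"], row["Barometer"], row["Visibility"])
```

Giving the arrow its own key ('Direction' here) keeps the remaining headers lined up with their values.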

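However, to scrape the data for every day in the desired month, selenium is needed, because the site updates the DOM dynamically through requests to the backend: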

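from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 3 style call; newer Selenium 4 releases take the driver path through a Service object instead.
d = webdriver.Chrome('/Path/to/chromedriver')
d.get('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016')
_d = {}
# Click each option of the day selector and parse the table that is then shown.
for i in d.find_element(By.ID, 'wt-his-select').find_elements(By.TAG_NAME, 'option'):
    i.click()
    with get_weather_data(d.page_source, False) as weather:
        _d[i.text] = weather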
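
The question also asks about covering a whole year. Since the URL carries only month and year query parameters, one option is to generate the twelve monthly URLs and repeat the loop above for each of them. The following is a minimal sketch; month_urls is a hypothetical helper, and the URL pattern is simply copied from the link in the question.

```python
from urllib.parse import urlencode

# Base URL taken from the question; only the query string varies.
BASE = "https://www.timeanddate.com/weather/usa/dayton/historic"

def month_urls(year, base=BASE):
    """Return the twelve monthly history URLs for the given year."""
    return [f"{base}?{urlencode({'month': m, 'year': year})}"
            for m in range(1, 13)]

urls = month_urls(2016)
print(urls[1])
```

Each URL could then be passed to d.get(...) before running the option loop, pausing between requests to avoid hammering the site.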

Edit: to iterate over the full set of results, use dict.items:

for a, b in _d.items():
  pass #do something with a and b
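
As one possible way to "do something" with the results, the sketch below flattens the nested structure into one record per hourly row and writes it as CSV. It assumes _d has the shape built above (day label mapped to a dict of header tuple to list of row dicts). The flatten helper and the sample data are hypothetical stand-ins, and the exact text of the day labels on the live page may differ.

```python
import csv
import io

def flatten(results):
    """Yield one flat dict per hourly row, tagged with its day label."""
    for day, weather in results.items():
        # Each value maps a header tuple to that day's list of row dicts.
        for rows in weather.values():
            for row in rows:
                yield {"Day": day, **row}

# Tiny stand-in for _d with the same nesting; values are illustrative only.
sample = {
    "Mon, Feb 1": {("Conditions", "Comfort"): [
        {"Time": "12:58 am", "Temp": "50°F"},
        {"Time": "1:58 am", "Temp": "46°F"},
    ]},
}

rows = list(flatten(sample))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Day", "Time", "Temp"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Replacing io.StringIO() with an opened file (newline='') would write the same rows to disk.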

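【Comments】:

I am still getting an error with your code (TypeError: 'str' object is not callable). Thanks for your help.
After re-running the code I get: TypeError: Parameters to generic types must be types. Got {'time': '11:58 pm', 'temp': '27\xa0°F', 'weather': 'Partly cloudy.', 'wind': '21 mph', 'humidity': .
@KingJulien Strange, I do not get that error when running this code. I am using Python 3.7. Which version are you running this code on?
@KingJulien That is what I get as well. How does it compare with your desired output?
@KingJulien You need to install chromedriver from here: chromedriver.chromium.org/downloads and pass the path to the installed binary to Chrome.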

【Solution 2】:

Using the developer tools in Chrome, it looks like you can use driver.find_element_by_link_text(date_here).click() to find and click the link whose text is first_three_letters_of_month day.
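
A minimal sketch of how that suggestion might be wired up, assuming the link text really has the form described above (for example "Feb 3"); this has not been checked against the live page. day_link_text and click_day are hypothetical helpers, and By.LINK_TEXT is the locator form of find_element_by_link_text.

```python
import datetime

# Fixed English abbreviations, so the result does not depend on the locale.
MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def day_link_text(day):
    """Build link text such as 'Feb 3' from a date."""
    return f"{MONTH_ABBR[day.month - 1]} {day.day}"

def click_day(driver, day):
    """Click the link for the given day on the already-loaded page."""
    # Imported here so day_link_text can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    driver.find_element(By.LINK_TEXT, day_link_text(day)).click()

print(day_link_text(datetime.date(2016, 2, 3)))
```

After click_day(driver, some_date), the page source could be parsed in the same way as in Solution 1.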

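【Comments】:

I would appreciate it if you could give me the full solution and tell me where to add it. I am a beginner in HTML parsing and coding.
I would appreciate it if you could tell me how to implement it.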