【Posted】: 2019-11-11 16:54:53
【Question】:
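I am writing a function to scrape some temperature data from a website. It works, but only for the first day of each month.
The code below fetches the data for month 8 of 2015, but it only scrapes the first table.
How can I use rvest to collect all the tables for that month?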
https://www.timeanddate.com/weather/spain/madrid/historic?month=8&year=2015
library(rvest)
library(dplyr)
library(purrr)
Temps <- function(month, year){
  url <- paste("https://www.timeanddate.com/weather/spain/madrid/historic?month=", month, "&year=",year, sep = "")
  temps_obtained <- url %>%
    read_html() %>%
    html_table(fill = TRUE) %>%
    .[[2]] %>%
    setNames(.[1,]) %>%
    as_tibble(., .name_repair = "universal") %>%
    dplyr::slice(., -1) %>%
    dplyr::slice(., -n())
  return(temps_obtained)
}
map2(.x = 8, .y = 2015, ~Temps(.x, .y))
Edit: I just found this solution (for Python):
Scraping table from website [timeanddate.com]
Edit: this is what I am currently using; it does not return any data:
year = 2019
month = 11
day = 3
# pad must be a single character, so pass "0" rather than the number 0
month = stringr::str_pad(month, width = 2, pad = "0")
day = stringr::str_pad(day, width = 2, pad = "0")
url <- paste("https://www.timeanddate.com/weather/spain/madrid/historic?hd=", year, month, day, sep = "")
temps_obtained <- url %>%
  html_session() %>%
  read_html() %>%
  html_table(fill = TRUE)
Edit:
I think this solves the problem...
year = 2019
month = 11
day = 3
# pad must be a single character, so pass "0" rather than the number 0
month = stringr::str_pad(month, width = 2, pad = "0")
day = stringr::str_pad(day, width = 2, pad = "0")
url <- paste("https://www.timeanddate.com/weather/spain/madrid/historic?hd=", year, month, day, sep = "")
temps_obtained <- url %>%
  html_session() %>%
  read_html() %>%
  html_table(fill = TRUE) %>%
  .[[2]] %>%
  setNames(.[1,]) %>%
  as_tibble(., .name_repair = "universal") %>%
  dplyr::slice(., -1) %>%
  dplyr::slice(., -n())
This returns:
# A tibble: 27 x 9
Time ...2 Temp Weather Wind ...6 Humidity Barometer Visibility
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 7:00 amSun, Nov 3 "" 55 °F Passing clouds. 16 mph ↑ 88% "29.62 \"Hg" N/A
2 7:30 am "" 55 °F Passing clouds. 21 mph ↑ 88% "29.62 \"Hg" N/A
3 8:00 am "" 55 °F Broken clouds. 21 mph ↑ 88% "29.62 \"Hg" N/A
4 8:30 am "" 55 °F Broken clouds. 18 mph ↑ 88% "29.65 \"Hg" N/A
5 9:00 am "" 55 °F Drizzle. Broken clouds. 16 mph ↑ 94% "29.68 \"Hg" N/A
6 9:30 am "" 57 °F Broken clouds. 21 mph ↑ 82% "29.71 \"Hg" N/A
7 10:00 am "" 57 °F Broken clouds. 26 mph ↑ 63% "29.71 \"Hg" N/A
8 10:30 am "" 57 °F Scattered clouds. 29 mph ↑ 55% "29.74 \"Hg" N/A
9 11:00 am "" 57 °F Scattered clouds. 17 mph ↑ 55% "29.77 \"Hg" N/A
10 11:30 am "" 59 °F Scattered clouds. 20 mph ↑ 51% "29.77 \"Hg" N/A
Changing day to 4 gives me a different result.
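As a side note on the snippet above, the date part of the URL can also be assembled without stringr. This is a minimal sketch, assuming the hd= parameter always takes an 8-digit YYYYMMDD string as in the URLs shown in this question; hd_url is a hypothetical helper name, not part of rvest.

```r
# Hypothetical helper: build the per-day URL from numeric year, month, day.
# sprintf() zero-pads month and day, so no separate padding step is needed.
hd_url <- function(year, month, day) {
  hd <- sprintf("%04d%02d%02d",
                as.integer(year), as.integer(month), as.integer(day))
  paste0("https://www.timeanddate.com/weather/spain/madrid/historic?hd=", hd)
}

hd_url(2019, 11, 3)
# "https://www.timeanddate.com/weather/spain/madrid/historic?hd=20191103"
```

The returned string can be piped into the same scraping steps used above.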
Edit: not working
The function works, but only for days from 2017 onward. If I run the following, it does not work:
url <- "https://www.timeanddate.com/weather/spain/madrid/historic?hd=20100109"
temps_obtained <- url %>%
  html_session() %>%
  read_html() %>%
  html_node("table") %>%
  html_table(fill = TRUE)
This gives me:
1 High
2 Low
3 Average
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
Temperature
1 72 °F (Oct 31, 3:30 pm)
2 39 °F (Nov 8, 8:00 am)
3 56 °F
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
Humidity
1 100% (Oct 29, 7:30 am)
2 36% (Nov 8, 3:00 pm)
3 69%
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
Pressure
1 30.27 "Hg (Oct 29, 7:30 am)
2 29.62 "Hg (Nov 3, 7:00 am)
3 30.00 "Hg
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
This is not the data I need.
【Comments】:
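- If you hover over the links at the bottom of the page, you will see another form of URL (with ?hd=20150801 and so on) that takes you to each individual day, so you can rebuild your function to iterate over all of them.
- You may need to set it up as an html_session and use jump_to, depending on whether the site lets you use those URLs directly.
- You mean pmap...
- Omit html_node("table") - html_table will then produce a list of tables, and you can select the second one with %>% .[[2]]
- Another approach: check what the site actually requests from the server when you change the date (using the "Network" tab in Chrome's Inspect). You can get a json table using URLs of the form https://www.timeanddate.com/scripts/cityajax.php?n=spain/madrid&mode=historic&hd=20091202&month=12&year=2009&json=1. This is what populates the daily data table on the main month page. You can request these directly and convert them to data frames with rjson or jsonlite.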
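The suggestion in the first comment (walking the per-day ?hd= URLs) might be sketched as follows. This is only a sketch under assumptions: daily_urls and scrape_day are hypothetical helper names, and scrape_day assumes the second table on each daily page is the one of interest, as in the question. The question's last edit suggests this may not hold for older dates.

```r
# Hypothetical helper: one "?hd=YYYYMMDD" URL per day of the given month.
daily_urls <- function(month, year,
                       base = "https://www.timeanddate.com/weather/spain/madrid/historic") {
  first <- as.Date(sprintf("%04d-%02d-01", as.integer(year), as.integer(month)))
  # First day of the next month, minus one day, is the last day of this month.
  last <- seq(first, by = "month", length.out = 2)[2] - 1
  days <- seq(first, last, by = "day")
  paste0(base, "?hd=", format(days, "%Y%m%d"))
}

# Hypothetical helper: read one daily page and keep its second table
# (assumption carried over from the question; requires the rvest package).
scrape_day <- function(url) {
  tables <- rvest::html_table(rvest::read_html(url), fill = TRUE)
  tables[[2]]
}

urls <- daily_urls(8, 2015)
length(urls)  # 31, one URL per day of August 2015
urls[1]       # ends in "?hd=20150801"

# Not run here because it needs network access:
# month_tables <- lapply(urls, scrape_day)
```

Keeping the URL construction separate from the scraping step makes it possible to check the generated URLs without hitting the site.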