[Posted]: 2018-10-05 09:04:00
[Problem description]:
I am trying to extract data from this website. I am interested in the draft selections by year, which range from 1963 to 2018.
The URLs share a common pattern, for example https://www.eliteprospects.com/draft/nhl-entry-draft/2018, https://www.eliteprospects.com/draft/nhl-entry-draft/2017, and so on.
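Since only the trailing year changes, the full set of URLs can be generated in one vectorized call. A minimal sketch in base R, where `build_draft_urls` is a hypothetical helper name that does not appear in the original code:

```r
# Hypothetical helper: build one URL per draft year.
# paste0() is vectorized, so passing a vector of years yields a vector of URLs.
build_draft_urls <- function(draft_type, years) {
  # Turn "nhl entry draft" into the slug "nhl-entry-draft"
  slug <- gsub(" ", "-", draft_type, fixed = TRUE)
  paste0("https://www.eliteprospects.com/draft/", slug, "/", years)
}

urls <- build_draft_urls("nhl entry draft", 1963:2018)
head(urls, 2)
```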
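So far I have successfully extracted the data for a single year. I wrote a custom function that, given a draft type and a draft year, scrapes the page and returns the data as a tidy data frame.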
library(rvest)
library(tidyverse)
library(stringr)

get_draft_data <- function(draft_type, draft_year) {
  # Replace the spaces in the draft type with '-' to form the URL slug
  draft_slug <- str_replace_all(draft_type, " ", "-")

  # Build the page URL and read the page
  page <- str_c("https://www.eliteprospects.com/draft/", draft_slug, "/", draft_year) %>%
    read_html()

  # Extract the team data
  draft_team <- page %>%
    html_nodes(".team") %>%
    html_text() %>%
    str_squish() %>%
    as_tibble()

  # Extract the player data
  draft_player <- page %>%
    html_nodes("#drafted-players .player") %>%
    html_text() %>%
    str_squish() %>%
    as_tibble()

  # Extract the seasons data
  draft_season <- page %>%
    html_nodes(".seasons") %>%
    html_text() %>%
    str_squish() %>%
    as_tibble()

  # Bind the three single-column tibbles side by side.
  # Each as_tibble() call yields a column named `value`, so the names repeat.
  all_data <- cbind(draft_team, draft_player, draft_season)
  return(all_data)
} # end function

# Testing the function
draft_data <- get_draft_data("nhl entry draft", 2011)
glimpse(draft_data)
Observations: 212
Variables: 3
$ value <chr> "Team", "Edmonton Oilers", "Colorado Avalanche", "Florida Panth...
$ value <chr> "Player", "Ryan Nugent-Hopkins (F)", "Gabriel Landeskog (F)", "...
$ value <chr> "Seasons", "8", "8", "7", "8", "6", "8", "8", "8", "7", "7", "3...
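Question: How can I write code that automatically increments the year in the page URL, so that the scraper extracts the data for every year and writes it all into one data frame?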
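One possible approach is to loop over the years, call the single-year scraper for each one, tag every result with its year, and row-bind the pieces. The sketch below is base R only and is not taken from the original post: `scrape_years`, `scrape_one`, and `pause` are hypothetical names, and it assumes the per-year function returns a data frame with distinct column names. The function in the question returns three columns all named `value`, so they would probably need renaming first. A stub scraper stands in for the real one so the example runs without network access.

```r
# Hypothetical wrapper: apply a per-year scraper to each year and stack the results.
# scrape_one: function(year) returning a data frame (assumed to have unique column names).
# pause: seconds to wait before each request, to avoid hammering the site.
scrape_years <- function(years, scrape_one, pause = 1) {
  pieces <- lapply(years, function(yr) {
    Sys.sleep(pause)
    # A failing year (missing page, network error) is reported and skipped
    result <- tryCatch(scrape_one(yr), error = function(e) {
      message("Skipping ", yr, ": ", conditionMessage(e))
      NULL
    })
    if (is.null(result)) return(NULL)
    result$draft_year <- yr  # record which year each row came from
    result
  })
  # Drop the skipped years, then row-bind what is left
  do.call(rbind, Filter(Negate(is.null), pieces))
}

# Stub scraper used only for illustration; it fails for one year on purpose.
stub_scraper <- function(yr) {
  if (yr == 2001) stop("page not found")
  data.frame(team = c("Team A", "Team B"), stringsAsFactors = FALSE)
}

demo <- scrape_years(2000:2002, stub_scraper, pause = 0)

# With the function from the question, the call might look like this (untested):
# all_drafts <- scrape_years(1963:2018, function(yr) {
#   setNames(get_draft_data("nhl entry draft", yr), c("team", "player", "seasons"))
# })
```

Wrapping each call in `tryCatch()` means a single bad page does not discard the years that were already collected, and the `draft_year` column keeps the rows distinguishable after binding.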
[Comments]:
Tags: web-scraping rvest