[Posted]: 2017-01-07 00:57:27
[Question]:
I'd like to use the rvest package to grab some data from the Pro Football Reference website. First, let's grab the results for all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm
library("rvest")
library("dplyr")
#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html()
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_data_frame()
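Is this how you would have done it? :)
dat could be cleaned up a bit. Two of the variables seem to have blanks for names, and the header row is repeated between each week.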
colnames(dat) <- c("week", "day", "date", "winner", "at", "loser",
"box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")
dat2 <- dat %>% filter(!(box == ""))
head(dat2)
Looks good!
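The cleanup step above can be sketched on a small stand-in table. The toy data frame, its team names, and its values are hypothetical. It assumes, as the filter in the question implies, that the repeated header rows are the ones with an empty box column. Converting the score columns to integers is an extra step not in the original code.

```r
library(dplyr)

# Hypothetical stand-in for the scraped table: the header row repeats
# between weeks, and in those rows the "box" column is empty.
toy <- tibble(
  week   = c("1", "1", "Week", "2"),
  winner = c("Team A", "Team B", "Winner/tie", "Team C"),
  box    = c("boxscore", "boxscore", "", "boxscore"),
  ptsW   = c("28", "31", "PtsW", "20")
)

cleaned <- toy %>%
  filter(box != "") %>%                      # drop the repeated header rows
  mutate(across(c(week, ptsW), as.integer))  # text columns to integers

cleaned
```

Filtering before converting matters here: the stray header text would otherwise turn into NA with a coercion warning.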
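Now let's look at an individual game. On the web page above, click "Boxscore" in the very first row of the table: the Sept 10 game between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.
I'd like to grab the individual snap counts of each player (about halfway down the page). Pretty sure these will be our first two lines of code: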
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
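But now I can't figure out how to grab the specific table I want. I used Selector Gadget to highlight the Patriots snap count table. I did this by clicking on the table in several places, then "unclicking" the other tables that were highlighted. I ended up with this path: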
#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left
Every attempt returns {xml_nodeset (0)}:
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")
Maybe let's try XPath. All of these attempts also return {xml_nodeset (0)}:
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')
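How can I grab that table? I'll also point out that when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out. That is, they are displayed in green instead of the usual red/black/blue color scheme. That is not the case for the game results table we pulled first: its "View Page Source" shows the usual red/black/blue color scheme. Is the green an indication of what's preventing me from grabbing this snap count table?
Thanks!
[Comments]: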
url <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm#all_vis_snap_counts"
snap.count <- url %>% read_html() %>% html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "table_container", " " ))]')
This returns a single element (i.e. {xml_nodeset (1)}), but I can't seem to convert it to a table with html_table(fill=TRUE).
'http://www.pro-football-reference.com/boxscores/201509100nwe.htm' %>% read_html() %>% html_nodes(xpath = '//comment()') %>% html_text() %>% paste(collapse = '') %>% read_html() %>% html_node('table#home_snap_counts') %>% html_table() %>% {setNames(.[-1, ], paste0(names(.), .[1, ]))} %>% readr::type_convert()
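The idea behind the one-liner in the comment above can be shown on a small inline page, without a network request. The markup below is a hypothetical simplification of the real page (the wrapper id, player names, and numbers are made up). It assumes only what the question observes: the table sits inside an HTML comment. This is a minimal sketch of the comment-reparsing technique, not a full scraper for the site.

```r
library(rvest)  # also provides the %>% pipe

# Toy page (hypothetical markup): the table exists only inside an HTML comment.
page <- '<html><body>
<div id="all_home_snap_counts">
<!--
<table id="home_snap_counts">
  <tr><th>Player</th><th>Pos</th><th>Num</th></tr>
  <tr><td>Player A</td><td>QB</td><td>10</td></tr>
  <tr><td>Player B</td><td>WR</td><td>7</td></tr>
</table>
-->
</div>
</body></html>'

doc <- read_html(page)

# The parser keeps commented-out markup as a comment node, not as elements,
# so a direct id lookup matches nothing.
direct <- doc %>% html_nodes("#home_snap_counts")
length(direct)

# Collect every comment node, re-parse the comment text as HTML,
# then select the table by id and convert it to a data frame.
snaps <- doc %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("table#home_snap_counts") %>%
  html_table()

snaps
```

This is consistent with the green text in "View Page Source": commented-out markup never becomes element nodes in the parsed document, so CSS selectors and element XPath expressions cannot reach it until the comment text is parsed again. The real table appears to have a two-row header, which is presumably why the comment's version also merges the first data row into the column names with setNames().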
Tags: r web-scraping rvest