【Question title】:How to scrape id from each div class in rvest?
【Posted】:2018-08-26 00:55:43
【Question】:
Each div.grpl-grp clearfix (one per club element) on this page has its own id:
https://uws-community.symplicity.com/index.php?s=student_group
I am trying to scrape every id, but my current approach (shown below) does not work. What am I doing wrong?
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")
【Comments】:
Tags:
css
r
web-scraping
data-science
rvest
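【Solution 1】:
The selector "div.grpl-grp clearfix" contains a space, so it is read as a descendant selector: it looks for a <clearfix> element nested inside div.grpl-grp and matches nothing. In addition, html_attrs() returns all attributes of a node and does not take an attribute name; html_attr() is the function that extracts a single named attribute. Use XPath instead: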
library(magrittr)
library(rvest)
doc <- read_html("https://uws-community.symplicity.com/index.php?s=student_group")
html_nodes(doc, xpath=".//div[contains(@class, 'grpl-grp') and contains(@class, 'clearfix')]") %>%
html_attr("id")
## [1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
## [3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
## [5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
## [7] "grpl_a6a6cbf50b45d1ef05f8965c69f462de" "grpl_3fed7efb36173632ae2eef14393f37fc"
## [9] "grpl_f4e1e263109725bd4f99db9f70552b65" "grpl_2be038a5d159bf753fceb26cfdf596c2"
## [11] "grpl_918f9dec53fe5d36c1f98f5136f2ae7d" "grpl_f365b501f1e9833ca0cf8c504e37d11c"
## [13] "grpl_2f302fcce440ec1463beb73c6d7af070" "grpl_26b6771768df4a002e44ad6ec01fa36d"
## [15] "grpl_5e260344fd093628f3326a162996513a" "grpl_3604e5b44c0428dfc982c1bfc852fef2"
## [17] "grpl_9ab9bced3514bd8b2e0e18da8a3c7977" "grpl_6364bed0a4d3f45cd5b1fc929e320cb3"
## [19] "grpl_ba21e3c819afe6a32110585ac379f5d9" "grpl_9964a3732044fceffb4dc9b5645856ba"
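A CSS selector can also do the job once the two class names are chained with dots rather than separated by a space. The following is a minimal sketch, not the answerer's method: it runs against a small inline HTML fragment (hypothetical ids and club names, not the real page) so that it works without network access.

```r
library(rvest)

# Hypothetical stand-in for the page: two club elements plus one unrelated div.
html <- paste0(
  '<div id="grpl_aaa" class="grpl-grp clearfix">Club A</div>',
  '<div id="grpl_bbb" class="grpl-grp clearfix">Club B</div>',
  '<div id="other" class="clearfix">Not a club</div>'
)
doc <- read_html(html)

# ".grpl-grp.clearfix" (no space) requires both classes on the same div.
ids <- doc %>%
  html_nodes("div.grpl-grp.clearfix") %>%
  html_attr("id")  # html_attr() (singular) extracts one named attribute

ids
## [1] "grpl_aaa" "grpl_bbb"
```

Against the real page, the same selector would presumably be passed to html_nodes() on the document returned by read_html(url), assuming the club elements are present in the served HTML and not injected later by JavaScript.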