【Question title】: How to scrape the id from each div class in rvest?
【Posted】: 2018-08-26 00:55:43
【Question】:

Each div.grpl-grp clearfix (one per club element) on this page has its own id:

https://uws-community.symplicity.com/index.php?s=student_group

I am trying to scrape each id, but my current approach (shown below) does not work. What am I doing wrong?

url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)

id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")

【Comments】:

    Tags: css r web-scraping data-science rvest


    【Solution 1】:

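    Use XPath instead. In a CSS selector the space in "div.grpl-grp clearfix" is a descendant combinator, so it looks for a <clearfix> element inside div.grpl-grp and matches nothing. Also, html_attrs() returns every attribute and takes no attribute name; html_attr("id") is the function that extracts a single attribute: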

    library(magrittr)
    library(rvest)
    
    doc <- read_html("https://uws-community.symplicity.com/index.php?s=student_group")
    
    html_nodes(doc, xpath=".//div[contains(@class, 'grpl-grp') and contains(@class, 'clearfix')]") %>% 
      html_attr("id")
    ##  [1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
    ##  [3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
    ##  [5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
    ##  [7] "grpl_a6a6cbf50b45d1ef05f8965c69f462de" "grpl_3fed7efb36173632ae2eef14393f37fc"
    ##  [9] "grpl_f4e1e263109725bd4f99db9f70552b65" "grpl_2be038a5d159bf753fceb26cfdf596c2"
    ## [11] "grpl_918f9dec53fe5d36c1f98f5136f2ae7d" "grpl_f365b501f1e9833ca0cf8c504e37d11c"
    ## [13] "grpl_2f302fcce440ec1463beb73c6d7af070" "grpl_26b6771768df4a002e44ad6ec01fa36d"
    ## [15] "grpl_5e260344fd093628f3326a162996513a" "grpl_3604e5b44c0428dfc982c1bfc852fef2"
    ## [17] "grpl_9ab9bced3514bd8b2e0e18da8a3c7977" "grpl_6364bed0a4d3f45cd5b1fc929e320cb3"
    ## [19] "grpl_ba21e3c819afe6a32110585ac379f5d9" "grpl_9964a3732044fceffb4dc9b5645856ba"
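
The same two-class match can also be written as a CSS selector by chaining the classes without a space. The sketch below is not part of the original answer; it runs against a small inline HTML snippet (a hypothetical stand-in for the real page, with made-up ids) so that it works offline.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the example runs offline.
html <- '
<div id="grpl_a" class="grpl-grp clearfix">Club A</div>
<div id="grpl_b" class="grpl-grp clearfix">Club B</div>
<div id="other" class="grpl-grp">Not a club row</div>
'
doc <- read_html(html)

# "div.grpl-grp.clearfix" (no space) matches divs that carry BOTH classes.
ids <- html_nodes(doc, "div.grpl-grp.clearfix") %>% html_attr("id")
print(ids)
# [1] "grpl_a" "grpl_b"
```

Against the live page, replacing the inline string with the URL passed to read_html() should presumably give the same ids as the XPath version, assuming the page markup has not changed.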
    

    【Comments】:
