【Question title】: Harvest (rvest) multiple HTML pages from a list of urls
【Posted】: 2015-05-08 12:27:53
【Question】:

I have a data frame that looks like this:

country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
          "http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

    country url
1   US      http://en.wikipedia.org/wiki/United_States
2   Canada  http://en.wikipedia.org/wiki/Canada
3   Japan   http://en.wikipedia.org/wiki/Japan
4   China   http://en.wikipedia.org/wiki/China

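Using rvest, I want to scrape the table of contents for each url and bind the results into a single output.

This code extracts the table of contents for one url: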

library(rvest)
# read_html() replaces the deprecated html(); url[1] selects a single page
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%
  html_text()

Desired output:

country toc
US      Etymology
        History
        Native American and European contact
        Settlements
        ...  
Canada  Etymology
        History
        Aboriginal peoples
        European colonization
        ...etc

【Comments】:

    Tags: r rvest


    【Solution 1】:

    This scrapes them into one long data frame (one row per TOC entry). The tedious but straightforward "print/output" code is left to the OP:

    library(rvest)
    library(dplyr)
    
    country <- c("US", "Canada", "Japan", "China")
    url <- c("http://en.wikipedia.org/wiki/United_States", 
             "http://en.wikipedia.org/wiki/Canada",
             "http://en.wikipedia.org/wiki/Japan", 
             "http://en.wikipedia.org/wiki/China")
    # keep url as character so the join key has the same type on both sides
    df <- data.frame(country, url, stringsAsFactors = FALSE)
    
    # scrape each page in turn and stack the per-page data frames
    bind_rows(lapply(url, function(x) {
    
      # use x (the current url), not url[1]; read_html() replaces the deprecated html()
      data.frame(url = x, toc_entry = read_html(x) %>%
        html_nodes(".toctext") %>%
        html_text(), stringsAsFactors = FALSE)
    
    })) -> toc_entries
    
    # attach the country to every TOC entry
    df <- toc_entries %>% left_join(df, by = "url")
    
    # inspect a random sample of 10 rows
    df[sample(nrow(df), 10),]
    
    ## The result has three columns: url, toc_entry, country.
    ## The sampled rows differ on every run and depend on the live pages.
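
The apply-then-bind pattern above can be tried without network access. The following is only a sketch and not part of the original answer: the two inline HTML snippets are made-up stand-ins for the Wikipedia pages, `extract_toc()` is a hypothetical helper name, and the last step is one possible way to get the blank-repeated-country layout the question asks for.

```r
library(rvest)
library(dplyr)

# Made-up stand-ins for downloaded pages; real code would pass URLs instead.
pages <- c(
  US     = '<html><body><span class="toctext">Etymology</span><span class="toctext">History</span></body></html>',
  Canada = '<html><body><span class="toctext">Etymology</span></body></html>'
)

# Hypothetical helper: return the TOC entries of one page as a character vector.
extract_toc <- function(page) {
  read_html(page) %>%
    html_nodes(".toctext") %>%
    html_text()
}

# Build one data frame per page, then stack them: one row per TOC entry.
toc_entries <- bind_rows(lapply(names(pages), function(nm) {
  data.frame(country = nm, toc = extract_toc(pages[[nm]]),
             stringsAsFactors = FALSE)
}))

# For display only: blank out the country label on every row after its first.
display <- toc_entries %>%
  mutate(country = ifelse(duplicated(country), "", country))

print(display, row.names = FALSE)
```

Keeping `toc_entries` in long form and blanking labels only in a separate `display` copy means the full data stays usable for joins and filtering. Note that `data.frame()` stops with an error when a page yields no matching nodes, so real pages may need a check for an empty result.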
    

    【Comments】:
