【Question title】: Scraping linked HTML webpages by looping the rvest::follow_link() function
【Posted】: 2015-03-04 20:01:47
【Question】:

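How can I loop the rvest::follow_link() function to scrape linked web pages?

Use case:

  1. Identify all Lego Movie cast members
  2. Follow every Lego Movie cast member link
  3. Grab a table of every movie (+ year) for each cast member

The selectors I need are below: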

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie <- lego_movie %>%
  html_nodes(".itemprop , .character a") %>%
  html_text()

# follow cast links
(".itemprop .itemprop") 

# grab tables of all movies and dates for each cast member
(".year_column , b a")

Desired output:

castMember       movie    year
Will Arnett      Lego     2017
Will Arnett      BoJack   2014
Will Arnett      Wander   2014
        ............
Elizabeth Banks  Moonbeam 2015
Elizabeth Banks  Wet Hot  2015
        ............
Alison Brie      Get Hard 2015
Alison Brie      GetaJob  2015
        .....etc.....

【Question discussion】:

    Tags: r web-scraping rvest


    【Solution 1】:

    Maybe something like this would work.

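    library(rvest)
    library(stringr)
    library(data.table)
    # Read the title page and pull the cast names out of the cast table
    lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
    cast <- lego_movie %>%
        html_nodes("#titleCast .itemprop span") %>%
        html_text()
    cast
    
    # Open a browsing session so links can be followed from the title page
    s <- html_session("http://www.imdb.com/title/tt1490017/")
    
    cast_movies <- list()
    
    # Only the first three cast members, and the first ten rows of each page
    for(i in cast[1:3]){
        # Given a string, follow_link() follows the first link whose text contains it
        actorpage <- s %>% follow_link(i) %>% read_html()
        cast_movies[[i]]$movies <-  actorpage %>% 
            html_nodes("b a") %>% html_text() %>% head(10)
        cast_movies[[i]]$years <- actorpage %>%
            html_nodes("#filmography .year_column") %>% html_text() %>% 
            head(10) %>% str_extract("[0-9]{4}")
        cast_movies[[i]]$name <- rep(i, length(cast_movies[[i]]$years))
    }
    
    cast_movies
    # One actor as a data frame, then all actors stacked into one table
    as.data.frame(cast_movies[[1]])
    rbindlist(cast_movies)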
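
For readers on a newer rvest, the same loop can be wrapped in a function that returns one table in the shape the question asks for. This is only a sketch under assumptions: it assumes rvest >= 1.0, where `html_session()` and `follow_link()` are named `session()` and `session_follow_link()`; `scrape_cast` is a hypothetical helper name; and the selectors are the ones quoted in this thread, which may no longer match the site's markup.

```r
library(rvest)

# Hypothetical helper: follow each actor's link from the start page and
# return one data frame with the columns castMember, movie, year.
# Assumes rvest >= 1.0 (session(), session_follow_link()).
scrape_cast <- function(start_url, actors) {
  s <- session(start_url)
  per_actor <- lapply(actors, function(actor) {
    # Follow the first link whose text contains the actor's name;
    # return NULL instead of stopping if the link cannot be followed.
    visited <- tryCatch(session_follow_link(s, actor), error = function(e) NULL)
    if (is.null(visited)) return(NULL)
    page <- read_html(visited)
    movies <- page %>% html_elements("b a") %>% html_text()
    years <- page %>% html_elements("#filmography .year_column") %>% html_text()
    # Truncate both vectors to the shorter one so the columns line up.
    n <- min(length(movies), length(years))
    data.frame(
      castMember = rep(actor, n),
      movie = head(movies, n),
      # Strip non-digits, then keep the first four characters as the year.
      year = substr(gsub("[^0-9]", "", head(years, n)), 1, 4),
      stringsAsFactors = FALSE
    )
  })
  # Drop actors whose link failed, then stack the rest into one table.
  do.call(rbind, Filter(Negate(is.null), per_actor))
}
```

With the `cast` vector from the answer above, a call might look like `scrape_cast("http://www.imdb.com/title/tt1490017/", cast[1:3])`. Nothing is requested until the function is called.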
    

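    【Discussion】:

      【Solution 2】:

      This is untested, so it may be wrong. I would step through it and verify that it is correct. I'm not sure how to use follow_link in this situation... but this is what I came up with...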

      library("rvest")
      library("stringr")
      # Parse the title page once and reuse the parsed document
      lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
      nodes <- lego_movie %>% html_nodes(".itemprop , a")
      links <- nodes %>% html_attr("href")
      links[is.na(links)] <- ""
      
      actors <- nodes %>% html_text()
      
      df <- data.frame(name=actors, link=links, stringsAsFactors=FALSE)
      # keep only links of the form /name/..., i.e. person pages
      df <- subset(df, substring(link, 2, 5)=="name")
      df <- subset(df, name!="")
      df$name <- gsub("\\n", "", df$name)
      df$name <- str_trim(df$name)
      df <- df[order(df$name),]
      df <- subset(df, !duplicated(df$name))
      
      get_movies <- function(name, link){
        url <- paste0("http://www.imdb.com", link)
        movies <- read_html(url) %>%
          html_nodes(".year_column , b a") %>%
          html_text()
        # take care of random date at top of some actors stuff...
        if(length(movies)%%2==1){movies <- movies[-1]}
        movies <- gsub("\\n", "", movies)
        movies <- str_trim(movies)
        # matches come back in document order: year, title, year, title, ...
        df <- data.frame(date=movies[seq(1, length(movies), 2)], 
                         movie=movies[seq(2, length(movies), 2)],
                         stringsAsFactors=FALSE)
        df <- cbind(name=rep(name, nrow(df)), df)
        return(df)
      }
      
      # fetch every actor page and stack the per-actor tables
      final_df <- data.frame()
      for(i in seq_len(nrow(df))){
        final_df <- rbind(final_df, get_movies(df$name[i], df$link[i]))
      }
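
The odd-length workaround above relies on years and titles alternating perfectly. One way to avoid that is to select each filmography row first and then pull one title and one year from every row. The sketch below runs offline on a made-up fragment; the `.filmo-row` class and the HTML itself are assumptions for illustration, not taken from the real page.

```r
library(rvest)
library(stringr)

# Hypothetical fragment shaped like one actor's filmography section;
# the real page markup may differ.
page <- read_html('
<div id="filmography">
  <div class="filmo-row"><span class="year_column">&nbsp;2017</span>
    <b><a href="/title/tt0000001/">Lego</a></b></div>
  <div class="filmo-row"><span class="year_column">&nbsp;2014-2015</span>
    <b><a href="/title/tt0000002/">BoJack</a></b></div>
  <div class="filmo-row"><span class="year_column"></span>
    <b><a href="/title/tt0000003/">Wander</a></b></div>
</div>')

# One node per filmography row, then one title and one year per row.
rows <- html_nodes(page, "#filmography .filmo-row")
filmography <- data.frame(
  castMember = rep("Will Arnett", length(rows)),
  movie = rows %>% html_node("b a") %>% html_text(),
  # First four-digit run in the year cell; NA when the cell has none.
  year = rows %>% html_node(".year_column") %>% html_text() %>%
    str_extract("[0-9]{4}"),
  stringsAsFactors = FALSE
)
filmography
```

Because `html_node()` returns exactly one result per row (a missing node when nothing matches), a row without a year yields `NA` rather than shifting every later title against the wrong year.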
      

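      【Discussion】:

      • Thanks for the attempt, but this doesn't really answer the question. According to the rvest documentation (cran.r-project.org/web/packages/rvest/rvest.pdf), the follow_link() function looks like it should cut the code down considerably. But the documentation examples aren't explicit enough to adapt to other situations...
      • My apologies, I'll try harder next time.
      • I can probably adapt the code for my own use... but I was hoping we'd see how to take advantage of the follow_link function!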