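Rvest: Building a queue for several html-links

【Question title】: Rvest: Building a queue for several html-links
【Posted】: 2020-08-31 20:49:11
【Question】:

I am currently web scraping a news magazine, but unfortunately I have no idea how to build a working queue. I can only scrape the content of all articles on a single page, but I would like a queue that automatically does the same for the remaining articles.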

library(rvest)
library(tidyverse)
library(data.table)
library(plyr)
library(writexl)


map_dfc(.x = c("em.entrylist__title", "time.entrylist__time"),
        .f = function(x) {read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>% 
            html_nodes(x) %>% 
            html_text()}) %>%
  bind_cols(url = read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>% 
              html_nodes("a.entrylist__link") %>% 
              html_attr("href")) %>% 
  setNames(nm = c("title", "time", "url")) -> temp

map_df(.x = temp$url[1:50],
       .f = function(x){tibble(url = x,
                               text = read_html(x) %>% 
                                 html_nodes("#article-app-container > article > div.css-isuemq.e1lg1pmy0 > p:nth-child(n)") %>% 
                                 html_text() %>% 
                                 list
       )}) %>% 
  unnest(text) -> foo

foo

X2 <- ddply(foo, .(url), summarize,
            Xc=paste(text,collapse=","))

final <- merge(temp, X2, by="url")

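In this case the search returns 30 pages of articles, but my script only scrapes a single page. The only thing that changes between pages is the page number (https://www.sueddeutsche.de/news/**page/1**?search=...)

I would be very grateful if you could tell me how to include all pages in the queue at once. Many thanks :)

【Comments】:

Hint: lapply, or a for loop

Tags: r web-scraping rvest

【Solution 1】:

How would a queue in the form of a data frame work for you?
The suggestion below is deliberately generic, so it also works outside this specific use case. You can add more URLs to scrape at any time; thanks to dplyr::distinct, only the new ones are kept.
(I have initialized the queue with the first 5 pages you want to scrape. You can add more right away, or add them dynamically if you find links in the DOM...)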

library(dplyr)
library(lubridate)

queue <- tibble(
  url = paste0("https://www.sueddeutsche.de/news/page/", 1:5, "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020"),
  scraped_time = lubridate::NA_POSIXct_
)

results <- list()

while(length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
  i <- open_rows[1]
  url <- queue$url[i]

  [...]
  results[[url]] <- <YOUR SCRAPING RESULT>
  
  queue$scraped_time[i] <- lubridate::now()
  
  if (<MORE PAGES TO QUEUE>) {
    queue <- queue %>%
      tibble::add_row(url = c('www.spiegel.de', 'www.faz.de')) %>%
      arrange(desc(scraped_time)) %>%
      distinct(url, .keep_all = TRUE)
  }
}
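
To see the queue mechanics without touching the network, here is a minimal sketch of the same loop with the placeholders filled in by a stand-in. `scrape_stub` and the example.org URLs are hypothetical and not part of the answer above; in real use the stub would be replaced by your rvest code. Base `as.POSIXct(NA)` and `Sys.time()` are used here in place of the lubridate calls, which is an assumption made only to keep the dependencies small.

```r
library(dplyr)

# Hypothetical stand-in for the real scraper: returns some "data" for the
# URL plus any newly discovered URLs (here, only page 1 reports new links).
scrape_stub <- function(url) {
  list(
    data = tibble(url = url, n_chars = nchar(url)),
    new_urls = if (grepl("page/1$", url)) {
      c("https://example.org/page/2", "https://example.org/page/3")
    } else {
      character(0)
    }
  )
}

# Queue: one row per URL; scraped_time stays NA until the URL is processed.
queue <- tibble(
  url = c("https://example.org/page/1", "https://example.org/page/2"),
  scraped_time = as.POSIXct(NA)
)
results <- list()

while (length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
  i <- open_rows[1]
  url <- queue$url[i]

  out <- scrape_stub(url)
  results[[url]] <- out$data
  queue$scraped_time[i] <- Sys.time()

  if (length(out$new_urls) > 0) {
    queue <- queue %>%
      tibble::add_row(url = out$new_urls) %>%
      # Scraped rows sort before NA rows, so distinct() keeps the scraped
      # copy of a URL and drops the freshly added duplicate.
      arrange(desc(scraped_time)) %>%
      distinct(url, .keep_all = TRUE)
  }
}

# One data frame with a row per processed URL.
all_results <- bind_rows(results)
```

Page 2 is queued twice in this run (once at the start, once as a "discovered" link), but it is only scraped once because the duplicate row is dropped before the loop reaches it.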

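【Discussion】:

First of all, thank you very much! To make the loop work, I need to insert a variable in place of the hard-coded html ("sueddeutsche.de/news/page/…). Which variable do I use in this case, url[i] for indexing? Thanks in advance :)
If you know the number of pages in advance, just adjust the 1:5 in the paste0 statement. It creates a vector of URLs by inserting the page numbers you specify.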
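
As a rough illustration of that last point, the listing-page code from the question could be wrapped in a function and applied to every generated URL. `parse_listing` is a hypothetical helper name, the selectors are copied from the question and may no longer match the live site, and the tiny HTML snippet is made up so the helper can be checked offline. The sketch also assumes each page yields the same number of titles, times and links; otherwise `data.frame()` will fail.

```r
library(rvest)

# Hypothetical helper (not from the original posts): pull title, time and
# link out of one already-parsed listing page, so each page is read once.
parse_listing <- function(doc) {
  data.frame(
    title = html_text(html_nodes(doc, "em.entrylist__title")),
    time  = html_text(html_nodes(doc, "time.entrylist__time")),
    url   = html_attr(html_nodes(doc, "a.entrylist__link"), "href"),
    stringsAsFactors = FALSE
  )
}

# Build one URL per listing page; only the page number varies.
query <- "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020"
page_urls <- paste0("https://www.sueddeutsche.de/news/page/", 1:30, query)

# Offline check of the helper on a made-up page with one entry.
demo_doc <- read_html(paste0(
  '<html><body><div>',
  '<a class="entrylist__link" href="https://example.org/a">link</a>',
  '<em class="entrylist__title">Demo title</em>',
  '<time class="entrylist__time">27.07.2020</time>',
  '</div></body></html>'
))
demo <- parse_listing(demo_doc)

# Against the live site (not run here), each generated URL takes the place
# of the hard-coded one:
# temp <- do.call(rbind, lapply(page_urls, function(u) parse_listing(read_html(u))))
```

Inside the `while` loop of the answer, the equivalent variable is `url` (that is, `queue$url[i]`), which would be passed to `read_html()` in the same way.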