Webscraping over multiple pages in R (answers)

【Question title】: Webscraping over multiple pages in R
【Posted】: 2018-07-05 20:20:08
【Question】:

I want to set up a loop to automate some web scraping across multiple pages. So far, this is my code for a single iteration:

s <- html_session("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")
s <- s %>% jump_to("?pagina=5") %>% read_html()
new <- s %>% html_nodes('div.dp-metadata span') %>% html_text()
type.2 <- s %>% html_nodes('h4') %>% html_text()
title <- s %>% html_nodes('div.dp-texto') %>% html_text()



new <- gsub("Iniciado en: ", "", new)
new <- gsub("Fecha: ", "", new)
new <- gsub("Expediente Diputados:", "", new)
new <- gsub("Expediente Senado:", "", new)
new<- new [-c(3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79)]
chamber <- new[c(1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58)]
billnum <- new[c(2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50, 53, 56, 59)]
fecha <- new[c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60)]


new2 <- data.frame(chamber, billnum, fecha, title, type.2)

The part that should change is the number after "pagina=". However, when I tried the following simplified loop, it just returned an error:

new4 <- data.frame(matrix(nrow=40, ncol=2))
colnames(new4) <- c("title", "type")

for (i in 1:2) {
s <- html_session("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")
s <- s %>% jump_to("?pagina=", i) %>% read_html()
type.2 <- s %>% html_nodes('h4') %>% html_text()
title <- s %>% html_nodes('div.dp-texto') %>% html_text()



new4[i, 1] <- title
new4[i, 2] <- type.2

}

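Again, this simplified loop, which scrapes only two of the five fields I need, does not work and returns the error: Error in f(init, x[[i]]) : is.request(y) is not TRUE. I suspect the problem is with running the html_session() and jump_to() commands inside a loop. I would like to know how to automate this in a loop so that I do not have to scrape thousands of pages by hand.

I even tried using lapply over a vector, but I am not very confident writing my own functions, and all the templates I have seen use a simple read_html() call, so I am not sure how I would incorporate html_session() and jump_to() into a function.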

【Question comments】:

Tags: html r loops web-scraping

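【Solution 1】:

You almost had it. The main problem in your code is jump_to("?pagina=", i); it should be jump_to(paste0("?pagina=", i)), because jump_to() takes the URL as a single string and the page number has to be pasted into it first. Here is a complete solution: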

library(rvest)
#> Loading required package: xml2

sesh <- html_session("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")

scrape_one_page <- function(sesh, i) {
  one_page <- sesh %>% jump_to(paste0("?pagina=", i)) %>% read_html()
  new <- one_page %>% html_nodes('div.dp-metadata span') %>% html_text()
  type.2 <- one_page %>% html_nodes('h4') %>% html_text()
  title <- one_page %>% html_nodes('div.dp-texto') %>% html_text()

  new <- gsub("Iniciado en: ", "", new)
  new <- gsub("Fecha: ", "", new)
  new <- gsub("Expediente Diputados:", "", new)
  new <- gsub("Expediente Senado:", "", new)
  # drop elements 3, 7, ..., 79, then take every third element for each field
  new <- new[-seq(3, 79, by = 4)]
  chamber <- new[seq(1, 58, by = 3)]
  billnum <- new[seq(2, 59, by = 3)]
  fecha <- new[seq(3, 60, by = 3)]

  data.frame(chamber, billnum, fecha, title, type.2, stringsAsFactors = FALSE)
}

# This call to lapply basically says:
# ...for each element in 1:5 (i.e., 1, 2, 3, 4, 5):
#   ...call scrape_one_page() with the argument `i` equal to that element and the argument `sesh` equal to `sesh`
# ...then put the results in a list

# so it basically results in:
# outs <- list(
#  scrape_one_page(sesh, 1),
#  scrape_one_page(sesh, 2),
#  scrape_one_page(sesh, 3),
#  scrape_one_page(sesh, 4),
#  scrape_one_page(sesh, 5)
# )
outs <- lapply(1:5, scrape_one_page, sesh = sesh)

# then here with do.call(rbind) we combine the list of data frames into a single data frame 
df <- do.call(rbind, outs)

# and finally print the first few rows of the data frame
head(df)
#>     chamber      billnum      fecha
#> 1 Diputados  3967-D-2018 29/06/2018
#> 2 Diputados  3966-D-2018 29/06/2018
#> 3 Diputados  3965-D-2018 29/06/2018
#> 4 Diputados  3964-D-2018 29/06/2018
#> 5 Diputados  3963-D-2018 29/06/2018
#> 6 Diputados  3962-D-2018 29/06/2018
#>                                                                                                                                     title
#> 1 SOLICITAR AL PODER EJECUTIVO DISPONGA LAS MEDIDAS NECESARIAS PARA ASEGURAR LA CONTINUIDAD DEL CICLO LECTIVO EN LA PROVINCIA DEL CHUBUT.
#> 2                                                         EXPRESAR REPUDIO POR LA POLITICA INMIGRATORIA DE LOS ESTADOS UNIDOS DE AMERICA.
#> 3                  EXPRESAR REPUDIO POR EL FRAUDE COMETIDO EL 23 DE JUNIO 2018 EN LA "FEDERACION UNIVERSITARIA DE BUENOS AIRES - FUBA -".
#> 4                         EXPRESAR REPUDIO POR LA REPRESION DE FUERZAS POLICIALES Y DE SEGURIDAD, CONTRA LOS TRABAJADORES DE CRESTA ROJA.
#> 5                          PROHIBENSE LOS DESPIDOS DE LA AGENCIA DE NOTICIAS ESTATAL TELAM S.E. POR EL TERMINO DE 24 MESES PRORROGABLES. 
#> 6              DECLARESE LA SEMANA QUE CONTIENE EL 26 DE JUNIO DE CADA AÑO COMO "SEMANA NACIONAL DE LA PREVENCION DEL CONSUMO DE DROGAS".
#>                   type.2
#> 1 PROYECTO DE RESOLUCIÓN
#> 2 PROYECTO DE RESOLUCIÓN
#> 3 PROYECTO DE RESOLUCIÓN
#> 4 PROYECTO DE RESOLUCIÓN
#> 5        PROYECTO DE LEY
#> 6        PROYECTO DE LEY
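
A minimal sketch of why paste0() matters here, runnable without network access. jump_to() takes the target URL as a single string, so in jump_to("?pagina=", i) the page number is passed as a separate extra argument rather than appended to the URL. The names base_url, pages, relative and full are assumptions made for illustration, and the base URL is written without the trailing "?" used above.

```r
# Illustrative values (assumptions, not part of the original answer)
base_url <- "https://www.hcdn.gob.ar/proyectos/resultados-buscador.html"
pages <- 1:3

# paste0() concatenates without a separator and is vectorised over `pages`
relative <- paste0("?pagina=", pages)
full <- paste0(base_url, relative)

print(relative)
print(full)
```

Each element of relative is the kind of string that jump_to() expects for one page.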

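【Comments】:

Thank you very much for your help; the code is very simple and does exactly what I need. I feel silly for not including the paste0() command. Would you mind walking through each step of this code? I would really appreciate being able to fully understand it, since I am very new to coding.
Most of the code was written by you, wasn't it? If so, I assume you know what it does?
Yes, I wrote it. I was only referring to the first part of the function (scrape_one_page) and the lapply part at the end. The main part of the function makes sense to me, since it is part of the original code I wrote.
OK, hope that helps.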
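
The two parts mentioned here (a function taking sesh and i, and the lapply call) can be sketched offline as follows. fake_scrape is a hypothetical stand-in for scrape_one_page() that builds a small data frame instead of downloading a page, and "demo-session" is a placeholder string rather than a real session object.

```r
# Hypothetical stand-in for scrape_one_page(): same argument order (sesh, i),
# but it returns two made-up rows per page instead of scraping anything
fake_scrape <- function(sesh, i) {
  data.frame(page = i, source = sesh, row = 1:2, stringsAsFactors = FALSE)
}

# lapply() passes each element of 1:3 as the first unmatched argument (i),
# because `sesh` is already supplied by name
outs <- lapply(1:3, fake_scrape, sesh = "demo-session")

# do.call(rbind, outs) is the same as rbind(outs[[1]], outs[[2]], outs[[3]])
combined <- do.call(rbind, outs)
print(combined)
```

The result is one data frame with the rows of every page stacked in page order.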
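
For readers who prefer the for loop from the question, a hedged sketch of an equivalent structure. Besides the missing paste0(), a line such as new4[i, 1] <- title tries to store a whole vector of titles in a single cell, so it is usually easier to keep one data frame per page in a list and bind them at the end. fake_page is a hypothetical stand-in for the per-page scrape, and 20 results per page is an assumption based on the index vectors above.

```r
# Hypothetical stand-in for scraping one page: 20 titles and 20 types
fake_page <- function(i) {
  data.frame(
    title = paste0("title-", i, "-", 1:20),
    type = rep("PROYECTO DE LEY", 20),
    stringsAsFactors = FALSE
  )
}

n_pages <- 2
pages <- vector("list", n_pages)  # preallocate one slot per page

for (i in seq_len(n_pages)) {
  pages[[i]] <- fake_page(i)      # store the whole data frame, not a single cell
}

# stack the per-page data frames into one
new4 <- do.call(rbind, pages)
print(dim(new4))
```

With real scraping, fake_page(i) would be replaced by a call such as scrape_one_page(sesh, i).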