Scraping/accessing all search results from input field

【Question title】: Scraping/accessing all search results from input field
【Posted】: 2022-01-01 08:56:05
【Question description】:

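I want to scrape https://www.deutsche-biographie.de/ using rvest. A name has to be entered in the input field at the top of the page. The search results then list all persons with this name or a similar one.

For example, I entered the name 'Meier' and scraped the corresponding search results with the following code.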

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>% 
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")

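# Take the first place matching a pattern, or NA if there is none,
# so each column keeps exactly one value per search result
pick_place <- function(x, pattern) { hit <- x[grepl(pattern, x)]; if (length(hit) > 0) hit[1] else NA_character_ }
result$place_of_birth <- vapply(places, pick_place, character(1), pattern = "@geburt$")
result$place_of_death <- vapply(places, pick_place, character(1), pattern = "@tod$")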
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

result <- result %>% 
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

result

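The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", where name=meier is the name I entered manually. Is there a way to access all names/search results without having to specify one particular name? Thanks a lot for any hints!

Update with solution: As suggested by @QHarr, I added a for loop that iterates over all pages: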

    for (page_result in seq( from = 1, to = 2369 )) {
      link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                    page_result)
...}

So the complete code looks as follows:

result_total = data.frame()

for (page_result in seq( from = 1, to = 2369 )) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  
  # Download the page once and parse the local copy, to prevent
  # 'error in open.connection(x, "rb") : Timeout was reached'
  download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
  page = read_html("scrapedpage.html")
  name = page %>% html_nodes(".media-heading a") %>% html_text()
  information = page %>% html_nodes("div.media-body p") %>% html_text()
  result = data.frame(name, information)
  result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
  result <- tidyr::unnest_wider(result, information) %>%
    rename(years = 2, profession = 3) %>% 
    tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")
  
  places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
  
  result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)])
  result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)])
  result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
  
  result <- result %>% 
    tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
    tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
  
  print(paste("Page:", page_result)) #track the page that R is currently looping over
  result_total <- rbind(result_total, result)
}


result_total <- apply(result_total,2,as.character)

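【Question discussion】:

Tags: r web-scraping rvest

【Solution 1】:

Use the "*" operator to match everything. However, you will still need to retrieve the results page by page: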

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*

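You can get the total result count from the initial request. Since results come in batches of 10 and the pagination is reflected in the URL, you can then issue requests for as many pages as are needed to cover that total. Individual pages look like:

Page 1: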

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=0

....

Page 11:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=10
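
The page arithmetic described above can be sketched in R roughly as follows. This is only an illustration: the helper name `build_page_urls` and the total of 95 hits are made up, and it assumes that `number` is a zero-based page index with 10 hits per page, as the two example URLs suggest. The `_csrf` token is copied from the URLs above and may need to be refreshed.

```r
# Hypothetical helper: build one search URL per result page.
# Assumes 10 hits per page and a zero-based `number` parameter.
build_page_urls <- function(total_hits, per_page = 10,
                            base = "https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*") {
  n_pages <- ceiling(total_hits / per_page)  # the last page may be only partly filled
  if (n_pages < 1) return(character(0))      # no hits, no URLs
  paste0(base, "&number=", seq_len(n_pages) - 1)
}

urls <- build_page_urls(total_hits = 95)  # 95 is a made-up total for illustration
length(urls)                              # 10 pages: number=0 through number=9
```

In practice `total_hits` would be read from the first results page; how to extract it depends on the page markup, which is not shown here.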

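Issue the requests in parallel and gather the results. Consider a polite wait time, depending on the total number of requests needed.

【Discussion】:

Thanks a lot - it works really well! I updated the code with the link you suggested.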
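
One way to add a polite wait is sketched below. Note that it fetches pages sequentially rather than in parallel, which is simpler and gentler on the server. The names `fetch_politely` and `fetch_one` and the one-second default delay are assumptions for illustration; `fetch_one` stands for any function that takes a URL and returns a data frame, such as a wrapper around the rvest parsing code from the question.

```r
# Hypothetical helper: fetch pages one by one with a pause between requests.
fetch_politely <- function(urls, fetch_one, delay = 1) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # Keep going if a single page fails; failed pages are skipped
    res <- tryCatch(fetch_one(urls[i]), error = function(e) NULL)
    if (!is.null(res)) pages[[i]] <- res
    if (i < length(urls)) Sys.sleep(delay)  # polite wait between requests
  }
  # Drop skipped pages and bind once, instead of growing a data frame in the loop
  pages <- Filter(Negate(is.null), pages)
  do.call(rbind, pages)
}

# Stand-in for a real scraper, so the sketch runs without network access
fake_fetch <- function(url) data.frame(url = url, stringsAsFactors = FALSE)
demo <- fetch_politely(c("a", "b", "c"), fake_fetch, delay = 0)
nrow(demo)  # 3
```

Collecting the pages in a list and binding them once at the end avoids the repeated copying that `rbind` inside a loop causes over a few thousand pages.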