【Question title】: Scraping/accessing all search results from input field
【Posted】: 2022-01-01 08:56:05
【Question】:

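I want to scrape https://www.deutsche-biographie.de/ using rvest. A name has to be entered into the input field at the top of this page. The search results then list all persons with this name or a similar one.

For example, I entered the name 'Meier' and then scraped the corresponding search results with the following code.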

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>% 
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")

result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

result <- result %>% 
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

result

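The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", where name=meier is the name I entered manually. Is there a way to access all names/search results without having to specify one particular name? Thanks a lot for any hints!

Updated solution: As @QHarr suggested, I added a for loop that iterates over all pages: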
    for (page_result in seq( from = 1, to = 2369 )) {
      link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                    page_result)
...}

So the full code is as follows:

result_total = data.frame()

for (page_result in seq( from = 1, to = 2369 )) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  
  # Download first, then parse the local copy, to prevent
  # 'error in open.connection(x, "rb") : Timeout was reached'
  download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
  page = read_html("scrapedpage.html")
  name = page %>% html_nodes(".media-heading a") %>% html_text()
  information = page %>% html_nodes("div.media-body p") %>% html_text()
  result = data.frame(name, information)
  result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
  result <- tidyr::unnest_wider(result, information) %>%
    rename(years = 2, profession = 3) %>% 
    tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")
  
  places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
  
  result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)])
  result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)])
  result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
  
  result <- result %>% 
    tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
    tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
  
  print(paste("Page:", page_result)) #track the page that R is currently looping over
  result_total <- rbind(result_total, result)
}


result_total <- apply(result_total,2,as.character)

【Comments】:

    Tags: r web-scraping rvest


    【Solution 1】:

    Use the "*" wildcard to match all names. You will still need to retrieve the results page by page, though.

    https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*
    

    You can get the total result count from the initial request. Then, since results come in batches of 10 and the pagination is reflected in the URL, issue requests for as many pages as are needed to cover that total. Individual pages look like:

    Page 1:

    https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=0

    ....

    Page 11:

    https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=10
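
    The page arithmetic described above can be sketched roughly as follows. This is only an illustration: `build_page_urls` is a hypothetical helper, `total_results` has to be read from the first response yourself (how the site exposes that count is not shown here), and the sketch assumes `number` is a zero-based page index, as the two example URLs suggest.

```r
# Hypothetical helper: build one search URL per result page.
# Assumptions: 10 results per page, and `number` counts pages from 0.
build_page_urls <- function(total_results,
                            per_page = 10,
                            base_url = "https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*") {
  # Number of pages needed to cover all results
  n_pages <- ceiling(total_results / per_page)
  if (n_pages == 0) {
    return(character(0))
  }
  # Page indices 0, 1, ..., n_pages - 1
  numbers <- seq(from = 0, length.out = n_pages)
  paste0(base_url, "&number=", numbers)
}

# Example: 25 results need 3 pages (number = 0, 1, 2)
urls <- build_page_urls(25)
```

    If the site turns out to count pages from 1, or to treat `number` as a result offset, only the `seq()` call needs to change.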


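    Issue the requests in parallel and collect the results. Consider adding a polite wait time, depending on the total number of requests needed.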
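
    As a rough sketch of the polite wait, the loop below pauses between requests and collects the results in a list. Note that it is deliberately sequential rather than parallel, which keeps the load on the server low. `fetch_politely`, `fetch_fun`, and `delay_seconds` are hypothetical names introduced for illustration; the one-second default is an assumption, not a value the site prescribes.

```r
# Hypothetical helper: fetch each URL in turn, pausing between requests.
# `fetch_fun` is any function taking a URL, e.g. rvest::read_html.
fetch_politely <- function(urls, fetch_fun, delay_seconds = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # Store NULL for a failed request instead of aborting the whole run
    results[i] <- list(tryCatch(fetch_fun(urls[[i]]), error = function(e) NULL))
    # Wait before the next request, but not after the last one
    if (i < length(urls)) {
      Sys.sleep(delay_seconds)
    }
  }
  results
}
```

    With rvest loaded, a call such as `pages <- fetch_politely(urls, rvest::read_html)` would return one parsed page per URL, with `NULL` in the slots where a request failed, so those pages can be retried afterwards.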

    【Comments】:

    • Thanks a lot - it works really well! I updated my code to include the link you suggested.