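【Posted】: 2022-01-01 08:56:05
【Question】:
I want to scrape https://www.deutsche-biographie.de/ with rvest. A name has to be entered in the input field at the top of the page; the search results then list every person with that name or a similar one.
For example, I entered the name 'Meier' and scraped the corresponding search results with the following code.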
library(rvest)
library(dplyr)
page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
# name and free-text description of every search hit
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
# split the description on ", " followed by a newline
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>%
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")
# places are stored in the data-orte attribute, separated by ";"
places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
# unlist() assumes every hit has exactly one place of birth and one place of death
result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
# separate each place into its name and its coordinates
result <- result %>%
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
result
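To make the two regular expressions easier to follow, here is a minimal offline sketch that applies them to single strings using base R only. The sample values (`years`, `orte`) are made up for illustration; I am only assuming the "name@coordinates@type" layout that the patterns above imply, and the real attribute values on the site may look different.

```r
# Hypothetical sample values, not taken from the website.
years <- "1850 - 1920"
orte  <- "Berlin@52.52,13.40@geburt;Hamburg@53.55,9.99@tod;Kiel@54.32,10.13@wirk"

# Same pattern as in tidyr::extract(years, ...): capture the first
# four-digit year and the four-digit year that follows "- ".
year_match <- regmatches(
  years,
  regexec("^.*?(\\d{4}).*?\\-\\s(\\d{4})", years, perl = TRUE)
)[[1]]
year_of_birth <- year_match[2]
year_of_death <- year_match[3]

# Split the attribute on ";" and keep the entry that ends in "@geburt".
places <- strsplit(orte, ";")[[1]]
birth  <- places[grepl("@geburt$", places)]

# Same pattern as in tidyr::extract(place_of_birth, ...): the text before the
# first "@" is the place name, the text up to the second "@" the coordinates.
birth_match <- regmatches(
  birth,
  regexec("^(.*?)@(.*?)@.*$", birth, perl = TRUE)
)[[1]]
place_of_birth       <- birth_match[2]
place_of_birth_coord <- birth_match[3]

print(c(year_of_birth, year_of_death, place_of_birth, place_of_birth_coord))
```

An entry without an "@geburt" element would leave `birth` empty, which is the case the `unlist()` calls above cannot handle.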
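The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", where name=meier is the name I entered manually. Is there a way to access all names/search results without having to specify one particular name?
Many thanks for any hints!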
Update with solution: As @QHarr suggested, I added a for loop that iterates over all result pages:
for (page_result in seq(from = 1, to = 2369)) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  ...
}
So the complete code is as follows:
library(rvest)
library(dplyr)
result_total = data.frame()
for (page_result in seq(from = 1, to = 2369)) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  # download the page first and parse the local copy to prevent
  # 'Error in open.connection(x, "rb") : Timeout was reached'
  download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
  page = read_html("scrapedpage.html")
  name = page %>% html_nodes(".media-heading a") %>% html_text()
  information = page %>% html_nodes("div.media-body p") %>% html_text()
  result = data.frame(name, information)
  result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
  result <- tidyr::unnest_wider(result, information) %>%
    rename(years = 2, profession = 3) %>%
    tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")
  places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
  # no unlist() here: a hit may have no place of birth or death
  result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)])
  result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)])
  result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
  result <- result %>%
    tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
    tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
  print(paste("Page:", page_result)) # track the page that R is currently looping over
  result_total <- rbind(result_total, result)
}
# convert every column to character (this returns a character matrix)
result_total <- apply(result_total, 2, as.character)
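With 2369 pages, a single timeout aborts the whole loop, and calling rbind() on every iteration gets slower as the result grows. Below is a minimal sketch of one possible way around both problems, not something I have run against the site: each page is wrapped in tryCatch() and the results are bound together once at the end. `scrape_page()` is a hypothetical stand-in for the per-page code above; here it only returns dummy data and fails on page 2 to simulate a timeout.

```r
# Hypothetical stand-in for the per-page scraping code; it returns dummy
# data and raises an error on page 2 to simulate a timeout.
scrape_page <- function(page_result) {
  if (page_result == 2) stop("Timeout was reached")
  data.frame(page = page_result, name = paste("Person", page_result))
}

pages <- 1:3

# Scrape every page; a failing page yields NULL instead of stopping the run.
results <- lapply(pages, function(p) {
  tryCatch(
    scrape_page(p),
    error = function(e) {
      message("Skipping page ", p, ": ", conditionMessage(e))
      NULL
    }
  )
})

# Remember which pages failed so they can be retried later.
failed_pages <- pages[vapply(results, is.null, logical(1))]

# Drop the failed pages and bind the remaining data frames once.
results <- Filter(Negate(is.null), results)
result_total <- do.call(rbind, results)

print(result_total)
print(failed_pages)
```

For the real run, `pages` would be `seq(from = 1, to = 2369)`, and a short `Sys.sleep()` inside the function would be a polite addition.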
【Discussion】:
Tags: r web-scraping rvest