Multiple pages in Rvest: answers

【Question title】: Multiple pages in Rvest
【Posted】: 2019-01-13 08:21:05
【Question description】:

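I am using rvest in R for web scraping. I am trying to get data from a search that returns 12 pages of results. I wrote code that iterates over the pages to collect the data from each one, but it only collects the first page, over and over. Here is a sample of my code.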

# New method for Pagination
url_base <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?SortBy=1&Distance=400&ResultsPerPage=10&Name=e.g.%20Singh%20or%20John%20Smith&Specialty=230&Location.Id=0&Location.Name=e.g.%20postcode%20or%20town&Location.Longitude=0&Location.Latitude=0&CurrentPage=1&OnlyViewConsultantsWithOutcomeData=False"
map_df(1:12, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base,i))
  data.frame(consultant_name = html_text(html_nodes(pg,".consultants-list h2 a")))
  
}) -> names

dplyr::glimpse(names)

Edited version of the code:

# New method for Pagination
url_base  <-  "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?ResultsPerPage=100&defaultConsultantName=e.g.+Singh+or+John+Smith&DefaultLocationText=e.g.+postcode+or+town&DefaultSearchDistance=25&Name=e.g.+Singh+or+John+Smith&Specialty=230&Location.Name=e.g.+postcode+or+town&Location.Id=0&CurrentPage=%d"
map_df(1:12, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base,i))
  data.frame(consultant_name = html_text(html_nodes(pg,".consultants-list h2 a")),
             gmc_no = gsub("GMC membership number: ","",html_text(html_nodes(pg,".consultants-list .name-number p"))),
             Speciality = html_text(html_nodes(pg,".consultants-list .specialties ul li")),
             location = html_text(html_nodes(pg,".consultants-list .consultant-services ul li")),stringsAsFactors=FALSE)
  
}) -> names

dplyr::glimpse(names)

The code above runs for 8 iterations and collects 800 rows (100 rows per page), but then fails with an error.

............ Error in data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")),  : 
  arguments imply differing number of rows: 100, 101
Called from: data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")), gmc_no = gsub("GMC membership number: ", "", html_text(html_nodes(pg, ".consultants-list .name-number p"))), Speciality = html_text(html_nodes(pg, ".consultants-list .specialties ul li")), location = html_text(html_nodes(pg, ".consultants-list .consultant-services ul li")), stringsAsFactors = FALSE)
Browse[1]>

I tried changing the number of loop iterations, but that did not help.

Please help me solve this problem!

【Question discussion】:

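You are not specifying in the URL which page you are requesting, so how would the site know that you want the next one? Look at the pagination URLs and you will see what changes. Follow that pattern, or grab the URL under the 'Next' anchor until you reach the end (no 'Next' anchor).
@LukasS I did specify it in my URL; it starts from page 1. The URL is "nhs.uk/service-search/Hospital/LocationSearch/7/…"
I can see that clearly, but your URL is wrong. Look at the pagination links.
@LukasS I have updated the URL in my question. The pages are still not being read. This is the URL I get; I do not know which link you are referring to.
You need to increment the appropriate value after each successful response or, more simply, extract the URL from the 'Next' anchor. That way you satisfy both conditions at once (computing the next value and checking whether there are more pages to crawl).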
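
To make the point about the URL concrete: `sprintf()` only inserts the page number when the format string contains a placeholder such as `%d`. The following is a minimal sketch using shortened, hypothetical `example.org` URLs (not the real NHS address) to show the difference between a fixed `CurrentPage=1` and a `CurrentPage=%d` template.

```r
# Hypothetical, shortened URLs for illustration only.
fixed_url    <- "https://example.org/search?Specialty=230&CurrentPage=1"
template_url <- "https://example.org/search?Specialty=230&CurrentPage=%d"

# No placeholder: sprintf() returns the same string for every i.
# Recent R versions warn about the unused argument, so the warning is muted here.
urls_fixed <- suppressWarnings(
  vapply(1:3, function(i) sprintf(fixed_url, i), character(1))
)

# With %d: each page number produces its own URL.
urls_paged <- sprintf(template_url, 1:3)

print(unique(urls_fixed))
print(urls_paged)
```

Note that any literal `%` in a `sprintf()` template, for example a URL-encoded `%20`, has to be written as `%%`; otherwise `sprintf()` tries to interpret it as a format specification.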

Tags: r web-scraping pagination dplyr rvest

【Solution 1】:

Here is what I came up with after looking at the pattern of the URLs.

library(tidyverse)
library(rvest)

base_url <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?Specialty="

# change the code to pull other specialities
specialty_code = 230 # ie. Anaesthesia services = 230

# show 100 per page    
tgt_url <- str_c(base_url,specialty_code,"&ResultsPerPage=100&CurrentPage=")

pg <- read_html(tgt_url)

# count the total results and set the page count
res_cnt <- pg %>% html_nodes('.fcresultsinfo li:nth-child(1)') %>% html_text() %>% str_remove('.* of ') %>% as.numeric()
pg_cnt = ceiling(res_cnt / 100)

res_all <- NULL
for (i in 1:pg_cnt) {

  pg <- read_html(str_c(tgt_url, i))
  res_pg <- tibble(
    consultant_name = pg %>% html_nodes(".consultants-list h2 a") %>% html_text(),
    gmc_no = pg %>% html_nodes(".consultants-list .name-number p") %>% html_text() %>%
      str_remove("GMC membership number: "),
    speciality = pg %>% html_nodes(".consultants-list .specialties ul") %>%
      html_text() %>% str_replace_all(', \r\n\\s+', ', ') %>% str_trim(),
    location = pg %>% html_nodes(".consultants-list .consultant-services ul") %>%
      html_text() %>% str_replace_all(', \r\n\\s+', ', ') %>% str_trim(),
    src_link = pg %>% html_nodes(".consultants-list h2 a") %>% html_attr('href')
  )

  res_all <- res_all %>% bind_rows(res_pg)

}

This is what I got:

> nrow(res_all)
## [1] 1141
> res_all %>% select(1:4) %>% tail()
## # A tibble: 6 x 4
##  consultant_name      gmc_no  speciality           location                                        
##  <chr>                <chr>   <chr>                <chr>                                           
## 1 Mark Yeates          4716345 Anaesthesia services The Great Western Hospital                      
## 2 Steven Yentis        2939700 Anaesthesia services Chelsea and Westminster Hospital                
## 3 Louise Young         6139457 Anaesthesia services Southampton General Hospital                    
## 4 Andreas Zafiropoulos 6075484 Anaesthesia services Shrewsbury and Telford Hospital NHS Trust       
## 5 Suhail Zaidi         4239598 Anaesthesia services Luton and Dunstable Hospital                    
## 6 Cezary Zugaj         4751331 Anaesthesia services Oxford University Hospitals NHS Foundation Trust
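
The page-count step in the code above can be checked in isolation. The sketch below uses base R's `sub()` in place of `str_remove()`. The summary string is a hypothetical example of what `.fcresultsinfo li:nth-child(1)` might contain (the real wording may differ), while the total of 1141 is taken from the row count reported above.

```r
# Hypothetical summary text; only the trailing "of <total>" part matters here.
info_text <- "Showing 1 - 100 of 1141"
per_page  <- 100

# Drop everything up to and including " of ", then convert to a number.
res_cnt <- as.numeric(sub(".* of ", "", info_text))

# Round up so that the final, partially filled page is still requested.
pg_cnt <- ceiling(res_cnt / per_page)

print(c(results = res_cnt, pages = pg_cnt))
```

Rounding down here would silently drop the last 41 results, which is why `ceiling()` is used rather than integer division.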

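
A likely reason for the "differing number of rows: 100, 101" error in the question is that the `ul li` selectors return one element per list item, not one per consultant, so a single consultant with two entries yields an extra row. The answer sidesteps this by selecting the enclosing `ul`. Another option is to extract the fields per consultant container, as in this sketch. The HTML is hypothetical: the `.consultant` wrapper and the names are assumptions made for illustration, and only the other class names mirror the selectors used above.

```r
library(rvest)

# Hypothetical markup: two consultants, the second with two specialties.
html <- '
<div class="consultants-list">
  <div class="consultant">
    <h2><a href="/c/1">Ann Example</a></h2>
    <div class="specialties"><ul><li>Anaesthesia services</li></ul></div>
  </div>
  <div class="consultant">
    <h2><a href="/c/2">Bob Sample</a></h2>
    <div class="specialties"><ul><li>Anaesthesia services</li><li>Pain management</li></ul></div>
  </div>
</div>'
pg <- read_html(html)

# Page-wide selectors give vectors of different lengths (names vs. list items),
# which is what makes data.frame() fail.
n_names <- length(html_nodes(pg, ".consultants-list h2 a"))
n_items <- length(html_nodes(pg, ".consultants-list .specialties ul li"))

# Per-consultant extraction: one row per container, list items collapsed.
cards <- html_nodes(pg, ".consultants-list .consultant")
res <- data.frame(
  consultant_name = html_text(html_node(cards, "h2 a")),
  speciality = vapply(cards, function(card) {
    paste(html_text(html_nodes(card, ".specialties li")), collapse = ", ")
  }, character(1)),
  stringsAsFactors = FALSE
)

print(c(names = n_names, items = n_items))
print(res)
```

Because every column is derived from the same set of container nodes, the columns always have the same length, whatever the number of list items per consultant.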