【Question title】: Obtain URLs from search using rvest
【Posted】: 2019-09-20 09:54:14
【Question】:

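I am trying to obtain the URLs that result from searching the Western Union website for the provinces of Nigeria. Specifically, I want to search the webpage for each province in a vector, keep the corresponding URL for each search, and then scrape each link obtained. I know how to do the second step, but not the first. My code so far is: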

#install.packages("selectr")
#install.packages("xml2")
library(selectr)
library(xml2)
library(rvest)
library(xlsx)
provinces = as.vector(read.xlsx("provinces.xls", 1)[,1])

URL <- "https://locations.westernunion.com/search/nigeria/"
webpage <- read_html(URL)

But now I do not know how to search for each province in the vector above and store its URL.

【Comments】:

    Tags: r web screen-scraping rvest


    【Answer 1】:

    We can extract the href attribute of every a tag inside a div with class info, and keep only the values that end with "Nigeria".

    library(rvest)
    library(dplyr)
    
    URL <- "https://locations.westernunion.com/search/nigeria/"
    
    URL %>%
      read_html() %>%
      html_nodes("div.info a") %>%
      html_attr("href") %>%
      grep("Nigeria$", ., value = TRUE)
    
    #[1] "/ng/ebonyi/onueke/47908be48d424b6fba108b020c60b517?loc=+Nigeria"      
    #[2] "/ng/plateau/plateau/393aa00a34ded9201b3c0c2fd70c02b3?loc=+Nigeria"    
    #[3] "/ng/bayelsa/otuoke/046d3ae90f58169a7cc896b96e9ccfac?loc=+Nigeria"     
    #[4] "/ng/ogun/abeokuta/fab00c55961bc48312029f13e7b75277?loc=+Nigeria"      
    #[5] "/ng/ogun/idi-iroko/63803a3c50d4cb4b44f473cfd8cb96b1?loc=+Nigeria"     
    #[6] "/ng/-/akwaibom/4c1dd6c2953a0d396500157d97ddf0ca?loc=+Nigeria"  
    #....
    

    However, these are only relative paths, not complete URLs. To get the full URL, prepend "https://locations.westernunion.com" to each extracted path:

    URL %>%
      read_html() %>%
      html_nodes("div.info a") %>%
      html_attr("href") %>%
      grep("Nigeria$", ., value = TRUE) %>%
      paste0("https://locations.westernunion.com", .)
    
    #[1] "https://locations.westernunion.com/ng/ebonyi/onueke/47908be48d424b6fba108b020c60b517?loc=+Nigeria"      
    #[2] "https://locations.westernunion.com/ng/plateau/plateau/393aa00a34ded9201b3c0c2fd70c02b3?loc=+Nigeria"    
    #[3] "https://locations.westernunion.com/ng/bayelsa/otuoke/046d3ae90f58169a7cc896b96e9ccfac?loc=+Nigeria"     
    #[4] "https://locations.westernunion.com/ng/ogun/abeokuta/fab00c55961bc48312029f13e7b75277?loc=+Nigeria"      
    #[5] "https://locations.westernunion.com/ng/ogun/idi-iroko/63803a3c50d4cb4b44f473cfd8cb96b1?loc=+Nigeria"     
    #[6] "https://locations.westernunion.com/ng/-/akwaibom/4c1dd6c2953a0d396500157d97ddf0ca?loc=+Nigeria" 
    #....
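    The question asks for URLs per province, and the paths above appear to carry the province as their second segment (/ng/<province>/<town>/<id>). A minimal base R sketch of grouping the URLs that way follows. The `provinces` vector here is hypothetical, standing in for the spreadsheet column, and the name normalisation (lower case, non-letters replaced by "-") is an assumption about how the site builds its slugs, not something the page confirms.

```r
# Base URL taken from the answer above.
base_url <- "https://locations.westernunion.com"

# A few relative paths copied from the output above; in practice this would
# be the full vector returned by the rvest pipeline.
hrefs <- c(
  "/ng/ebonyi/onueke/47908be48d424b6fba108b020c60b517?loc=+Nigeria",
  "/ng/plateau/plateau/393aa00a34ded9201b3c0c2fd70c02b3?loc=+Nigeria",
  "/ng/ogun/abeokuta/fab00c55961bc48312029f13e7b75277?loc=+Nigeria",
  "/ng/ogun/idi-iroko/63803a3c50d4cb4b44f473cfd8cb96b1?loc=+Nigeria",
  "/ng/-/akwaibom/4c1dd6c2953a0d396500157d97ddf0ca?loc=+Nigeria"
)

# Hypothetical province names standing in for the spreadsheet column.
provinces <- c("Ebonyi", "Ogun", "Lagos")

# Build the absolute URLs before subsetting: paste0() applied to an empty
# vector would return the bare base URL instead of an empty result.
full_urls <- paste0(base_url, hrefs)

# Splitting "/ng/ebonyi/onueke/..." on "/" gives "", "ng", "ebonyi", ...,
# so the province slug is the third piece.
slugs <- vapply(strsplit(hrefs, "/", fixed = TRUE),
                function(p) p[3], character(1))

# Assumed normalisation: lower case, runs of non-letters replaced by "-".
keys <- gsub("[^a-z]+", "-", tolower(provinces))

# Named list: one character vector of URLs per province.
urls_by_province <- lapply(setNames(keys, provinces),
                           function(k) full_urls[slugs == k])

lengths(urls_by_province)
```

    Provinces with no matching path get an empty vector. Paths whose province segment is "-" (such as the akwaibom entry above) will not match any name and would need to be handled separately.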
    

    These URLs can now be used for step 2 of the process.
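    For step 2, one possible approach is to fetch each URL in turn, pausing between requests and skipping pages that fail instead of aborting the whole run. The helper below is a hypothetical sketch, not part of rvest: `fetch_all`, its `fetch` argument, and the one-second default delay are all assumptions made for illustration.

```r
# Hypothetical helper: apply `fetch` to each URL, waiting `delay` seconds
# before every request. A URL whose fetch raises an error yields NULL.
fetch_all <- function(urls, fetch, delay = 1) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(delay)
    tryCatch(
      fetch(u),
      error = function(e) {
        message("Failed to fetch ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(pages) <- urls  # keep track of which result came from which URL
  pages
}

# Intended use (needs network access), where `urls` is the character
# vector of complete URLs built earlier:
# pages <- fetch_all(urls, rvest::read_html)
# ok    <- Filter(Negate(is.null), pages)  # drop the failed requests
```

    Passing the fetch function as an argument keeps the loop independent of rvest, so it can be tried out with a stand-in function before hitting the real site.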

    【Comments】:
