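【Question title】: Extracting URLs ending in .pdf from a list of HTML nodes in R
【Posted】: 2021-01-28 05:21:41
【Question】:

I have a list of URLs, and each page behind those URLs contains a link to a PDF document. I would like to use R to extract and download the PDF documents. This is my code so far: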

Download data from reliefweb.int

library(httr)
library(jsonlite)
library(rvest)

# Get all the results for the Afghanistan HNO search
result <- GET("https://api.reliefweb.int/v1/reports?appname=rwint-user-0&profile=list&preset=latest&slim=1&query[value]=(primary_country.iso3%3A%22afg%22)%20AND%20ocha_product%3A%22Humanitarian%20Needs%20Overview%22%20AND%20source%3A%22UN%20Office%20for%20the%20Coordination%20of%20Humanitarian%20Affairs%22&query[operator]=AND")

# Create a list of all the URLs listed in the search page
rawToChar(result$content)
result2 <- fromJSON(rawToChar(result$content))
urllist <- result2[["data"]][["fields"]][["url"]]

# Extract links to the PDF docs
urlpdf <- lapply(urllist, read_html)

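With this code I get a list of parsed HTML documents, but I am stuck on how to extract the .pdf URLs from them. Any idea how to proceed, or is there a more efficient approach?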

【Comments】:

    Tags: r html-parsing


    【Solution 1】:

    You appear to be using rvest, so you could do the following:

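    library(httr)
    library(rvest)
    library(jsonlite)
    
    # `result` is the response from the GET() call in the question
    result2 <- fromJSON(rawToChar(result$content))
    urllist <- result2[["data"]][["fields"]][["url"]]
    
    # Parse each report page
    urlpdf <- lapply(urllist, read_html)
    
    # Collect the href attribute of every <a> element on each page
    links <- lapply(urlpdf, function(x) html_attr(html_nodes(x, xpath = "//a"), "href"))
    
    # Keep only the links that end in ".pdf"
    pdfs <- lapply(links, function(x) grep("\\.pdf$", x, value = TRUE))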
    

    Result:

    pdfs
    #> [[1]]
    #> [1] "https://reliefweb.int/sites/reliefweb.int/files/resources/afg_humanitarian_needs_overview_2020.pdf"
    #> 
    #> [[2]]
    #> [1] "https://reliefweb.int/sites/reliefweb.int/files/resources/afg_2019_humanitarian_needs_overview.pdf"
    #> 
    #> [[3]]
    #> [1] "https://reliefweb.int/sites/reliefweb.int/files/resources/afg_2018_humanitarian_needs_overview_1.pdf"
    #> 
    #> [[4]]
    #> [1] "https://reliefweb.int/sites/reliefweb.int/files/resources/afg_2017_hno_english.pdf"
    #> 
    #> [[5]]
    #> [1] "https://reliefweb.int/sites/reliefweb.int/files/resources/afg_2016_hno_final_20151209.pdf"
    #> 
    #> [[6]]
    #> [1] "https://reliefweb.int/sites/reliefweb.int/files/resources/Afghanistan%20HRP%202015%20HNO%20Final%2023Nov2014%20%281%29.pdf"
    #> 
    #> [[7]]
    #> [1] "https://afg.humanitarianresponse.info/system/files/documents/files/Afg_2014HNO_FINALv2_0.pdf"
    #> [2] "https://reliefweb.int/sites/reliefweb.int/files/resources/Afg_2014HNO_FINALv2_0.pdf"
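
The question also asks about downloading the files, which the code above does not cover. The following is a minimal sketch of one way to do it with base R's `download.file()`, assuming `pdfs` is the list produced above. The helper names `pdf_destfile` and `download_pdfs` and the `pdfs` output directory are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: build a local file path for one PDF URL.
pdf_destfile <- function(url, dir = "pdfs") {
  # Decode percent-escapes such as %20 so the file name stays readable
  file.path(dir, utils::URLdecode(basename(url)))
}

# Hypothetical helper: download every URL in a (possibly nested) list.
download_pdfs <- function(pdfs, dir = "pdfs") {
  urls <- unique(unlist(pdfs))
  dest <- vapply(urls, pdf_destfile, character(1), dir = dir, USE.NAMES = FALSE)

  # Different hosts can serve a file with the same name (see element 7
  # of the result above); keep only the first URL for each file name.
  keep <- !duplicated(dest)

  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (i in which(keep)) {
    tryCatch(
      # mode = "wb" writes the file as binary, which matters on Windows
      download.file(urls[i], dest[i], mode = "wb", quiet = TRUE),
      error = function(e) {
        warning("Failed to download ", urls[i], ": ", conditionMessage(e))
      }
    )
  }
  invisible(dest[keep])
}

# Usage (needs network access):
# download_pdfs(pdfs)
```

Wrapping each download in `tryCatch()` means one failed request is reported as a warning rather than stopping the loop. Whether to skip or rename files that share a name is a design choice; this sketch simply keeps the first one.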
    

    【Comments】:
