【Question title】: Reading off links on a site and storing them in a list
【Posted】: 2020-09-12 17:05:36
【Question description】:

I am trying to read the URLs for data from StatsCan, as follows:


library(rvest)  # read_html(), html_nodes(), html_attr(), %>%

# 2015
url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"

x1 <- read_html(url) %>% 
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>% 
  html_attr("href")


# 2014
url2 <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2014/16993"

x2 <- read_html(url2) %>% 
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>% 
  html_attr("href")

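Doing this returns two empty lists. I am confused, because the same XPath works for this link: https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/18087. Ultimately I want to loop over the list and read the table on each page: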

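# Assumes stringr (str_c), tibble (as_tibble) and a package providing
# write.xlsx (e.g. openxlsx) are loaded, and that `destination` is an
# output path prefix defined elsewhere.
for (i in seq_along(x2)) {  # seq_along() is safe when x2 is empty
  out.data <- read_html(x2[i]) %>% 
    html_table(fill = TRUE) %>% 
    `[[`(1) %>%  # keep the first table on the page
    as_tibble()
  write.xlsx(out.data, str_c(destination, i, ".xlsx"))
}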

【Comments】:

    Tags: html r xpath web-scraping rvest


    【Solution 1】:

    To extract all the URLs, I suggest using the CSS selector ".field-item li a" and then subsetting the results by a pattern.

    links <- read_html(url) %>% 
        html_nodes(".field-item li a") %>% 
        html_attr("href") %>% 
        str_subset("fuel-prices/crude")
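
The selector-plus-pattern approach can be tried offline on a small inline snippet. This is only a sketch: the markup, the `href` values, and the `base` URL below are made up for illustration, and the real page may be structured differently. It also shows one possible follow-up step, under the assumption that the extracted `href` values are relative: `xml2::url_absolute()` turns them into full URLs before they are passed to `read_html()`.

```r
library(rvest)
library(stringr)

# Hypothetical markup, invented for this example; not copied from the real page.
page <- read_html('
<div class="field-item">
  <ul>
    <li><a href="/energy/fuel-prices/crude/2015-01">January 2015</a></li>
    <li><a href="/energy/about">About</a></li>
  </ul>
</div>')

# Select every link inside the list, then keep only hrefs matching the pattern.
links <- page %>%
  html_nodes(".field-item li a") %>%
  html_attr("href") %>%
  str_subset("fuel-prices/crude")

# Assumption: the hrefs are relative, so resolve them against the site root.
base <- "https://www.nrcan.gc.ca/"
full_links <- xml2::url_absolute(links, base)
```

If the `href` values on the real page are already absolute, the `url_absolute()` step leaves them unchanged.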
    

    【Discussion】:

      【Solution 2】:

      Your XPath needs fixing. You could use the following one:

      //strong[contains(.,"Oil")]/following-sibling::ul//a
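
As a rough illustration of how this XPath could be plugged into the question's rvest pipeline, the sketch below runs it against a small inline document. The markup is hypothetical and only mimics a `strong` heading followed by a sibling `ul` of links; the real page may differ.

```r
library(rvest)

# Hypothetical markup, invented for this example; not copied from the real page.
page <- read_html('
<ul>
  <li><strong>Oil prices</strong>
    <ul>
      <li><a href="/fuel-prices/crude/2015-01">January</a></li>
      <li><a href="/fuel-prices/crude/2015-02">February</a></li>
    </ul>
  </li>
  <li><strong>Other</strong>
    <ul><li><a href="/other/page">Other page</a></li></ul>
  </li>
</ul>')

# Find the <strong> whose text contains "Oil", then collect every <a>
# under the <ul> elements that follow it at the same level.
links <- page %>%
  html_nodes(xpath = '//strong[contains(.,"Oil")]/following-sibling::ul//a') %>%
  html_attr("href")
```

Note that `contains()` is case-sensitive, and that `following-sibling::ul` matches every later `ul` under the same parent, so the result depends on how the headings are nested on the actual page.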
      

      【Discussion】:
