How to write an R script to extract URLs from an HTML table

【Question title】: How to write an R script to extract URLs from an HTML table
【Posted】: 2021-09-29 08:22:30
【Question description】:

I am trying to extract every URL, e.g. "https://..zip", from the elements of the page https://divvy-tripdata.s3.amazonaws.com/index.html, using the rvest library as follows:

link <- "https://divvy-tripdata.s3.amazonaws.com/index.html"

library(rvest)
library(xml2)

html <- read_html(link)

html %>% html_attrs("href")

Output:

html %>% html_attrs("href")
Error in html_attrs(., "href") : unused argument ("href")

Can you help me extract all the URLs from the link above using R?

HTML: https://i.stack.imgur.com/5BiFU.jpg

【Question comments】:

Tags: html r web-scraping rvest

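【Solution 1】:

The links come from an additional GET request made by the browser, and that request returns XML. You can still use rvest: select the Key nodes, then complete the URLs.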

library(rvest)

base_url <- "https://divvy-tripdata.s3.amazonaws.com"
files <- read_html(base_url) |> html_elements('key') |> html_text() |> url_absolute(base_url)

For R versions older than 4.1.0, which lack the native |> pipe, replace |> with %>% and add library(magrittr) as an import.
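
To see the approach without touching the network, here is a minimal sketch that runs the same steps on a small in-memory string. The content of sample_xml is a made-up, shortened stand-in for the real listing, and the variable names are chosen for illustration only. The first lines also show why the call in the question fails: html_attrs() takes no attribute name and returns all attributes, whereas html_attr() takes the name of a single attribute.

```r
library(rvest)  # also re-exports read_html() and url_absolute() from xml2

base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# html_attr() (singular) reads one named attribute from each element
hrefs <- read_html('<a href="a.zip">a</a>') |>
  html_elements("a") |>
  html_attr("href")

# Hypothetical, shortened stand-in for the XML listing (an assumption)
sample_xml <- paste0(
  "<ListBucketResult>",
  "<Contents><Key>202004-divvy-tripdata.zip</Key></Contents>",
  "<Contents><Key>index.html</Key></Contents>",
  "</ListBucketResult>"
)

# read_html() uses the HTML parser, which lowercases tag names,
# so the <Key> nodes are selected as 'key'
keys <- read_html(sample_xml) |> html_elements("key") |> html_text()

# Keep only the zip archives, then resolve them against the base URL
zip_keys <- keys[grepl("\\.zip$", keys)]
zip_urls <- url_absolute(zip_keys, base_url)
print(zip_urls)
```

Filtering on the .zip suffix is an addition to the answer above; it drops entries such as index.html that may also appear in the listing.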

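【Comments】:

Awesome! It looks easy for you, but it kept me busy for two days searching the web for a solution. Thank you very much for your help!
You're welcome.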

【Solution 2】:

Base R solution: read the XML served at the URL one level up from index.html and parse it with regular expressions:

# Store the URL to be scraped as a variable: base_url => character scalar
base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Resolve the zip urls: zip_urls => character vector
zip_urls <- paste(
  base_url, 
  gsub(
    ">(.*?)<\\/",
    "\\1",
    grep(
      "\\.zip", 
      strsplit(
        readLines(base_url), 
        "\\<Key\\>")[[2]],
      value = TRUE
    )
  ),
  sep = "/"
)
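
In the code above, "\\<Key\\>" uses R's word-boundary anchors, so the text is split on the word Key, and [[2]] picks the pieces from the second line read, presumably because the listing follows an XML declaration line. As a hedged alternative, the sketch below gets the same result with regmatches() and gregexpr(). It runs on a made-up, shortened sample string rather than the live listing, so sample_xml and its contents are assumptions.

```r
base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Hypothetical, shortened stand-in for the XML listing (an assumption)
sample_xml <- paste0(
  "<ListBucketResult>",
  "<Contents><Key>202004-divvy-tripdata.zip</Key><Size>1</Size></Contents>",
  "<Contents><Key>202005-divvy-tripdata.zip</Key><Size>2</Size></Contents>",
  "<Contents><Key>index.html</Key><Size>3</Size></Contents>",
  "</ListBucketResult>"
)

# Pull every <Key>...</Key> element out of the text
key_tags <- regmatches(sample_xml, gregexpr("<Key>[^<]*</Key>", sample_xml))[[1]]

# Strip the surrounding tags to leave the bare object keys
keys <- gsub("</?Key>", "", key_tags)

# Keep the zip archives and prefix them with the base URL
zip_urls <- paste(base_url, keys[grepl("\\.zip$", keys)], sep = "/")
print(zip_urls)
```

To run this against the real listing, one option is to replace sample_xml with paste(readLines(base_url), collapse = ""), which does not depend on how many lines the response contains.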

【Comments】: