R: Using rvest package instead of XML package to get links from URL

【Question title】: R: Using rvest package instead of XML package to get links from URL
【Posted】: 2015-02-02 12:52:56
【Question】:

I use the XML package to get the links from this url.

library(XML)
# Parse HTML URL
v1WebParse <- htmlParse(v1URL)
# Read links and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

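This method works very well, but I have also used rvest, and it seems to be faster than XML at parsing a web page. I tried html_nodes and html_attrs but could not get them to work.

【Comments】:

rvest uses the XML package for node extraction. It really shouldn't be faster.

Tags: xml r web-scraping rvest

【Solution 1】:

Despite my comment, here is how you can do it with rvest. Note that we need to read the page in with htmlParse first, since the site sets the content type of that file to text/plain, which throws rvest into a tizzy.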

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"  
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"  
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"  
## [275] "/inf_corporativa98959_ZNC.html"

This further illustrates rvest's XML package underpinnings.

Update

rvest::read_html() can now handle this directly:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
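To try the same pipeline without depending on the site, here is a minimal sketch that parses an inline HTML string. The markup and the hrefs in it are made up for illustration and do not come from the page above; it assumes an rvest version in which read_html() is available.

```r
library(rvest)

# Hypothetical markup standing in for the downloaded page
page_text <- '<html><body>
  <a href="/inf_example_one.html">One</a>
  <a href="/inf_example_two.html">Two</a>
  <a name="no-href">Anchor without href</a>
</body></html>'

pg <- read_html(page_text)

# html_attr() returns NA for nodes that lack the attribute
hrefs <- pg %>% html_nodes("a") %>% html_attr("href")

# Drop the anchors that carry no href
hrefs <- hrefs[!is.na(hrefs)]
hrefs
```

Filtering out the NA values is optional; it only matters when a page contains anchors without an href.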

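【Comments】:

You are right, rvest uses XML for node extraction. I will discuss in chat the timing differences on the sites where I use these packages. Thanks for your reply.

【Solution 2】:

I know you are looking for an rvest answer, but here is another way using the XML package that might be more efficient than what you are doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It is a handler function, so we can collect the values as they are read, which saves memory and improves efficiency.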

library(XML)

links <- function(URL) 
{
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                links <<- c(links, xmlGetAttr(node, "href"))
                node
             },
             links = function() links)
        }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}

links("http://www.bvl.com.pe/includes/empresas_todas.dat")
#  [1] "/inf_corporativa71050_JAIME1CP1A.html"
#  [2] "/inf_corporativa10400_INTEGRC1.html"  
#  [3] "/inf_corporativa66100_ACESEGC1.html"  
#  [4] "/inf_corporativa71300_ADCOMEC1.html"  
#  [5] "/inf_corporativa10250_HABITAC1.html"  
#  [6] "/inf_corporativa77900_PARAMOC1.html"  
#  [7] "/inf_corporativa77935_PUCALAC1.html"  
#  [8] "/inf_corporativa77600_LAREDOC1.html"  
#  [9] "/inf_corporativa21000_AIBC1.html"     
#  ...
#  ...
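
The handler pattern can also be checked offline by parsing an HTML string instead of a URL. The sketch below is a variant of the function above, not the answer's own code: collect_links and the sample markup are hypothetical, and asText = TRUE tells htmlTreeParse that its first argument is the document itself rather than a file name or URL.

```r
library(XML)

# Hypothetical helper: same closure-based handler idea, but fed a string
collect_links <- function(html_text) {
  found <- character()
  handlers <- list(
    # Called for every <a> node as the parser builds it
    a = function(node, ...) {
      # xmlGetAttr() returns NULL when href is absent, so c() leaves 'found' unchanged
      found <<- c(found, xmlGetAttr(node, "href"))
      node
    }
  )
  htmlTreeParse(html_text, handlers = handlers, asText = TRUE)
  found
}

# Made-up markup; the second anchor has no href on purpose
sample_html <- '<html><body><a href="/first.html">1</a><a name="x">no href</a><a href="/second.html">2</a></body></html>'
collect_links(sample_html)
```

Because the handler writes into the enclosing function's variable with <<-, the links accumulate while the document is being parsed, which is the property the answer relies on.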

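【Comments】:

That helped a lot. I had not checked the examples in htmlParse, but I modified my code following your suggestion. In this case XML works very well, but getting historical prices from the web takes longer with it than with rvest.
Prices? Your question says you are trying to get links.
Yes, from this web I was trying to get all the links on the site, while on this site I was trying to parse a table containing the historical prices of the SIDERC1 quote. I used XML on both sites, but I could only use rvest on the latter.

【Solution 3】: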

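# Option 1
library(RCurl)
library(XML) # getHTMLLinks() is exported by the XML package
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
# read_html() is the current name for what older rvest versions called html()
read_html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>% html_nodes("a") %>>% html_attr("href")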

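【Comments】:

Option 1 no longer seems to work with the current version of RCurl.

【Solution 4】:

Richard's answer works for HTTP pages but not for the HTTPS page I needed (Wikipedia). I swapped in RCurl's getURL function to fetch the page, as follows: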

library(RCurl)
library(XML)

links <- function(URL) 
{
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node
    },
    links = function() links)
  }
  h1 <- getLinks()
  # Download the page with RCurl, then parse the returned text
  xData <- getURL(URL)
  htmlTreeParse(xData, handlers = h1, asText = TRUE)
  h1$links()
}
