Get website directory listing in an R vector using RCurl: answers

【Question title】: Get website directory listing in an R vector using RCurl
【Posted】: 2013-05-22 19:13:03
【Question】:

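I am trying to get the list of files in a directory on a website. Is there a way to do this that is similar to the dir() or list.files() commands for local directory listings? I can connect to the website using RCurl (I need it because I need an SSL connection over HTTPS):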

library(RCurl)
# "*some https website*" is the placeholder from the question; substitute the real URL
text <- getURL("*some https website*",
               ssl.verifypeer = FALSE,
               dirlistonly = TRUE)

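But this produces an HTML file containing images, hyperlinks, etc. for the file list, whereas I only need an R vector of file names, like the one dir() returns. Is this possible? Or do I have to parse the HTML to extract the file names? That sounds like a complicated approach to a simple problem.

Thanks,

EDIT: If you can get it to work with http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ then you'll see what I mean.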

【Question comments】:

Tags: r rcurl

【Solution 1】:

This is the last example in the getURL help file (with an updated URL):

library(RCurl)

url <- 'ftp://speedtest.tele2.net/'
filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)

# Deal with newlines as \n or \r\n. (BDR)
# Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE
# filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)
filenames <- paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")

Does this solve your problem?

【Comments】:

For the ftp site you mention, yes, it works. But it doesn't work for the following website: hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/…
Dear @dag, I get the following error when trying the suggestion you gave here (stackoverflow.com/a/17187525/1972786): url <- 'ftp://XX.XX.XXX.XX/images/uploads_webapp/' > filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE) Error in function (type, msg, asError = TRUE) : Failed to connect to 54.251.104.13 port 21: Connection refused
@SanjayMehrotra Thanks. I have now changed the URL (almost two years after your comment, yes...)
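
For an HTTP(S) index page like the one mentioned in the comments, the reply is HTML, so some parsing seems hard to avoid. The following is only a rough sketch of that idea, not something taken from the thread: the sample HTML is made up, `list_links` is a hypothetical helper, and the regular expression assumes an Apache-style index with double-quoted `href` attributes. In real use, `html` would be the string returned by `getURL()`, and a proper HTML parser (for example the XML package) would likely be more robust.

```r
# Made-up HTML resembling an Apache-style directory index (illustration only).
html <- paste(
  '<html><body><h1>Index of /example/</h1><pre>',
  '<a href="?C=N;O=D">Name</a>',
  '<a href="/parent/">Parent Directory</a>',
  '<a href="fileA.gtf.gz">fileA.gtf.gz</a> 2011-05-01 10:00 12M',
  '<a href="fileB.txt">fileB.txt</a> 2011-05-02 11:30 4K',
  '<a href="subdir/">subdir/</a> 2011-05-03 09:15 -',
  '</pre></body></html>',
  sep = "\n"
)

# Hypothetical helper: extract file names from the href attributes of a page.
list_links <- function(html) {
  # Find every href="..." attribute in the page
  matches <- regmatches(html, gregexpr('href="[^"]*"', html))[[1]]
  # Strip the leading href=" and the trailing quote
  links <- sub('^href="', "", sub('"$', "", matches))
  # Drop column-sort links ("?..."), absolute paths ("/...") and subdirectories (".../")
  links[!grepl("^[?/]", links) & !grepl("/$", links)]
}

files <- list_links(html)
print(files)
```

The result is a plain character vector, which is the closest equivalent to what dir() returns locally. Pasting the base URL in front of each entry, as Solution 1 does, would give full download URLs.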

【Solution 2】:

Try this:

   library(RCurl)

   # "ftp://[...]/" is elided in the original answer; substitute your FTP directory URL
   dir_list <-
     read.table(
       textConnection(
         getURLContent("ftp://[...]/")
       ),
       sep = "",
       strip.white = TRUE)

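The resulting table splits the date into 3 text fields, but it is a good start, and you can get the file names from it.

【Comments】:

【Solution 3】:

I was reading the RCurl document and came across a new piece of code: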
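
To illustrate the last point without a network connection, here is a minimal sketch that runs the same read.table() idea on a made-up listing. It assumes the server returns a Unix-style long listing with nine whitespace-separated fields; the sample lines and the column names are assumptions for illustration, and file names containing spaces would break this approach.

```r
# Made-up FTP long listing (illustration only); a real one would come from getURLContent().
listing <- paste(
  "-rw-r--r--    1 0        0        1073741824 Feb 19  2016 1GB.zip",
  "-rw-r--r--    1 0        0           1048576 Feb 19  2016 1MB.zip",
  "drwxr-xr-x    2 105      108            4096 Jan 05 12:34 upload",
  sep = "\n"
)

dir_list <- read.table(
  text = listing,
  sep = "",                 # split on any run of whitespace
  strip.white = TRUE,
  quote = "",               # treat quote characters in names literally
  comment.char = "",        # do not treat "#" as a comment marker
  stringsAsFactors = FALSE,
  # Hypothetical column names; the date occupies three fields, as noted above
  col.names = c("perms", "links", "owner", "group", "size",
                "month", "day", "year_or_time", "name")
)

# Keep regular files only: directory entries have a permission string starting with "d"
is_dir <- substr(dir_list$perms, 1, 1) == "d"
file_names <- dir_list$name[!is_dir]
print(file_names)
```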

stockReader =
function()
{
  values <- numeric() # to which the data is appended when received
  # Function that appends the values to the centrally stored vector
  read = function(chunk) {
    con = textConnection(chunk)
    on.exit(close(con))
    tmp = scan(con)
    values <<- c(values, tmp)
  }
  list(read = read,
       values = function() values # accessor to get result on completion
  )
}

followed by

reader = stockReader()
getURL('http://www.omegahat.org/RCurl/stockExample.dat',
       write = reader$read)
reader$values()

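The example deals with "numeric" data, but surely this code sample can be adapted? Read the attached document. I'm sure you will find what you are looking for.

It also says

The basic use of getURL(), getForm() and postForm() returns the contents of the requested document as a single block of text. It is accumulated by the libcurl facilities and combined into a single string. We then typically traverse the contents of the document to extract the information into regular data such as vectors and data frames. For example, suppose the document we requested is a simple stream of numbers, such as the prices of a particular stock at different points in time. We would download the contents of the file and then read it into a vector in R so that we could analyze the values. Unfortunately, this essentially results in two copies of the data residing in memory simultaneously. This may be prohibitive, or at least undesirable, for large datasets. An alternative approach is to process the data in chunks as it is received by libcurl. If we can be notified each time libcurl receives data from the reply and do something meaningful with that data, then we do not need to accumulate the chunks. The largest extra information we will need to hold is the largest chunk. In our example, we could take each chunk and pass it to the scan() function to turn the values into a vector. Then we can concatenate this with the vector from the previously processed chunks.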
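
As a sketch of how the stockReader() pattern might be adapted to text such as file names, the closure below collects lines instead of numbers. This is an assumption-laden illustration rather than code from the RCurl document: `lineReader` is a hypothetical name, the chunks are simulated by hand instead of coming from libcurl, and the logic assumes one entry per line. Unlike the numeric example, it holds back an unfinished last line, since a chunk boundary can fall in the middle of a name.

```r
# Hypothetical adaptation of stockReader(): accumulates text lines chunk by chunk.
lineReader <- function() {
  lines <- character()  # complete lines received so far
  partial <- ""         # trailing fragment of a line not yet terminated

  read <- function(chunk) {
    txt <- paste0(partial, chunk)
    if (!nzchar(txt)) return(invisible(NULL))
    parts <- strsplit(txt, "\r*\n")[[1]]
    if (endsWith(txt, "\n")) {
      partial <<- ""
    } else {
      # The last piece is incomplete: keep it for the next chunk
      partial <<- parts[length(parts)]
      parts <- parts[-length(parts)]
    }
    lines <<- c(lines, parts)
    invisible(NULL)
  }

  list(read = read,
       # Accessor: include any unterminated final line
       values = function() if (nzchar(partial)) c(lines, partial) else lines)
}

reader <- lineReader()
# Simulate a reply arriving in arbitrary chunks, split mid-line on purpose
for (chunk in c("fileA.txt\r\nfile", "B.txt\nfileC", ".txt\n")) reader$read(chunk)
result <- reader$values()
print(result)
```

With RCurl, the same object would presumably be passed as `write = reader$read` in the getURL() call, in the way the stock example does.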

【Comments】:

After testing this on the example link, this is the error it gives: Error: unexpected input in "getURL(’" > write = reader$read) Error: unexpected ')' in " write = reader$read)" > reader$values() Error in reader$values : object of type 'closure' is not subsettable