Getting '404 Not Found' error when reading data from url, despite file existing

【Question title】: Getting '404 Not Found' error when reading data from url, despite file existing
【Posted】: 2017-08-30 12:29:45
【Question】:

I am writing a program to collect all of the daily .csv files from this page. However, for some of the files I get the following error message:

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
  cannot open URL 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05042016_DailyAbsenceData.csv': HTTP status was '404 Not Found'

Here is an example using the file for May 12, 2016:

read.csv(url("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"))

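Strangely, if you go to the website, find the link to the file and click it, R no longer gives the error and reads the file correctly. What is going on here, and how can I read these files without clicking on them manually? (Note that only the first of you will be able to reproduce the problem, since clicking on the file fixes it for everyone after that.)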

Ultimately, I would like to use the following loop to collect all of the files:

# Create a vector of dates. This is the interval the data is collected from.
dates = seq(as.Date("2016-05-01"), as.Date("2016-05-30"), by = "days")
# Format to match the filename prefixes
dates = strftime(dates, '%m%d%Y')
# Create the vector of file names I want to read.
file.names = paste(dates, "_DailyAbsenceData.csv", sep = "")

# A loop that reads the .csv files into a list of data frames
daily.truancy = list()
for (i in seq_along(dates)) {
  tryCatch({ # tryCatch prevents the loop from stopping when read.csv cannot access the file
    daily.truancy[[i]] = read.csv(url(paste("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/", file.names[i], sep = "")), sep = ",")
    message("School day") # this indicates that the file was successfully read in to the list
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

# Unlist the daily data to a large panel
daily.truancy.2016 <- do.call("rbind", daily.truancy)

Note that days for which there really is no file (weekends) give the same error message. That is not the problem.
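Since weekends are known to have no file, one option is to leave them out before any request is made, so that the remaining 404s are easier to interpret. The following is a minimal sketch of that idea, not the asker's code: `weekday_urls` and `read_or_null` are hypothetical helper names, and only the base URL and the file name pattern come from the question. Public holidays that fall on a weekday would still fail and are simply skipped.

```r
# Base URL taken from the question; the helpers below are illustrative only.
base_url <- "https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/"

# Build the file URLs for every weekday in a date range (hypothetical helper).
weekday_urls <- function(from, to, base = base_url) {
  dates <- seq(as.Date(from), as.Date(to), by = "days")
  # "%u" is the ISO weekday number: 1 = Monday, ..., 7 = Sunday
  is_weekday <- as.integer(format(dates, "%u")) <= 5
  paste0(base, format(dates[is_weekday], "%m%d%Y"), "_DailyAbsenceData.csv")
}

# Read one file, returning NULL instead of stopping when it cannot be read
# (hypothetical helper).
read_or_null <- function(path) {
  tryCatch(
    suppressWarnings(read.csv(path)),
    error = function(e) {
      message("Skipping ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

urls <- weekday_urls("2016-05-01", "2016-05-30")

# The download itself needs network access, so it is left commented out:
# daily.truancy <- Filter(Negate(is.null), lapply(urls, read_or_null))
# daily.truancy.2016 <- do.call("rbind", daily.truancy)
```

Returning NULL for unreadable files and dropping those entries with `Filter` keeps the list free of gaps before it is passed to `rbind`.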

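【Comments】:

I don't get an error for the file you listed. If you are getting a 404 return code from a website, you should contact the website to find out why. R cannot make a file that does not exist suddenly appear. It is possible that these files are generated on demand.
They may be specifically blocking your client in order to prevent automated downloading of the files.
MrFlick, if the files are created on demand, is there a way to trigger the generation process other than clicking the icon on the web page? Is there a way to trigger it from within R?
MrFlick, the files do exist, at least as clickable links. That is the whole point of this post.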
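To narrow down what the server is actually returning, it can help to look at the status code directly before calling read.csv. The following is a minimal sketch using base R's `curlGetHeaders()`; `url_status` and its `header_fn` argument are hypothetical names introduced here, and a status code alone will not prove whether the files are generated on demand or the client is being blocked.

```r
# Return the HTTP status code for a URL, or NA if the request itself fails.
# header_fn defaults to base R's curlGetHeaders(); it is a parameter only so
# that the function can be exercised without network access.
url_status <- function(u, header_fn = curlGetHeaders) {
  tryCatch(
    as.integer(attr(header_fn(u), "status")),
    error = function(e) NA_integer_
  )
}

# Example (needs network access, so it is left commented out):
# u <- "https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"
# if (identical(url_status(u), 200L)) truancy <- read.csv(u)
```

Checking the same URL before and after clicking the link in a browser would show whether the status really changes from 404 to 200, as the question describes.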

Tags: r csv url http-status-code-404

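【Solution 1】:

Because the page is generated dynamically, the url function does not work here, but RSelenium is designed for exactly this kind of task.

I would like to thank @jdharrison for this excellent package and for his answers to challenging questions; see his answers page for more examples.

The basic setup process is explained here: RSelenium Setup

To find the ID of the element we are interested in, the easiest way is to right-click the element and choose "Inspect" in Chrome. I am not sure about other browsers, but they should have a similar feature, possibly under a different name.

This opens a side pane containing the HTML markup of the selected element.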

library(RSelenium)

# startServer() is defunct in current RSelenium releases (see the comment under
# this answer); rsDriver() starts a Selenium server and opens a browser client.
# You can replace the browser name with your own, e.g. "firefox".
rD <- rsDriver(browser = "chrome")
remDr <- rD$client

appURL <- 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/AttendanceReports.aspx'

# Total number of months to download
totalMonths <- 2

remDr$navigate(appURL)

for (monthYearCounter in 1:totalMonths) {

  # Active month and year on the page, e.g. April 2017
  monthYearElem <- remDr$findElement("xpath", "//td[contains(@style,'width:70%')]")

  # Highlight the element in yellow for visual feedback
  monthYearElem$highlightElement()

  # Extract the text
  monthYearText <- unlist(monthYearElem$getElementAttribute("innerHTML"))

  cat(paste0("Processing month year=", monthYearText, "\n"))

  # For a particular month, all the CSV files are listed in a table.
  # Find the elements for all CSV files using the id pattern "imgBtnXls".
  csvFilesElemList <- remDr$findElements("xpath", "//input[contains(@id,'imgBtnXls')]")

  # Click each element, which saves the file to the default download location.
  # Pause between consecutive requests to avoid burdening the server.
  for (x in csvFilesElemList) {
    x$clickElement()

    # Be nice, do not overload servers with rapid requests!
    Sys.sleep(60)
  }

  # Go to the previous month
  remDr$findElement("xpath", "//a[contains(@title,'Go to the previous month')]")$clickElement()
}

# Close the browser and stop the Selenium server when done
remDr$close()
rD$server$stop()
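The loop above saves each file to the browser's default download location rather than loading it into R. The following is a minimal sketch of the remaining step, assuming the downloads keep the MMDDYYYY_DailyAbsenceData.csv names from the question, land in a single folder, and share the same columns. `combine_daily_files`, its `download_dir` argument, the added `file_date` column and the "~/Downloads" path are all hypothetical.

```r
# Read every daily absence file in a folder and stack them into one data frame
# (hypothetical helper; the file name pattern is taken from the question).
combine_daily_files <- function(download_dir) {
  files <- list.files(download_dir,
                      pattern = "^[0-9]{8}_DailyAbsenceData\\.csv$",
                      full.names = TRUE)
  daily <- lapply(files, function(f) {
    df <- read.csv(f)
    # Recover the date from the MMDDYYYY prefix of the file name
    file_date <- as.Date(substr(basename(f), 1, 8), format = "%m%d%Y")
    df$file_date <- rep(file_date, nrow(df))
    df
  })
  do.call("rbind", daily)
}

# Example, with an assumed download folder:
# daily.truancy.2016 <- combine_daily_files("~/Downloads")
```

Keeping the date as a column means the panel can still be split by day after the files have been bound together.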

【Comments】:

Thanks for your answer. My code gives the following error: > RSelenium:::startServer() Error: startServer is now defunct. Users in future can find the function in file.path(find.package("RSelenium"), "examples/serverUtils"). The recommended way to run a selenium server is via Docker. Alternatively see the RSelenium::rsDriver function.