XLS to CSV or R data.frame

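【Question title】: XLS to CSV or R data.frame
【Posted】: 2015-03-31 05:55:40
【Question description】:

I need to download this file programmatically (not by hand) and convert its contents to a data.frame. The ability to skip a few rows would be useful. I am looking specifically for a solution in R or Python.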

The file itself can be fetched from:

http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en

This is what I have tried so far:

I tried XLConnect (Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream)
I tried RODBC (Error in odbcConnectExcel("xl.file") : odbcConnectExcel is only usable with 32-bit Windows)
I tried xlrd in Python (XLRDError: Unsupported format or corrupt file)
I tried gdata (Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing!)

If you open the file in Notepad, it turns out to be an XML file, and when you open it in Excel you get a warning that the format and the extension do not match.
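The mismatch described above can be checked without opening the file by hand. A legacy binary .xls file is an OLE2 compound document and an .xlsx file is a ZIP archive, and both start with well-known magic bytes, which is what the XLConnect error is complaining about. The following is a minimal Python sketch of such a check; the function name `sniff_spreadsheet_format` and its return labels are made up for illustration.

```python
def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real container format from the first bytes of a file."""
    ole2_magic = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls (OLE2)
    zip_magic = b"PK\x03\x04"                        # .xlsx (OOXML) is a ZIP archive
    utf8_bom = b"\xef\xbb\xbf"

    if head.startswith(ole2_magic):
        return "ole2"
    if head.startswith(zip_magic):
        return "zip"
    # Drop an optional UTF-8 byte-order mark and leading whitespace,
    # then look for the start of an XML document.
    if head.startswith(utf8_bom):
        head = head[len(utf8_bom):]
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"


# Inline sample bytes stand in for the first bytes of a downloaded file.
print(sniff_spreadsheet_format(b'<?xml version="1.0"?><Workbook>'))
```

For a real file, the first few dozen bytes read in binary mode (for example `open(path, "rb").read(64)`) are enough. A result of "xml" suggests going straight to an XML parser instead of an Excel reader.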

Ideas that I could explore on my own are also useful, so if you don't have a full answer, please leave a comment.

My attempt so far with XML/regex:

library(XML)
library(stringr)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", destfile = "horizons.xls")
doc <- readLines(con = "horizons.xls")
doc <- str_extract(doc,"<Table[^>]*>(.*?)</Table>")
doc <- xmlParse(doc)
listing <- xpathApply(doc, "//Row", xmlToDataFrame)
listing <- listing[4:length(listing)]
listing <- do.call(rbind,lapply(listing, t))[,6:16]
listing[,3:11] <- gsub("[^-.0-9]", "", listing[,3:11])
listing <- as.data.frame(listing, row.names = NULL, stringsAsFactors = FALSE)
listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ")
listing[5:11] <- lapply(listing[5:11],as.numeric)
names(listing) <- c(
    "Product Name",
    "Ticker",
    "Class",
    "Price",
    "Price % Change",
    "Volume",
    "NAV/unit",
    "NAV % Change",
    "% Prem/Disc",
    "Outst. Shares"
)

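【Question discussion】:

Try read.xls: r-bloggers.com/read-excel-files-from-r
First: don't download the file with Internet Explorer, because some versions needlessly change the file extension from .xls to .xml.
Second: have you considered parsing the XML yourself?
@Steve The file was downloaded with the R XML package, not by hand.
@Steve XML is still a mystery to me. I hadn't considered it.

Tags: python r excel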

【Solution 1】:

Perhaps this can be done in R like so:

library(XML)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", file.path(tempdir(), "xls.xml"))
doc <- xmlParse(file.path(tempdir(), "xls.xml"))
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE)
names(df) <- unlist(df[1, ], use.names = FALSE) # use the first row as the column headers
df <- df[-1, ]                                  # then drop that row from the data
head(df)
# # # Language            ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares
# 2 1 1       en INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223  47.4302       1.9621       -0.11       5675671
# 3 2 2       en                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769  37.3539       1.2312       -0.20       5675671
# 4 3 3       en                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656  27.9144       0.9095        0.16      22019328
# 5 4 4       en                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0  21.9842       0.1864        0.21      22019328
# 6 5 5       en                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200  21.5441       0.6578       -0.30        902485
# 7 6 6       en                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900  30.0804       0.1395       -0.23        500440
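Since the question also asks about Python, the same idea can be sketched with only the standard library. This sketch assumes the file uses the usual Excel 2003 "XML Spreadsheet" namespace (the one the `ss` prefix is bound to). The helper names `spreadsheetml_rows` and `rows_to_csv` and the inline `SAMPLE` document are made up for illustration and are not taken from the real download.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the Excel 2003 "XML Spreadsheet" format (assumed for this file).
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def spreadsheetml_rows(xml_text, skip=0):
    """Return every Row as a list of cell strings, dropping the first `skip` rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iterfind(".//ss:Row", NS):
        values = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index (1-based) means empty cells were omitted; pad to stay aligned.
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                values.extend([""] * (int(index) - 1 - len(values)))
            data = cell.find("ss:Data", NS)
            values.append("".join(data.itertext()) if data is not None else "")
        rows.append(values)
    return rows[skip:]


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Tiny hypothetical document: one title row to skip, a header row, one data row.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row><Cell><Data ss:Type="String">Report title</Data></Cell></Row>
      <Row>
        <Cell><Data ss:Type="String">Ticker</Data></Cell>
        <Cell><Data ss:Type="String">Price</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">HXS</Data></Cell>
        <Cell><Data ss:Type="Number">47.38</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

print(rows_to_csv(spreadsheetml_rows(SAMPLE, skip=1)))
```

For the real file, the bytes returned by `urllib.request.urlopen(url).read()` could be passed to `spreadsheetml_rows` in place of `SAMPLE`, with `skip` set to the number of leading rows to ignore. The resulting CSV text can then be read by R's `read.csv` or by pandas.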

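【Discussion】:

Very nice. I had something very similar, but uglier. How did you know to use xmlToDataFrame and getNodeSet with //ss:Row? I mean, where did you learn that?
Mostly from stackoverflowing and help(package = "XML")... :-)
1. omegahat.org/RSXML/Tour.pdf 2. cran.r-project.org/web/packages/XML/XML.pdf
For passers-by, this is very helpful and written in a very accessible way: w3schools.com/xpath

【Solution 2】:

This may not be the best approach, but it should help.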

require("XML")

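url <- "http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en"

# download.file() returns a status code, not the file, so save to a path and parse that path
myfile1 <- file.path(tempdir(), "horizons.xml")

download.file(url, destfile = myfile1)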

doc <- xmlParse(myfile1)

root_doc <- xmlRoot(doc)

identify_worksheet <- c()

for (i in 1:xmlSize(root_doc)){ 
  identify_worksheet <- c(identify_worksheet, xmlName(root_doc[[i]]) == "Worksheet") 
}

worksheet_index = which(identify_worksheet == TRUE)

name1 <- xmlSApply(root_doc[[worksheet_index]], xmlName)

row_size <- xmlSize(root_doc[[worksheet_index]][[name1]])

col_size = max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize))

row_index = which(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize) == max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize)))

df1 <- data.frame(matrix(nrow = length(row_index)-1, ncol = col_size), stringsAsFactors = FALSE)

colnames(df1) <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[1]]])

for(i in 2:length(row_index)){
  df_index = i-1
  df1[df_index,] <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[i]]])
}

View(df1)

df2 <- df1[4:ncol(df1)]

View(df2)

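Identifying the element names in an Excel worksheet stored as XML. I would like to know whether the names below are a standard that XML-format Excel workbooks follow, and also whether, when there are several worksheets, the worksheet names are incremented with a numeric suffix (e.g. Worksheet1, Worksheet2, and so on).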

xmlName(root_doc)
[1] "Workbook"
xmlName(root_doc[[1]])
[1] "DocumentProperties"
xmlName(root_doc[[2]])
[1] "Styles"
xmlName(root_doc[[3]])
[1] "Worksheet"
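On the question raised above: in the Excel 2003 "XML Spreadsheet" format, to my knowledge, every sheet uses the same element name, `Worksheet`, and sheets are told apart by their `ss:Name` attribute rather than by a numeric suffix on the element name. The Python sketch below lists those names. The function name `worksheet_names` and the two-sheet `SAMPLE` document are hypothetical and only illustrate the layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Excel 2003 "XML Spreadsheet" format (assumed for this file).
SS = "urn:schemas-microsoft-com:office:spreadsheet"


def worksheet_names(xml_text):
    """Return the ss:Name attribute of every Worksheet element, in document order."""
    root = ET.fromstring(xml_text)
    return [ws.get(f"{{{SS}}}Name") for ws in root.iter(f"{{{SS}}}Worksheet")]


# Hypothetical workbook with two sheets; both elements are called "Worksheet".
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Funds"><Table/></Worksheet>
  <Worksheet ss:Name="Notes"><Table/></Worksheet>
</Workbook>"""

print(worksheet_names(SAMPLE))
```

So a multi-sheet file is best handled by selecting all `Worksheet` children and picking one by position or by its `ss:Name`, rather than by looking for names such as Worksheet1 or Worksheet2.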

Output

head(df1)
  # # Language            ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume
1 1 1       en INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223
2 2 2       en                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769
3 3 3       en                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656
4 4 4       en                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0
5 5 5       en                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200
6 6 6       en                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900
  NAV/unit NAV % Change % Prem/Disc Outst. Shares
1  47.4302       1.9621       -0.11       5675671
2  37.3539       1.2312       -0.20       5675671
3  27.9144       0.9095        0.16      22019328
4  21.9842       0.1864        0.21      22019328
5  21.5441       0.6578       -0.30        902485
6  30.0804       0.1395       -0.23        500440

head(df2)
             ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume NAV/unit
1 INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223  47.4302
2                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769  37.3539
3                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656  27.9144
4                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0  21.9842
5                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200  21.5441
6                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900  30.0804
  NAV % Change % Prem/Disc Outst. Shares
1       1.9621       -0.11       5675671
2       1.2312       -0.20       5675671
3       0.9095        0.16      22019328
4       0.1864        0.21      22019328
5       0.6578       -0.30        902485
6       0.1395       -0.23        500440

【Discussion】:

Yes. Unfortunately, this will most likely run daily on a *nix cluster that has no Excel, Notepad, or anything similar.