Answers: Specifying Column Types when Importing xlsx Data to R with Package readxl

【Question title】: Specifying Column Types when Importing xlsx Data to R with Package readxl
【Posted】: 2015-07-26 05:15:49
【Question】:

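I am importing xlsx 2007 tables into R 3.2.1patched using package readxl 0.1.0 under Windows 7 64. The tables are on the order of 25,000 rows by 200 columns.

Function read_excel() works a treat. My only problem is with how it assigns column class (data type) to sparsely populated columns. For example, a given column may be NA for 20,000 rows and then take a character value on row 20,001. read_excel() appears to default to column type numeric when it scans the first n rows of a column and finds only NAs. The data causing the problem are character values in a column that was assigned numeric. When the error limit is reached, execution halts. I actually want the data in the sparse columns, so setting the error limit higher isn't a solution.

I can identify the troublesome columns by reviewing the warnings thrown. And read_excel() has an option for asserting a column's data type by setting the argument col_types, which the package docs describe as:

Either NULL to guess from the spreadsheet or a character vector containing blank, numeric, date or text.

But does this mean I have to construct a vector of length 200, populated in almost every position with blank, and with text in the handful of positions corresponding to the offending columns?

There's probably a way of doing this in a couple of lines of R code. Create a vector of the required length and fill it with blanks. Maybe another vector containing the numbers of the columns to be forced to text, and then ... Or maybe it's possible to call out for read_excel() just the columns for which its guesses aren't as desired.

I'd appreciate any suggestions.

Thanks in advance.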

【Question comments】:

Tags: r readxl

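【Answer 1】:

New solution since readxl version 1.x:

The solution in the currently preferred answer no longer works with versions of readxl newer than 0.1.0, because the package-internal function it uses, readxl:::xlsx_col_types, no longer exists.

The new solution is to use the newly introduced parameter guess_max to increase the number of rows used to "guess" the appropriate data type of each column: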

read_excel("My_Excel_file.xlsx", sheet = 1, guess_max = 1048576)

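The value 1,048,576 is the maximum number of rows Excel currently supports, see the Excel specs: https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

PS: If you are worried about the performance cost of using all rows to guess the data types: read_excel seems to read the file only once and to do the guessing in memory, so the performance penalty is very small compared to the work saved.

【Comments】: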

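【Answer 2】:

It depends on whether your data is sparse in different places in different columns, and how sparse it is. I found that having more rows didn't improve the parsing: the majority were still blank, and were interpreted as text, even if later on they became dates, etc.

One workaround is to make the first data row of your Excel table contain representative data for every column, and use that row to guess the column types. I don't like this because I want to leave the original data intact.

Another workaround, if you have complete rows somewhere in the spreadsheet, is to use nskip instead of n. This sets the starting point for the column guessing. Say data row 117 has a full set of data: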

readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)

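Note that you can call the function directly, without having to edit it in the namespace.

You can then use the resulting vector of column types to call read_excel: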

col_types <- readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)
dat <- readxl::read_excel(path = "a.xlsx", col_types = col_types)

Then you can manually update any columns it still gets wrong.
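The manual update mentioned above can be done on the type vector itself before it is passed to read_excel. The following is only a sketch: the helper name override_col_types, the column positions and the sample vector are hypothetical, and the sample vector merely stands in for whatever the guessing step returned.

```r
# Replace the guessed type at selected column positions.
# `col_types` is a character vector with one entry per column,
# `cols` are 1-based column positions, `type` is the type to force there.
override_col_types <- function(col_types, cols, type = "text") {
  stopifnot(is.character(col_types), all(cols %in% seq_along(col_types)))
  col_types[cols] <- type
  col_types
}

# Hypothetical guess for a five-column sheet, standing in for the vector
# returned by the guessing step.
guessed <- c("numeric", "blank", "date", "numeric", "text")

# Force columns 2 and 4 to text and leave the other guesses untouched.
fixed <- override_col_types(guessed, cols = c(2, 4))
print(fixed)
```

The corrected vector would then be supplied as the col_types argument in place of the raw guess. Only a handful of positions need to be named, however wide the sheet is.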

【Comments】:

Could you elaborate on the notation you are using? I'd like to do a similar thing with package xlsx, function read.xlsx and parameter colClasses. Thanks!

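【Answer 3】:

I have run into a similar problem.

In my case, empty rows and columns were used as separators, and each sheet contains many tables (with different formats). So the {openxlsx} and {readxl} packages are not suited to this situation, because openxlsx removes empty columns (and there is no parameter to change this behavior). The readxl package works as you described, and some data may be lost.

As a result, I think that if you want to process large amounts of Excel data automatically, the best solution is to read the sheets unchanged in "text" format and then process the data.frames according to your own rules.

This function can read sheets without changing them (thanks to @jack-wasey):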

loadExcelSheet<-function(excel.file, sheet)
{
  require("readxl")
  sheets <- readxl::excel_sheets(excel.file)
  sheet.num <- match(sheet, sheets) - 1
  num.columns <- length(readxl:::xlsx_col_types(excel.file, sheet =   sheet.num,
                                              nskip = 0, n = 1))

  return.sheet <- readxl::read_excel(excel.file, sheet = sheet,
                                col_types = rep("text", num.columns),
                                col_names = FALSE)
  return.sheet 
}
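Once everything has been read as text, the "process according to your rules" step might start from something like the following. This is a minimal base R sketch, not part of the original answer: the helper name convert_text_columns and the sample data frame are hypothetical, and the only rule applied is utils::type.convert, which picks a basic type per column.

```r
# Convert every all-text column to the most specific basic type that fits.
# Columns that do not parse as logical or numeric stay character.
convert_text_columns <- function(df) {
  df[] <- lapply(df, function(col) utils::type.convert(col, as.is = TRUE))
  df
}

# Hypothetical stand-in for a sheet that was read with every column as text,
# including a sparse column whose first values are missing.
raw <- data.frame(
  id    = c("1", "2", "3"),
  value = c(NA, "2.5", "4"),
  note  = c(NA, NA, "late entry"),
  stringsAsFactors = FALSE
)

converted <- convert_text_columns(raw)
str(converted)
```

Because the conversion looks at the whole column, a value that first appears far down a sparse column is taken into account. Dates and other special formats would still need rules of their own.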

【Comments】:

with columnNames loadExcelSheet
This solution no longer works with readxl version 1.x, because the internal function readxl:::xlsx_col_types has been removed. For a fix, see this answer: stackoverflow.com/a/46122161/4468078

【Answer 4】:

Reading the source, it looks like column types are guessed by the functions xls_col_types or xlsx_col_types, which are implemented in Rcpp but have these defaults:

xls_col_types <- function(path, na, sheet = 0L, nskip = 0L, n = 100L, has_col_names = FALSE) {
    .Call('readxl_xls_col_types', PACKAGE = 'readxl', path, na, sheet, nskip, n, has_col_names)
}

xlsx_col_types <- function(path, sheet = 0L, na = "", nskip = 0L, n = 100L) {
    .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}

My C++ is very rusty, but it looks like n = 100L is the argument that says how many rows to read.

As these are non-exported functions, paste in:

fixInNamespace("xls_col_types", "readxl")
fixInNamespace("xlsx_col_types", "readxl")

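Then, in the pop-up window, change n = 100L to a larger number. Then rerun your file import.

【Comments】:

It will certainly be useful sometimes to be able to control the number of rows scanned to determine the data type. In this problem, however, it could be necessary to scan very far down sparsely populated columns. I think the main issue is a convenient way of specifying the desired data type for just the few columns where the default behavior is not wanted. In his answer to stackoverflow.com/q/6099243, Mikko seems to suggest a way of doing this for read.xlsx2 in the xlsx package. Maybe something similar works for readxl. Thanks a lot for examining the source of readxl_xlsx_col_types.

【Answer 5】:

Looking at the source, we see that there is an Rcpp call that returns the guessed column types: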

xlsx_col_types <- function(path, sheet = 0L, na = "", nskip = 0L, n = 100L) {
    .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}

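You can see that with the defaults, nskip = 0L, n = 100L, the first 100 rows are checked to guess the column types. You can change nskip to ignore header text and increase n (at the cost of a much slower runtime) like this: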

col_types <-  .Call( 'readxl_xlsx_col_types', PACKAGE = 'readxl', 
                     path = file_loc, sheet = 0L, na = "", 
                     nskip = 1L, n = 10000L )

# if a column type is "blank", no values yet encountered -- increase n or just guess "text"
col_types[col_types=="blank"] <- "text"

raw <- read_excel(path = file_loc, col_types = col_types)

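Without looking at the .Rcpp, it's not immediately clear to me whether nskip = 0L skips the header row (the zeroth row in C++ counting) or skips no rows. I avoided the ambiguity by just using nskip = 1L, since skipping one row of my dataset doesn't affect the overall column type guesses.

【Comments】:

【Answer 6】:

The internal functions for guessing column types can be set to scan any number of rows. But read_excel() doesn't expose that (yet?).

The solution below is just a rewrite of the original function read_excel() with an extra argument n_max that defaults to all rows. For lack of imagination, this extended function is named read_excel2.

Just replace read_excel with read_excel2 to have column types evaluated over all rows.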

# Inspiration: https://github.com/hadley/readxl/blob/master/R/read_excel.R 
# Rewrote read_excel() to read_excel2() with additional argument 'n_max' for number
# of rows to evaluate in function readxl:::xls_col_types and
# readxl:::xlsx_col_types()
# This is probably an unstable solution, since it calls internal functions from readxl.
# May or may not survive next update of readxl. Seems to work in version 0.1.0
library(readxl)

read_excel2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {

  path <- readxl:::check_file(path)
  ext <- tolower(tools::file_ext(path))

  switch(readxl:::excel_format(path),
         xls =  read_xls2(path, sheet, col_names, col_types, na, skip, n_max),
         xlsx = read_xlsx2(path, sheet, col_names, col_types, na, skip, n_max)
  )
}
read_xls2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                     na = "", skip = 0, n_max = 1050000L) {

  sheet <- readxl:::standardise_sheet(sheet, readxl:::xls_sheets(path))

  has_col_names <- isTRUE(col_names)
  if (has_col_names) {
    col_names <- readxl:::xls_col_names(path, sheet, nskip = skip)
  } else if (readxl:::isFALSE(col_names)) {
    col_names <- paste0("X", seq_along(readxl:::xls_col_names(path, sheet)))
  }

  if (is.null(col_types)) {
    col_types <- readxl:::xls_col_types(
      path, sheet, na = na, nskip = skip, has_col_names = has_col_names, n = n_max
    )
  }

  readxl:::xls_cols(path, sheet, col_names = col_names, col_types = col_types, 
                    na = na, nskip = skip + has_col_names)
}

read_xlsx2 <- function(path, sheet = 1L, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {
  path <- readxl:::check_file(path)
  sheet <-
    readxl:::standardise_sheet(sheet, readxl:::xlsx_sheets(path))

  if (is.null(col_types)) {
    col_types <-
      readxl:::xlsx_col_types(
        path = path, sheet = sheet, na = na, nskip = skip + isTRUE(col_names), n = n_max
      )
  }

  readxl:::read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, na = na,
             nskip = skip)
}

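You might take a nasty performance hit because of this extended guessing. I haven't tried it on really big data sets yet, only on smaller data, enough to verify that it works.

【Comments】: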