fread() fails with missing values in integer64 columns

【Question title】: fread() fails with missing values in integer64 columns
【Posted】: 2014-02-07 12:24:47
【Question】:

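When reading the text below, fread() fails to detect the missing values in columns 8 and 9. This happens only with the default option integer64="integer64". Setting integer64="double" or "character" detects the NAs correctly. Note that the file contains three possible forms of NA in V8 and V9: an empty field (,,), a single space (, ,), and the literal NA. Adding na.strings=c("NA","N/A",""," "), sep="," as options has no effect.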

read.csv() behaves the same way as fread(integer64="double").

The text to read (also available as a file, integer64_and_NA.csv):

2012,276,,0,"S1","001",1,,724135215,1590915056,
2012,276,2,8,"S1","001",1, ,,154598,0
2012,276,2,12,"S1","001",1,NA,5118863,21819477,
2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0

Here is the output from fread():

DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",")

Input contains no \n. Taking this to be a filename to open
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok
Found 11 columns
First row with 11 fields occurs on line 1 (either column names or first row of data)
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 5
Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows
Type codes: 11114412221 (first 5 rows)
Type codes: 11114412221 (after applying colClasses and integer64)
Type codes: 11114412221 (after applying drop or select (if supplied)
Allocating 11 column slots (11 - 0 NULL)
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 4x11 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.001s        Total

The resulting data.table is:

DT
     V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11
1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA
2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0
3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA
4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0

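NA values are detected correctly in the columns that are not integer64. In V8 and V9, which fread() marks as integer64, we get "9218868437227407266" instead of NA. Interestingly, str() reports the corresponding values of V8 and V9 as NA: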

str(DT)

Classes ‘data.table’ and 'data.frame':  4 obs. of  11 variables:
 $ V1 : int  2012 2012 2012 2012
 $ V2 : int  276 276 276 276
 $ V3 : int  NA 2 2 2
 $ V4 : int  0 8 12 0
 $ V5 : chr  "S1" "S1" "S1" "S1"
 $ V6 : chr  "001" "001" "001" "011"
 $ V7 : int  1 1 1 8
 $ V8 :Class 'integer64'  num [1:4] NA NA NA 1.55e-314
 $ V9 :Class 'integer64'  num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314
 $ V10:Class 'integer64'  num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314
 $ V11: int  NA 0 NA 0
 - attr(*, ".internal.selfref")=<externalptr>

...but nothing else treats them as NA:

> is.na(DT$V8)
[1] FALSE FALSE FALSE FALSE
> max(DT$V8)
integer64
[1] 9218868437227407266
> max(DT$V8, na.rm=TRUE)
integer64
[1] 9218868437227407266
> class(DT$V8)
[1] "integer64"
> typeof(DT$V8)
[1] "double"

This does not appear to be just a printing/display issue; data.table treats them as huge integers:

> DT[, V12:=as.numeric(V8)]
Warning message:
In as.double.integer64(V8) :
  integer precision lost while converting to double
> DT
     V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11          V12
1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA 9.218868e+18
2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0 9.218868e+18
3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA 9.218868e+18
4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0 3.127134e+09
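The odd value is not arbitrary. As a minimal sketch (assuming the bit64 package is installed and a platform that preserves the NaN payload, which is the usual case), the snippet below shows that 9218868437227407266 is what the 8 bytes of R's double NA (NA_real_) look like when they are read as a 64-bit integer. That would also explain why str(), which looks at the underlying double storage, prints NA while is.na() on the integer64 class does not.

```r
library(bit64)

# Big-endian bytes of R's double NA: a NaN whose payload is 1954 (0x07a2).
na_bytes <- writeBin(NA_real_, raw(), endian = "big")
print(na_bytes)          # 7f f0 00 00 00 00 07 a2

# Reinterpret the same 8 bytes as integer64 by changing only the class
# attribute; the stored bits are left untouched.
x <- NA_real_
oldClass(x) <- "integer64"

print(as.character(x))   # "9218868437227407266"
print(is.na(x))          # FALSE: bit64 does not use this bit pattern for NA

# A missing value created by bit64 itself is recognised as NA.
print(is.na(as.integer64(NA)))
```

In other words, 0x7FF00000000007A2 read as a signed 64-bit integer is 9218868437227405312 + 1954 = 9218868437227407266.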

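Am I missing something about integer64, or is this a bug? As noted above, I can work around it with integer64="double", possibly losing some precision as described in the help file. But the unexpected behavior occurs with the default integer64 setting...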
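For reference, the integer64="double" workaround can be sketched as follows. The sample rows are the ones from the question, passed to fread() as a string through input= instead of a file name; the variable names csv_text and DT_dbl are only illustrative.

```r
library(data.table)

# The four sample rows from the question, joined into one string.
csv_text <- paste(
  '2012,276,,0,"S1","001",1,,724135215,1590915056,',
  '2012,276,2,8,"S1","001",1, ,,154598,0',
  '2012,276,2,12,"S1","001",1,NA,5118863,21819477,',
  '2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0',
  sep = "\n"
)

# Read the large-integer columns as double instead of integer64.
# Doubles represent integers exactly only up to 2^53, so this can
# lose precision for larger values.
DT_dbl <- fread(input = csv_text, header = FALSE, integer64 = "double")

print(class(DT_dbl$V8))   # column type after the workaround
print(is.na(DT_dbl$V8))   # which rows of V8 are missing
```

Since the largest value in the sample (9003982501) is far below 2^53, no precision should be lost for this particular data.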

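This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and also on a virtual machine running Kubuntu 13.10 with CRAN R 3.0.2. On Windows I tested with the latest stable data.table from CRAN (1.8.10 as of February 7, 2014) and with 1.8.11 (rev. 1110, 2014-02-04 02:43:19, installed manually from the zip because the R-Forge build is broken); on Linux only with the stable 1.8.10. bit64 is installed and loaded on both machines.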

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bit64_0.9-3       bit_1.1-11        gdata_2.13.2      xts_0.9-7         zoo_1.7-10        nlme_3.1-113      hexbin_1.26.3     lattice_0.20-24   ggplot2_0.9.3.1  
[10] plyr_1.8          reshape2_1.2.2    data.table_1.8.11 Revobase_7.0.0    RevoMods_7.0.0    RevoScaleR_7.0.0 

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4       foreach_1.4.1      gtable_0.1.2       gtools_3.2.1       iterators_1.0.6   
 [9] labeling_0.2       MASS_7.3-29        munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5 reshape_0.8.4      scales_0.2.3       stringr_0.6.2     
[17] tools_3.0.2

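【Comments】:

From the help page: "This feature is still under development." So I would expect the authors to consider this a bug.
Very disappointed that this has not been fixed yet; if there are any missing values, it makes the bit64 package useless with data.tables. I think the problem must be in fread, because I cannot find any way to force the bit64 package to produce that value. It has a perfectly valid NA value: as.integer64(NA) # <NA>
See the related bug: github.com/Rdatatable/data.table/issues/488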

Tags: r data.table

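【Solution 1】:

This bug, #488, is now fixed by a commit in v1.9.5, the development version of data.table. The values are now correctly assigned (and displayed) as NA, provided bit64 is loaded.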

require(data.table) # v1.9.5
require(bit64)
ans = fread("test.csv")
#      V1  V2 V3 V4 V5  V6 V7         V8         V9        V10 V11
# 1: 2012 276 NA  0 S1 001  1         NA  724135215 1590915056  NA
# 2: 2012 276  2  8 S1 001  1         NA         NA     154598   0
# 3: 2012 276  2 12 S1 001  1         NA    5118863   21819477  NA
# 4: 2012 276  2  0 S1 011  8 3127133583 3127133583 9003982501   0
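To confirm the behaviour on a given installation, a quick check along these lines may help. It is only a sketch: it assumes a data.table version that includes the fix, has bit64 installed, and feeds the sample rows from the question to fread() as a string rather than from test.csv.

```r
library(data.table)
library(bit64)

# The four sample rows from the question, joined into one string.
csv_text <- paste(
  '2012,276,,0,"S1","001",1,,724135215,1590915056,',
  '2012,276,2,8,"S1","001",1, ,,154598,0',
  '2012,276,2,12,"S1","001",1,NA,5118863,21819477,',
  '2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0',
  sep = "\n"
)

# Default integer64 = "integer64": the setting that triggered the bug.
ans <- fread(input = csv_text, header = FALSE)

print(packageVersion("data.table"))
print(class(ans$V8))    # should be integer64 when bit64 is available
print(is.na(ans$V8))    # blank, space and NA fields should all be TRUE
print(is.na(ans$V9))
```

If is.na() reports FALSE for the blank fields, the installed version still has the problem and one of the other approaches in this thread is needed.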

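【Comments】:

Thanks for fixing this and for following up here as well as on GitHub.
This is still a problem in some cases and remains unresolved (I would not be here if it had not happened to me ;)). github.com/Rdatatable/data.table/issues/1459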

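【Solution 2】:

This is apparently an issue with the bit64 package, not with fread() or data.table. From the bit64 documentation, http://cran.r-project.org/web/packages/bit64/bit64.pdf:

"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the underlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."

I tried reassigning the 9218868437227407266 value to NA, thinking it would work.

For example: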

DT[V8==9218868437227407266, ]
#actually returns nothing, but
DT[V8==max(V8), ]
#returns the rows with 9218868437227407266 in V8
#but this does not reassign the value 
DT[V8==max(V8), V8:=NA]
#not that this makes sense, but I tried just in case...
DT[V8==max(V8), V8:=NA_character_]

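So, as the documentation states quite clearly, a vector of class integer64 will not recognize NA or missing values. I will avoid bit64 just so I do not have to deal with this...

【Comments】:

Thanks. I do not really need (or use) integers that large and was not aware of the limitation. I guess I will not use bit64 until this is fixed; in my case, NAs are far more frequent than big integers. For the record, DT[as.character(V8)== "9218868437227407266"] also returns the rows with the large value (i.e. the NAs). Also, DT[as.character(V8)== "9218868437227407266", V8 := as.integer64(NA)] seems to do the job.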
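The repair described in the comment above can be wrapped in a small function. This is a sketch only: fix_integer64_na is a hypothetical helper, not part of bit64 or data.table, and the broken vector is built by hand (by writing NA_real_ into the underlying double storage) to imitate what the affected fread() versions returned.

```r
library(bit64)

# Hypothetical helper: replace the mis-read sentinel with a proper integer64 NA.
fix_integer64_na <- function(v) {
  stopifnot(inherits(v, "integer64"))
  sentinel <- "9218868437227407266"     # NA_real_ bits read as a 64-bit integer
  # %in% never returns NA, so entries that are already NA are left alone.
  bad <- as.character(v) %in% sentinel
  v[bad] <- as.integer64(NA)
  v
}

# Imitate the broken column: one real value, one double NA stored under
# the integer64 class.
v <- unclass(as.integer64(c("3127133583", "0")))
v[2] <- NA_real_
oldClass(v) <- "integer64"

fixed <- fix_integer64_na(v)
print(is.na(v))       # the broken entry is not seen as NA
print(is.na(fixed))   # after the repair it is
print(fixed)
```

On a data.table, the same helper could be applied with DT[, V8 := fix_integer64_na(V8)]. Note that a genuine value of 9218868437227407266 would also be turned into NA, so this is only safe when the data cannot contain that number.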