【Posted】: 2018-11-18 11:59:47
【Question】:
I have an annoying CSV file (> 10 GB) that opens on Mac OSX but not on Windows 10.
The code I use:
library(data.table)  # provides fread()
data_in <- fread("my_data.csv")
SessionInfo Windows
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2
[9] ggplot2_2.2.1 tidyverse_1.2.1 RMySQL_0.10.14 DBI_0.8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 cellranger_1.1.0 pillar_1.2.1 compiler_3.4.4 plyr_1.8.4 bindr_0.1.1 tools_3.4.4 lubridate_1.7.2 jsonlite_1.5
[10] nlme_3.1-131.1 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0 psych_1.8.3.3 cli_1.0.0 rstudioapi_0.7 yaml_2.1.18
[19] parallel_3.4.4 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1 hms_0.4.2 grid_3.4.4 glue_1.2.0 R6_2.2.2
[28] readxl_1.0.0 foreign_0.8-69 modelr_0.1.1 reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[37] colorspace_1.3-2 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4 crayon_1.3.4
SessionInfo OSX
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.11.2
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.1.19
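This is the error I get on Windows. I have tried all the suggested fixes without any luck.
Expecting 10 cols, but line 1346596 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved. In addition: Warning message: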
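The error message names two options, fill=TRUE and quote=''. A minimal sketch of trying both is shown here on a tiny made-up file. The temp file and its contents are assumptions for illustration, not the real my_data.csv, and nothing in the output above shows whether either option fixes the real file.

```r
library(data.table)

# Tiny stand-in for my_data.csv (hypothetical contents): line 3 is one
# field short and line 4 contains a stray, unbalanced double quote.
demo <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,alpha,0.5",
             "2,beta",
             "3,ga\"mma,0.7"), demo)

# Attempt 1: pad short rows with NA instead of stopping.
dt_fill <- fread(demo, sep = ",", fill = TRUE)

# Attempt 2: additionally treat quote characters as ordinary text.
dt_noquote <- fread(demo, sep = ",", fill = TRUE, quote = "")

print(dt_noquote)
```

Note that quote = "" also splits genuinely quoted fields at every comma inside them, so the column count should be checked after reading.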
Additional information when using verbose = TRUE on Windows (I tried a smaller file and hit the same problem):
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.004474 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 10 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: ,asin,sale
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 3657 (including 0 at the end)
Count of sep: 138915
nrow = MIN( nsep [138915] / (ncol [10] -1), neol [3657] - endblanks [0] ) = 3657
Type codes (point 0): 1444444340
Type codes (point 1): 1444444340
Type codes (point 2): 1444444340
Type codes (point 3): 1444444340
Type codes (point 4): 1444444344
Type codes (point 5): 1444444344
Type codes (point 6): 1444444344
Type codes (point 7): 1444444344
Type codes (point 8): 1444444344
Type codes (point 9): 1444444344
Type codes (point 10): 1444444444
Type codes: 1444444444 (after applying colClasses and integer64)
Type codes: 1444444444 (after applying drop or select (if supplied)
Allocating 10 column slots (10 - 0 dropped)
Error in fread("md2.csv", verbose = T) :
Expecting 10 cols, but line 3312 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
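Since the error points at a specific line, one way to look at it is to count the fields on every line and print the ones that disagree with the header. The sketch below uses base R's count.fields. The helper name find_ragged_lines and the temp file are hypothetical, and the demo file only stands in for md2.csv.

```r
# Hypothetical helper (not part of data.table): return the physical line
# numbers whose field count differs from the header line's.
find_ragged_lines <- function(path, sep = ",", quote = "\"") {
  n_fields <- count.fields(path, sep = sep, quote = quote,
                           blank.lines.skip = FALSE, comment.char = "")
  # NA marks a line where a quoted field continues onto the next line.
  which(is.na(n_fields) | n_fields != n_fields[1])
}

# Stand-in for md2.csv: line 3 carries one field too many.
demo <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6,7", "8,9,10"), demo)

bad <- find_ragged_lines(demo)
print(bad)
print(readLines(demo)[bad])  # raw text of the flagged lines
```

comment.char is set to "" because count.fields otherwise treats everything after a # as a comment, which would miscount lines that contain URLs. On the full 10 GB file this scan may be slow, so the smaller md2.csv is the more practical target.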
verbose = TRUE on OSX
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=4, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file md2.csv
File opened, size = 4.581MB (4803885 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
File ends abruptly with ','. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,asin,salesRank,imUrl,categori>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 10 fields using quote rule 0
Detected 10 columns on line 1. This line is either column names or first data row. Line starts as: <<,asin,salesRank,imUrl,categori>>
Quote rule picked = 0
fill=false and the most number of columns found is 10
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (4803885 bytes from row 1 to eof) / (2 * 127664 jump0size) == 18
Type codes (jump 000) : 5AAAAAA7A2 Quote rule 0
Type codes (jump 004) : 5AAAAAA7AA Quote rule 0
Type codes (jump 010) : 5AAAAAA7AA Quote rule 0
'header' determined to be true due to column 8 containing a string on row 1 and a lower type (float64) in the rest of the 1041 sample rows
=====
Sampled 1041 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 4803813
Line length: mean=2028.07 sd=3025.66 min=28 max=29901
Estimated number of rows: 4803813 / 2028.07 = 2369
Initial alloc = 4738 rows (2369 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5AAAAAA7AA
[10] Allocate memory for the datatable
Allocating 10 column slots (10 - 0 dropped) with 4738 rows
[11] Read the data
jumps=[0..2), chunk_size=2401906, total_size=4803813
Read 3311 rows x 10 columns from 4.581MB (4803885 bytes) file in 00:00.025 wall clock time
[12] Finalizing the datatable
Type counts:
1 : int32 '5'
1 : float64 '7'
8 : string 'A'
=============================
0.001s ( 2%) Memory map 0.004GB file
0.005s ( 19%) sep=',' ncol=10 and header detection
0.000s ( 0%) Column type detection using 1041 sample rows
0.000s ( 0%) Allocation of 4738 rows x 10 cols (0.000GB) of which 3311 ( 70%) rows used
0.019s ( 78%) Reading 2 chunks (0 swept) of 2.291MB (each chunk 1655 rows) using 2 threads
+ 0.004s ( 15%) Parse to row-major thread buffers (grown 0 times)
+ 0.012s ( 48%) Transpose
+ 0.004s ( 15%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.025s Total
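【Comments】:
-
This is a huge file and, FYI, R works entirely in memory. Is it possible that your Windows machine doesn't have enough free RAM to load the whole file?
-
Can you read it in chunks?
-
Could you include the data.table package version? Also try running fread with verbose=TRUE for extra debugging clues.
-
@MichaelChirico, I've added some information to the original question. Adam, RAM is not the problem (32 GB).
-
Also, I still can't see the package versions. Given that the R versions differ, I worry that the data.table versions differ as well.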
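For reference, the two sessionInfo() listings in the question show data.table_1.10.4-3 on Windows and data.table_1.11.2 on OSX. A small sketch of comparing those versions, and of checking the one installed in the current session, might look like this (the variable names are illustrative):

```r
# Versions copied from the two sessionInfo() listings in the question.
windows_dt <- package_version("1.10.4-3")
osx_dt     <- package_version("1.11.2")

# Is the Windows machine on the older data.table release?
windows_is_older <- windows_dt < osx_dt
print(windows_is_older)

# Version installed in the current session, for comparison on either machine.
installed_dt <- utils::packageVersion("data.table")
message("data.table ", as.character(installed_dt),
        if (installed_dt < osx_dt) " (older than the OSX session)" else "")
```

If the Windows session reports the older version, updating data.table there and rerunning fread with verbose = TRUE would show whether the two machines then behave the same.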
Tags: r csv operating-system data.table fread