【Posted】: 2014-05-16 05:18:19
【Question】:
I imported a CSV file (containing both text and numeric columns):
x <- fread('myfile.csv', header = TRUE, verbose = TRUE, na.strings = c("null", "'null'", ""))
After the import, when I run summary(x), every column is treated as character:
mycolumn
Length:100000
Class :character
Mode :character
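Is there a way to make it recognize the numeric columns as numeric? The verbose output is below (taken from a run with nrows set, to make it faster).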
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 10.162 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep '\t' on line 30 (the last non blank line in the first 'autostart') ... found ok
Found 166 columns
First row with 166 fields occurs on line 1 (either column names or first row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 6513865
Subtracted 1 for last eol and any trailing empty lines, leaving 6513864 data rows
nrow limited to nrows passed in (100000)
Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444441444444444444444444444444444444444444444414444444444444444444444444444444444 (first 5 rows)
Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444441444444444444444444444444444444444444444414444444444444444444444444444444444 (+middle 5 rows)
Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444441444444444444444444444444444444444444444414444444444444444444444444444444444 (+last 5 rows)
Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444441444444444444444444444444444444444444444414444444444444444444444444444444444 (after applying colClasses and integer64)
Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444441444444444444444444444444444444444444444414444444444444444444444444444444444 (after applying drop or select (if supplied)
Allocating 166 column slots (166 - 0 NULL)
Read 100000 rows and 166 (of 166) columns from 10.162 GB file in 00:00:04
0.564s ( 15%) Memory map (rerun may be quicker)
0.001s ( 0%) sep and header detection
1.613s ( 43%) Count rows (wc -l)
0.030s ( 1%) Column type detection (first, middle and last 5 rows)
0.015s ( 0%) Allocation of 100000x166 result (xMB) in RAM
1.437s ( 38%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.080s ( 2%) Changing na.strings to NA
3.739s Total
【Comments】:
-
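The way to specify column classes manually is the colClasses argument. However, fread should be able to guess numeric columns automatically, which makes me think your numeric columns contain entries that are not numbers. Perhaps you have not yet caught every representation of the NA values?
-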
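I will double-check. Null values are represented as the string null, but I already catch that in the call. I am a bit skeptical that every column is being interpreted as character, since some of the values come from mandatory fields (including the primary key), so they should be clean. I will try with a subset.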
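
One way to do this double-check on a subset is to list, for each column, the distinct values that are neither a known NA token nor convertible to a number. The snippet below is only a minimal sketch: the table and its column names (id, price) are hypothetical stand-ins, since the real columns are not shown in the question.

```r
library(data.table)

# Hypothetical stand-in for the imported table; the real column names are unknown.
x <- data.table(
  id    = c("1", "2", "3", "4"),
  price = c("10.5", "NULL", "7", "n/a")
)

# The NA tokens passed to na.strings in the original call.
na_tokens <- c("null", "'null'", "")

# For each column, collect the distinct values that are not NA, not a known
# NA token, and cannot be converted to a number.
offenders <- lapply(x, function(col) {
  vals <- unique(col[!is.na(col) & !(col %in% na_tokens)])
  vals[is.na(suppressWarnings(as.numeric(vals)))]
})

# Keep only the columns that have at least one such value.
offenders <- Filter(length, offenders)
print(offenders)
```

Columns that genuinely hold text will list all of their values here, so the output is most useful for columns that were expected to be numeric. In this toy data the check flags "NULL" and "n/a" in price, which are spellings the na.strings vector does not cover.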
-
@Mattrition You were right. If you post your comment as an answer, I will accept it.
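
Since the diagnosis was that some NA spellings were not being caught, a possible fix could look like the sketch below. It is not the poster's actual solution: the sample file, the column names (id, price, name) and the extra token "NULL" are all assumptions made for illustration. The first option adds the missing token to na.strings so that type detection can succeed. The second leaves the read unchanged and converts a chosen column afterwards. The colClasses argument mentioned in the first comment is another route, but how fread treats a forced numeric class when stray text is present may depend on the data.table version, so it is not used here.

```r
library(data.table)

# Hypothetical sample file standing in for myfile.csv (tab-separated, as the
# verbose log suggests). "NULL" is an assumed extra spelling of a missing value.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id\tprice\tname",
             "1\t10.5\talpha",
             "2\tNULL\tbeta",
             "3\t7\tgamma"), tmp)

# Option 1: list every NA spelling found in the data, so that automatic
# type detection sees only numbers and NAs in the numeric columns.
x1 <- fread(tmp, header = TRUE, sep = "\t",
            na.strings = c("null", "'null'", "NULL", ""))

# Option 2: read as before, then convert selected columns by reference.
# Values that are not numbers become NA (the coercion warning is suppressed).
x2 <- fread(tmp, header = TRUE, sep = "\t",
            na.strings = c("null", "'null'", ""))
num_cols <- "price"  # hypothetical column name
x2[, (num_cols) := lapply(.SD, function(v) suppressWarnings(as.numeric(v))),
   .SDcols = num_cols]

print(sapply(x1, class))
print(sapply(x2, class))
```

Option 1 is usually preferable on a file of this size, because the columns are parsed as numbers during the read and no second pass is needed. Option 2 silently turns every unparseable value into NA, so it is safer to inspect those values first.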
Tags: r data.table