I would suggest exploring read_fwf from the "readr" package. You could do something like this:
library(readr)
len <- nchar(readLines("yourfile.txt", n = 1))
read_fwf("yourfile.txt", fwf_widths(rep(1, len)))
Alternatively, you can try the "iotools" package, which may be faster:
library(iotools)
len <- nchar(readLines("yourfile.txt", n = 1))
input.file("yourfile.txt", formatter = dstrfw,
           col_types = rep("integer", len), widths = rep(1, len))
Here's a small proof of concept:
a <- tempfile()
writeLines("0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101", a)
len <- nchar(readLines(a, n = 1))
library(readr)
read_fwf(a, fwf_widths(rep(1, len)))
# Source: local data frame [3 x 28]
#
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28
# 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
# 2 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1
# 3 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1
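The dimensions of your data do indeed make read_fwf choke. I ran a small test to compare the "iotools" approach with awk + fread.
Here's the sample data (800 rows of 70,000 single-digit columns):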
## Creates a file named "somefile.txt"
set.seed(1)
A <- replicate(10, sample(0:1, 70000, TRUE), FALSE)
A <- sapply(A, paste, collapse = "")
writeLines(rep(A, 800/length(A)), "somefile.txt")
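Here are the functions and the results. I've written them as functions so that you can try them on your actual data and see which one works best for you.
Clearly, readr is out of the running at this stage :-)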
Freadr <- function(infile = "somefile.txt") {
  len <- nchar(readLines(infile, n = 1))
  read_fwf(infile, fwf_widths(rep(1, len)))
}
system.time(temp1 <- Freadr())
# |===============================================================| 100% 53 MB
# user system elapsed
# 466.740 0.384 466.506
Fiotools <- function(infile = "somefile.txt") {
  len <- nchar(readLines(infile, n = 1))
  input.file(infile, formatter = dstrfw,
             col_types = rep("integer", len), widths = rep(1, len))
}
system.time(temp2 <- Fiotools())
# user system elapsed
# 7.248 0.016 7.273
Fawk <- function(infile = "somefile.txt") {
  # Needs library(data.table) for fread(); awk inserts a comma after every character
  cmd <- sprintf("awk '{gsub(/./,\"&,\", $1);print $1}' %s", infile)
  fread(cmd = cmd)
}
system.time(temp3 <- Fawk())
# user system elapsed
# 12.948 0.156 13.109
For that matter, base R doesn't do badly either:
fun4 <- function(infile = "somefile.txt") {
  do.call(rbind, lapply(strsplit(readLines(infile), "", TRUE), as.numeric))
}
system.time(fun4())
# user system elapsed
# 9.056 0.260 9.304
The result is a matrix, so you may need to add a few seconds for converting it to a data.frame or data.table, if that's what you want.
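As a rough sketch of that last conversion step, here is the base R approach on a tiny made-up input. The file contents and the names tmp, m, and df are only for illustration and are not part of the benchmark above; as.integer is used instead of as.numeric simply to keep the values as integers.

```r
# Tiny illustrative input: three lines of four 0/1 characters each
tmp <- tempfile()
writeLines(c("0101", "1010", "1111"), tmp)

# Split every line into single characters and bind the rows into a matrix
m <- do.call(rbind, lapply(strsplit(readLines(tmp), "", fixed = TRUE), as.integer))

# Convert the matrix to a data.frame; columns get default names V1, V2, ...
df <- as.data.frame(m)
str(df)
```

If you prefer a data.table, data.table::as.data.table(m) should do the equivalent conversion, assuming the "data.table" package is installed.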