Read a large file line by line in R without the header: answers

[Question title]: Read Large File line by line in R without header
[Posted]: 2012-12-04 20:45:41
[Question description]:

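I have a very large data file (several gigabytes), and if I try to open it in R I get an out-of-memory error.

I need to read the file line by line and do some analysis. I found a previous question on this topic in which the file was read n lines at a time, jumping to specific lines in clumps. I used the answer by "Nick Sabbe" and made some modifications to fit my needs.

Consider the following sample test.csv file: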

A    B    C
200 19  0.1
400 18  0.1
300 29  0.1
800 88  0.1
600 80  0.1
150 50  0.1
190 33  0.1
270 42  0.1
900 73  0.1
730 95  0.1

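I want to read the contents of the file line by line and analyze them. So I wrote the following loop, based on the code posted by "Nick Sabbe", to do the reading. I have two problems: 1) the header is printed every time a new line is printed. 2) R's index column "X" is also printed, even though I am removing that column.

Here is the code I am using: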

test<-function(){
 prev<-0

for(i in 1:100){
  j<-i-prev
  test1<-read.clump("file.csv",j,i)
  print(test1)
  prev<-i

}
}
####################
# Code by Nick Sabbe
###################
read.clump <- function(file, lines, clump, readFunc=read.csv,
                   skip=(lines*(clump-1))+ifelse((header) & (clump>1) & (!inherits(file, "connection")),1,0),
                   nrows=lines,header=TRUE,...){
if(clump > 1){
colnms<-NULL
if(header)
{
  colnms<-unlist(readFunc(file, nrows=1, header=F))
  #print(colnms)
}
p = readFunc(file, skip = skip,
             nrows = nrows, header=FALSE,...)
if(! is.null(colnms))
{
  colnames(p) = colnms
}
} else {
 p = readFunc(file, skip = skip, nrows = nrows, header=header)
}
p$X<-NULL   # Note: Here I'm setting the index to NULL
return(p)
}

The output I get:

       A       B    C
1      200      19   0.1
  NA   1       1     1
1  2   400     18   0.1
  NA   1       1    1
1  3   300     29   0.1
  NA   1       1    1
1  4   800     88   0.1
  NA   1       1    1
1  5   600     80   0.1

I want to get rid of the extra row that gets read:

 NA   1       1     1

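Also, is there a way to make the for loop stop at end of file (EOF), as in other languages?

[Comments]:

This seems very inefficient. Is it absolutely necessary to do this line by line with a for iterator? Surely you could make life easier by using vectorized computation in R?

Tags: r csv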

[Solution 1]:

Maybe something like this can help you:

inputFile <- "foo.txt"
con  <- file(inputFile, open = "r")
while (length(oneLine <- readLines(con, n = 1)) > 0) {
  myLine <- unlist((strsplit(oneLine, ",")))
  print(myLine)
} 
close(con)

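Or use scan to avoid the splitting, as @MatthewPlourde suggested.

Here I use scan: I skip the header, and quiet = TRUE suppresses the message saying how many items were read.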

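con <- file(inputFile, open = "r")
## read the header once, before the loop; skip = 1 inside the loop would be
## applied on every call to scan and would drop every other data line
header <- readLines(con, n = 1)
while (length(myLine <- scan(con, what = numeric(), nlines = 1, sep = ",", quiet = TRUE)) > 0) {
   ## here I print, but you should process your line here
   print(myLine)
}
close(con)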
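
For the whitespace-separated sample shown in the question, the same connection-based loop can be sketched as follows. This is only a minimal illustration under some assumptions: the file is built in a temporary location from the first rows of the sample, the names `sample_path`, `running_sum` and `n_rows` are made up for the example, and the running column sum stands in for whatever per-line analysis is actually needed. The header is consumed once before the loop, and the loop ends by itself at end of file because readLines then returns a zero-length vector.

```r
# Sketch: stream a whitespace-separated file with a header, one line at a time.
# sample_path, running_sum and n_rows are illustrative names, not from the question.
sample_path <- tempfile(fileext = ".csv")
writeLines(c("A    B    C",
             "200 19  0.1",
             "400 18  0.1",
             "300 29  0.1"), sample_path)

con <- file(sample_path, open = "r")

# Consume the header exactly once, so it never shows up among the data lines
header <- strsplit(trimws(readLines(con, n = 1)), "[[:space:]]+")[[1]]
running_sum <- setNames(numeric(length(header)), header)
n_rows <- 0

# readLines() returns character(0) at end of file, which ends the loop
while (length(one_line <- readLines(con, n = 1)) > 0) {
  if (!nzchar(trimws(one_line))) next   # ignore blank lines
  values <- as.numeric(strsplit(trimws(one_line), "[[:space:]]+")[[1]])
  running_sum <- running_sum + values   # placeholder for the real per-line analysis
  n_rows <- n_rows + 1
}
close(con)

print(n_rows)
print(running_sum)
```

Only one line is held in memory at a time, so the size of the file does not matter, although a loop like this is slow in R compared with reading larger blocks.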

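[Discussion]:

+1, this is how I do it. You could use scan instead of readLines and avoid the splitting.
@MatthewPlourde I updated my answer, but I wonder whether readLines is more efficient.
I tried the answer. I noticed that the result is of type string rather than numeric. Also, how do I get rid of the read message (Read 3 items)? Sample output: [1] "A" "B" "C" Read 3 items [1] "200" "19" "0.1" Read 3 items. The numbers are in quotes, with the read message in between. Thanks for your reply...
@Eng.Mohd I updated my answer, but I suggest you read the help: ?scan
I can change it to "what = integer() or numeric()", but then the problem is the column header?????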

[Solution 2]:

I suggest you look at chunked and disk.frame. Both have functions for reading CSV files.

disk.frame::csv_to_disk.frame may be the function you want.

[Discussion]:
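
The chunk-at-a-time idea behind those packages can also be sketched in base R, without either package. The example below is not the chunked or disk.frame API; it is a rough illustration that assumes a comma-separated file with a header, uses a deliberately tiny `chunk_size`, and uses made-up names (`sample_path`, `col_names`, `totals`). It relies on read.csv continuing from the current position of an already open connection, and it applies a vectorized operation (colSums) to each chunk instead of looping over single lines.

```r
# Sketch: process a delimited file in fixed-size chunks using base R only.
# sample_path, chunk_size, col_names and totals are illustrative names.
sample_path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C",
             "200,19,0.1", "400,18,0.1", "300,29,0.1",
             "800,88,0.1", "600,80,0.1"), sample_path)

chunk_size <- 2   # tiny for the demo; use thousands of rows for a real file
con <- file(sample_path, open = "r")

# Read the header once; every later read continues where the previous one stopped
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
totals <- setNames(numeric(length(col_names)), col_names)

repeat {
  # At end of file read.csv either errors or returns zero rows; treat both as "done"
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  totals <- totals + colSums(chunk)        # vectorized work on the whole chunk
  if (nrow(chunk) < chunk_size) break      # a short chunk means the file is exhausted
}
close(con)

print(totals)
```

Memory use is bounded by the chunk size rather than the file size, and the header is read only once, so it is never repeated in the data.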