【Title】: How do I fill in a matrix (by chunks) using a while loop?
【Posted】: 2021-11-16 11:10:54
【Question】:

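I am trying to read a large dataset in chunks: find the column means of each chunk (each chunk stands for a slice of the larger columns), store those means as a row of a matrix, and then take the mean of the means to get the overall mean of each column. I have it set up, but my while loop does not repeat. I think it may have to do with how I refer to "chunk" and "chunks".

Here is an attempt in R using "iris.csv":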

fl <- file("iris.csv", "r")
clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname,split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r)) # need a matrix that can be filled on each #iteration.
numchunk = 0 #set my chunks of code to build up
while(numchunk <= 0){ #stop when no more chunks left to run
  numchunk <- numchunk + 1 # keep on moving through chunks of code
  x <- readLines(fl, n=100) #read 100 lines at a time
  chunk <- as.numeric(unlist(strsplit(x,split = ","))) # readable chunk of code
  m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
  cm[numchunk,] <- colMeans(m) #get the column means of the matrix and fill in larger matrix
  print(numchunk) # print the number of chunks used
}
cm
close(fl)
final_mean <- colSums(cm)/nrow(cm)
return(final_mean)

-- This works when I set n = 1000, but I want it to work for larger datasets, where the while loop needs to keep running. Can anyone help me fix this?

【Comments】:

    Tags: r matrix while-loop large-data readlines


    【Solution 1】:

    Maybe this helps:

    fl <- file("iris.csv", "r") # open the connection (assumed; not shown in the original snippet)
    clname <- readLines(fl, n=1) # read the header
    r <- unlist(strsplit(clname,split = ","))
    length(r) # get the number of columns in the matrix
    cm <- matrix(NA, nrow=1000, ncol=length(r)) # one row per chunk (room for at most 1000 chunks)
    numchunk = 0
    flag <- TRUE
    while(flag){
      x <- readLines(fl, n=5) # read 5 lines at a time
      print(length(x))
      if(length(x) == 0) {
        flag <- FALSE # no lines left: stop looping
      } else {
        numchunk <- numchunk + 1 # count only chunks that actually contain data
        chunk <- as.numeric(unlist(strsplit(x,split = ","))) # parse the chunk as numbers
        m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
        cm[numchunk,] <- colMeans(m) # store the chunk's column means in the larger matrix
        print(numchunk) # print the number of chunks used
      }
    }
    cm
    close(fl)
    # average only the rows that were filled; the unused rows are still NA
    final_mean <- colMeans(cm[seq_len(numchunk), , drop = FALSE])
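
Note that averaging the chunk means gives the exact overall mean only when every chunk has the same number of rows. The following is a minimal sketch of an alternative that accumulates column sums and a row count instead, so a shorter last chunk is weighted correctly. It assumes a CSV with purely numeric columns and no row names; `chunk_col_means()` and the temporary demo file are hypothetical names introduced here, not part of the answer above.

```r
# Hypothetical demo file: the four numeric iris columns, written without row names
tmp <- tempfile(fileext = ".csv")
write.csv(iris[-5], tmp, row.names = FALSE)

chunk_col_means <- function(path, chunk_size = 40) {
  con <- file(path, "r")
  on.exit(close(con))                        # always release the connection
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  header <- gsub('"', "", header)            # write.csv quotes the column names
  sums <- numeric(length(header))            # running column sums
  n_rows <- 0                                # running row count
  repeat {
    x <- readLines(con, n = chunk_size)
    if (length(x) == 0) break                # end of file reached
    m <- matrix(as.numeric(unlist(strsplit(x, ","))),
                ncol = length(header), byrow = TRUE)
    sums <- sums + colSums(m)
    n_rows <- n_rows + nrow(m)
  }
  setNames(sums / n_rows, header)
}

# 150 rows are not a multiple of 40, so the last chunk holds only 30 rows
overall <- chunk_col_means(tmp, chunk_size = 40)
print(overall)
```

Because only the sums and the count are kept, memory use stays bounded by the chunk size and no upper limit on the number of chunks has to be guessed in advance.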
    

    【Comments】:

      【Solution 2】:

      First, it may help to define a helper function r2v() that splits raw lines into useful vectors.

      r2v <- Vectorize(\(x) {
        ## splits raw lines to vectors
        strsplit(gsub('\\"', '', x), split=",")[[1]][-1]
        })
      

      After opening the file, use system() with a bash command to get the number of rows without reading the file into R (for Windows, see there).

      ## open file
      f <- 'iris.csv'
      fl <- file(f, "r")
      
      ## rows
      (nr <- 
          as.integer(gsub(paste0('\\s', f), '', system(paste('wc -l', f), intern=TRUE))) - 1)
      # nr <- 150  ## alternatively define nrows manually
      # [1] 150
      
      ## columns
      nm <- readLines(fl, n=1) |> r2v()
      (nc <- length(nm))
      # [1] 5
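
Where no shell with `wc` is available, the row count could also be obtained in R itself by reading the file once in chunks and only counting the lines. This is a minimal sketch under that assumption; `count_rows()` and the temporary demo file are hypothetical names, not part of the answer.

```r
# Hypothetical demo file, written the same way as in the "Data" step of this answer
tmp <- tempfile(fileext = ".csv")
write.csv(iris[-5], tmp)

count_rows <- function(path, chunk_size = 10000, header = TRUE) {
  con <- file(path, "r")
  on.exit(close(con))                  # always release the connection
  n <- 0
  repeat {
    k <- length(readLines(con, n = chunk_size))
    if (k == 0) break                  # end of file reached
    n <- n + k                         # keep the count, discard the lines
  }
  n - as.integer(header)               # do not count the header line
}

nr_demo <- count_rows(tmp)
print(nr_demo)
```

This does pass over the file once, but only one chunk of lines is held in memory at a time.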
      

      Next, define a chunk size that divides the number of rows evenly.

      ## define chunk size
      ch_sz <- 50
      stopifnot(nr %% ch_sz == 0)  ## all chunks should be filled
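
If it is not obvious which chunk sizes pass this check, the candidates can be listed directly. A small sketch, assuming the 150 data rows of the example; `max_sz` is a hypothetical upper bound on the chunk size chosen only for illustration.

```r
nr <- 150                                      # number of data rows, as above
max_sz <- 60                                   # hypothetical upper bound on the chunk size
divisors <- which(nr %% seq_len(nr) == 0)      # every chunk size that divides nr evenly
ch_sz_demo <- max(divisors[divisors <= max_sz])
print(divisors)
print(ch_sz_demo)
```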
      

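      Then, using replicate(), we compute rowMeans() chunk by chunk (because the chunks come out transposed), and finally apply rowMeans() once more over all chunk means to get the column means of the whole matrix.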

      ## calculate means chunk-wise
      final_mean <-
        replicate(nr / ch_sz, 
                  rowMeans(type.convert(r2v(readLines(fl, n=ch_sz)), as.is=TRUE))) |>
        rowMeans()
      close(fl)
      

      Verify the result.

      ## test
      all.equal(final_mean, as.numeric(colMeans(iris[-5])))
      # [1] TRUE
      

      Data:

      iris[-5] |>
        write.csv('iris.csv')
      

      【Comments】:
