【Question title】: Split dataset into multiple datasets with random columns in R
【Posted】: 2012-05-31 21:57:22
【Question】:

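I have a large dataset. I want to split it into "n" sub-datasets, each of equal size "s". The last dataset may be smaller than the others if the total is not evenly divisible by the size. The sub-datasets should then be written to the working directory as CSV files.

Consider the following small example: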

set.seed(1234)
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
mydf

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1   3  7  1  9  6  4  7  5  8   2   2   2   8
2   5  3  4  6  9  5  3 10  5   8  10   2  10
3   4  6 10  4  4  6  3  4  2   9   9   2   9
4  10 10  9  4  3  7  7  7 10   6   7  10   2
5  10  3  9  3  2 10  9  6  4   4   4   6   3
6   7  2  8  7  5  5 10 10  9   3   7   8   4
7   3  2  2  7 10  9  2  2 10   1   1  10   4
8   3  9  9  7  3  1  7  6 10   3  10   3   2
9   9  3  6  9  3  2  2  3  4   2   9  10  10
10  6  4  3  3  5  9  3  9 10   7   4   6  10

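I would like to create a function that randomly splits the dataset's columns into n subsets (here, say, of size 3; since there are 13 columns, the last dataset will have 1 column and the other 4 will have 3 columns each) and writes each subset to a text file as a separate dataset.

Here is what I have done: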

set.seed(123)
reshuffled <- sample(1:length(mydf),length(mydf), replace = FALSE)
# just crazy manual divide 
group1 <- reshuffled[1:3]; group2 <- reshuffled[4:6]; group3 <- reshuffled[7:9]
group4 <- reshuffled[10:12]; group5 <-  reshuffled[13]

# just manual 
data1 <- mydf[,group1]; data2 <- mydf[,group2]; ....so on;
# I want to write the dimensions of the dataset in the first row of each dataset
cat (dim(data1))
write.csv(data1, "data1.csv");  write.csv(data2, "data2.csv"); .....so on 

Is it possible to loop this process, since I have to generate 100 sub-datasets?

【Comments】:

    Tags: r dataset split sample


    【Solution 1】:

    There may be a cleaner and simpler solution, but you could try the following:

    mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
    
    ## Number of columns for each sub-dataset
    size <- 3
    
    nb.cols <- ncol(mydf)
    nb.groups <- nb.cols %/% size
    reshuffled <- sample.int(nb.cols, replace=FALSE)
    groups <- c(rep(1:nb.groups, each=size), rep(nb.groups+1, nb.cols %% size))
    dfs <- lapply(split(reshuffled, groups), function(v) mydf[,v,drop=FALSE])
    
    for (i in 1:length(dfs)) write.csv(dfs[[i]], file=paste("data",i,".csv",sep=""))
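
The same idea can be wrapped in a function, together with the dimension line the question asks for at the top of each file. This is only a sketch under assumptions: the helper names `split_columns` and `write_with_dims` are made up for illustration, the groups are built with `ceiling()` rather than the `rep()` call above, and the files go to `tempdir()` instead of the working directory.

```r
# Hypothetical helper: randomly split the columns of `df` into groups of
# `size` columns; the last group holds the remainder, if there is one.
split_columns <- function(df, size) {
  nb.cols <- ncol(df)
  reshuffled <- sample.int(nb.cols)              # random column order
  groups <- ceiling(seq_len(nb.cols) / size)     # 1 1 1 2 2 2 ...
  lapply(split(reshuffled, groups), function(v) df[, v, drop = FALSE])
}

# Hypothetical helper: write the dimensions ("rows cols") on the first
# line, then the data as CSV, through a single open connection.
write_with_dims <- function(d, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines(paste(dim(d), collapse = " "), con)
  write.csv(d, con, row.names = FALSE)
}

set.seed(1234)
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
parts <- split_columns(mydf, size = 3)

out_dir <- tempdir()   # assumption: use getwd() to write to the working directory
for (i in seq_along(parts)) {
  write_with_dims(parts[[i]], file.path(out_dir, paste0("data", i, ".csv")))
}
```

A file written this way can be read back with `read.csv(path, skip = 1)`, which skips the dimension line.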
    

    【Comments】:

      【Solution 2】:

      Just for fun; probably slower than juba's.

      mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
      size <- 3
      by(t(mydf), 
         INDICES=sample(as.numeric(gl((ncol(mydf) %/% size) + 1, size, ncol(mydf))), 
                        ncol(mydf), 
                        replace=FALSE), 
         FUN=function(x) write.csv(t(x), paste(rownames(x), collapse='-'), row.names=F))
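
To see what the `INDICES` argument above holds, it may help to build the group labels on their own. A small sketch, assuming 13 columns and a group size of 3 as in the question (`n_cols`, `labels` and `shuffled` are illustrative names, and the seed is arbitrary):

```r
n_cols <- 13
size <- 3

# gl() builds a factor with 5 levels, each repeated 3 times and cut off
# at length 13, i.e. 1 1 1 2 2 2 3 3 3 4 4 4 5
labels <- as.numeric(gl((n_cols %/% size) + 1, size, n_cols))

# Shuffling the labels assigns each column to a random group while
# keeping the group sizes (3, 3, 3, 3, 1) unchanged
set.seed(42)
shuffled <- sample(labels, n_cols, replace = FALSE)
table(shuffled)
```

`by()` then calls `FUN` once per label value, on the rows of `t(mydf)` (that is, the columns of `mydf`) carrying that label.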
      

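      【Comments】:

        【Solution 3】:

        To split 'mydf' into n nearly equal parts, I drew on this question and its answers: link.

        It creates partition sizes such that the difference between the smallest and the largest partition is as small as possible. In this example that difference equals 1. For comparison:

        Partitioning method 1 - using the 'floor' function (reproducible code not shown here). To divide 100 rows into 7 nearly equal parts, sample floor(100/7) = 14 indices in each of the first 6 iterations. The 7th part is the remainder. This yields:

        14, 14, 14, 14, 14, 14, 16. Sum = 100, max difference = 2

        Partitioning method 2 - using the 'ceiling' function (reproducible code not shown here). Using 'ceiling' instead of 'floor' yields a similar result:

        15, 15, 15, 15, 15, 15, 10. Sum = 100, max difference = 5

        Partitioning method 3 - using the formula from the reference above. With the procedure below, the vector of partition sizes ('sequence_diff') is:

        14, 14, 14, 15, 14, 14, 15. Sum = 100, max difference = 1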
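
The three size vectors above can be reproduced in a few lines. The answer does not show code for methods 1 and 2, so those two lines are only my reading of the description; `n`, `k`, `max_diff` and the `sizes_*` names are illustrative.

```r
n <- 100   # number of rows
k <- 7     # number of partitions

# Method 1 (as described): floor(n / k) for the first k - 1 parts,
# remainder in the last part
sizes_floor <- c(rep(floor(n / k), k - 1), n - floor(n / k) * (k - 1))

# Method 2 (as described): the same with ceiling()
sizes_ceiling <- c(rep(ceiling(n / k), k - 1), n - ceiling(n / k) * (k - 1))

# Method 3: differences between the cumulative boundaries floor(n * j / k)
sizes_formula <- diff(c(0, floor(n * seq_len(k) / k)))

max_diff <- function(s) max(s) - min(s)

sizes_floor     # 14 14 14 14 14 14 16
sizes_ceiling   # 15 15 15 15 15 15 10
sizes_formula   # 14 14 14 15 14 14 15
sapply(list(sizes_floor, sizes_ceiling, sizes_formula), max_diff)   # 2 5 1
```

Because the boundaries in method 3 are floors of evenly spaced values, two partition sizes can never differ by more than 1.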

        R code:

        set.seed(1234)
        #I increased the number of rows in the data frame to 100
        mydf <- data.frame (matrix(sample(x = 1:100, size = 1300, replace = TRUE), 
                            ncol = 13))
        
        index_list      <- list()       #Will store the indices for all partitions
        indices         <- 1:nrow(mydf) #Initially contains all indices for the dataset 'mydf'
        numb_partitions <- 7            #Specifies the number of partitions
        
        sequence <- floor(((nrow(mydf)*1:numb_partitions)/numb_partitions))
        sequence <- c(0, sequence)
        
        #'sequence_diff' will contain the number of instances for each partition.
        sequence_diff <- vector()
        for(j in 1:numb_partitions){
            sequence_diff[j] <- sequence[j+1] - sequence[j]   
        }  
        
        #Inspect 'sequence_diff' and verify its elements sum up to the total 
        #number of rows in 'mydf' (100).
        sequence_diff
        #[1] 14 14 14 15 14 14 15
        sum(sequence_diff)
        #[1] 100 #Correct!
        
        for(i in 1:numb_partitions){
        
          #Use a different seed for each sampling iteration.
          set.seed(seed = i)
        
          #Sample 'sequence_diff[i]' indices from the remaining 'indices'.
          indices_partition <- sample(x = indices, 
                                      size = sequence_diff[i], 
                                      replace = FALSE)
        
          #Remove the selected indices from 'indices' so these indices will not be 
          #selected in successive iterations.
          indices           <- setdiff(x = indices, y = indices_partition)
        
          #Store the indices for the i-th iteration in the list 'index_list'. This 
          #is just to verify later that 
          #the procedure has divided all indices in 'numb_partitions' disjunct sets.
          index_list[[i]]   <- indices_partition
        
          #Dynamically create a new object that is named 'mydfx' in which x is the 
          #i-th partition. 
          assign(x = paste0("mydf", i), value = mydf[indices_partition,])
        
          write.csv(x = get(x = paste0("mydf", i)),  #Dynamically get the object from environment.
                    file = paste0("mydf", i,".csv"), #Dynamically assign a name to the csv-file.
                    row.names = FALSE)               #'sep' and 'col.names' are fixed by write.csv.
        }
        
        #Check whether all index subsets are mutually exclusive: union should have 100 
        #unique elements. 
        length(unique(unlist(index_list)))
        #[1] 100 #Correct!
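
If the numbered objects `mydf1`, `mydf2`, ... are not needed in the workspace, the same partitioning can be done without `assign()`/`get()` by shuffling the row indices once and cutting them into consecutive chunks. This is an alternative sketch, not the code above: `sizes`, `part_id`, `partitions` and `out_dir` are illustrative names, and the files go to `tempdir()` as an assumption.

```r
set.seed(1234)
mydf <- data.frame(matrix(sample(x = 1:100, size = 1300, replace = TRUE),
                          ncol = 13))
numb_partitions <- 7

# Partition sizes, computed as in the answer above (14 14 14 15 14 14 15)
sizes <- diff(c(0, floor(nrow(mydf) * seq_len(numb_partitions) / numb_partitions)))

# Shuffle all row indices once, then label consecutive runs of the
# shuffled vector with their partition number
shuffled   <- sample.int(nrow(mydf))
part_id    <- rep(seq_len(numb_partitions), times = sizes)
index_list <- split(shuffled, part_id)

# One data frame per partition, kept together in a list
partitions <- lapply(index_list, function(idx) mydf[idx, , drop = FALSE])

out_dir <- tempdir()   # assumption: use getwd() to write to the working directory
for (i in seq_along(partitions)) {
  write.csv(partitions[[i]],
            file = file.path(out_dir, paste0("mydf", i, ".csv")),
            row.names = FALSE)
}
```

Because every index appears exactly once in `shuffled`, the partitions are disjoint by construction, and a single `set.seed()` call is enough to make the split reproducible.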
        

        【Comments】:
