R loop for filtering through each column: answers

【Question title】: R loop for filtering through each column
【Posted】: 2020-06-02 23:29:42
【Question】:

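I have a data frame like this: [gene expression data frame]. Suppose the column names are different samples and the row names are different genes. Now I want to know how many genes remain after filtering on each column. For example,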

sample1_more_than_5 <- df[(df[,1]>5),]
sample1_more_than_10 <- df[(df[,1]>10),]
sample1_more_than_20 <- df[(df[,1]>20),]
sample1_more_than_30 <- df[(df[,1]>30),]

and then,

sample2_more_than_5 <- df[(df[,2]>5),]
sample2_more_than_10 <- df[(df[,2]>10),]
sample2_more_than_20 <- df[(df[,2]>20),]
sample2_more_than_30 <- df[(df[,2]>30),]

But I don't want to repeat this 100 times, since I have 100 samples. Could anyone write a loop for this situation? Thanks.

【Comments】:

Could you make this a minimal reproducible example?

Tags: r dataframe

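【Solution 1】:

Here is a solution using two loops. For each sample (column), it counts the genes (rows) whose value is greater than each value in the nums vector.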

# Thresholds used to filter each column
nums <- c(5, 10, 20, 30)

# Loop over the columns (samples)
resul <- apply(df, 2, function(x) {
  # For each threshold, count the rows (genes) whose value exceeds it;
  # na.rm = TRUE keeps missing values from being counted
  sapply(nums, function(y) {
    sum(x > y, na.rm = TRUE)
  })
})

# Convert the result to a data.frame with the thresholds in the first column
resul <- data.frame(greaterthan = nums,
                    as.data.frame(resul))
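
To make the shape of the result concrete, here is a minimal sketch on a made-up data frame. The values, gene names, and sample names are hypothetical and only stand in for the real expression data. The count uses `sum(x > y, na.rm = TRUE)`, which skips missing values.

```r
# Hypothetical toy data: 5 genes (rows) x 2 samples (columns)
df <- data.frame(
  sample1 = c(3, 8, 15, 25, 40),
  sample2 = c(6, 6, 12, 31, 50),
  row.names = paste0("gene", 1:5)
)
nums <- c(5, 10, 20, 30)

# For every sample (column) and every threshold, count the genes above it.
# The outer sapply returns a matrix: one row per threshold, one column per sample.
counts <- sapply(df, function(x) {
  sapply(nums, function(y) sum(x > y, na.rm = TRUE))
})

# Put the thresholds in the first column
resul <- data.frame(greaterthan = nums, counts)
print(resul)
```

Each row of `resul` corresponds to one threshold and each remaining column to one sample, so 100 samples would give a table of 4 rows and 101 columns.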

【Comments】:

This is exactly what I needed. Thank you so much. Have a nice day!

【Solution 2】:

We can loop over the columns and use cut to create the groups

lst1 <- lapply(df, function(x) split(x, cut(x, breaks = c(5, 10, 20, 30))))

or use findInterval and then split

lst1 <- lapply(df, function(x) split(x, findInterval(x,  c(5, 10, 20, 30))))

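If we create objects the way the OP's post does, there will be 100 * 4, i.e. 400, objects in the global environment (for 100 columns). Instead, everything can live in a single list object.

The individual objects can be created, but this is not recommended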

v1 <- c(5, 10, 20, 30)
v2 <- seq_along(df)
for (i in v2) {
  for (j in v1) {
    assign(sprintf('sample%d_more_than_%d', i, j),
           value = df[df[, i] > j, , drop = FALSE])
  }
}
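
As one possible way to keep the same filtered data frames in a single list rather than in 400 global objects, the sketch below builds a nested list indexed by sample and then by threshold. The toy data and the names `filtered` and `more_than_<n>` are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical toy data: 5 genes (rows) x 2 samples (columns)
df <- data.frame(
  sample1 = c(3, 8, 15, 25, 40),
  sample2 = c(6, 6, 12, 31, 50),
  row.names = paste0("gene", 1:5)
)
v1 <- c(5, 10, 20, 30)

# Outer level: one element per sample; inner level: one data frame per threshold
filtered <- lapply(df, function(col) {
  out <- lapply(v1, function(j) {
    # which() drops NA comparisons, so missing values never select a row
    df[which(col > j), , drop = FALSE]
  })
  names(out) <- paste0("more_than_", v1)
  out
})

# Access one filtered data frame, or count the remaining genes everywhere
nrow(filtered$sample1$more_than_10)
sapply(filtered, function(s) sapply(s, nrow))
```

The element `filtered$sample1$more_than_10` holds the same rows that `sample1_more_than_10` would hold in the `assign` version, but everything stays in one object that can be looped over.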

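【Comments】:

Thanks for your answer. The for loop works. For lst1, however, it did not produce the values greater than 30. It only gave the number of genes with 5-10, 10-20, and 20-30 sequencing reads.
@JinyongHuang I think you need to add Inf at the end, i.e. c(5, 10, 20, 30, Inf)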
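
A small sketch of what adding Inf changes, using a made-up vector for a single sample. Note that cut produces disjoint bins, so getting "more than" counts like those in the question takes one extra step: a cumulative sum taken from the top bin downwards. That cumulative step is an addition for illustration and is not part of the answer above.

```r
# Hypothetical expression values for one sample
x <- c(3, 8, 15, 25, 40)

# With Inf as the last break, values above 30 get their own bin;
# values of 5 or less fall outside all bins and become NA
bins <- cut(x, breaks = c(5, 10, 20, 30, Inf))
per_bin <- table(bins)
print(per_bin)

# Cumulative counts from the highest bin down: genes with a value
# greater than 5, 10, 20 and 30 respectively
more_than <- rev(cumsum(rev(per_bin)))
print(more_than)
```

Because cut uses right-closed intervals by default, a value exactly equal to a break falls in the lower bin, which matches the strict `>` comparisons in the question.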