Compute with data.table: how many pairwise (2x2) non-NA values are there among the variables? Answers

【Question title】: Compute with data.table: how many 2x2 non-NA values are there among the variables?
【Posted】: 2015-07-09 09:11:43
【Question description】:

Suppose I have this data.table (the actual data is 25061 x 5862):

require(data.table)
df
#       gene     P1     P2     P3     P4     P5
#  1:  gene1  0.111  0.319  0.151     NA -0.397
#  2: gene10  1.627  2.252  1.462 -1.339 -0.644
#  3:  gene2 -1.766 -0.056 -0.369  1.910  0.981
#  4:  gene3 -1.346  1.283  0.322 -0.465  0.403
#  5:  gene4 -0.783     NA -0.005  1.761  0.066
#  6:  gene5  0.386 -0.309 -0.886 -0.072  0.161
#  7:  gene6  0.547 -0.144 -0.725 -0.133  1.059
#  8:  gene7  0.785 -1.827  0.986  1.555 -0.798
#  9:  gene8 -0.186     NA  0.401  0.900 -1.075
# 10:  gene9 -0.177  1.497 -1.370 -1.628 -1.044

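I would like to know how to take advantage of the data.table structure to compute efficiently, for each pair of genes, how many samples have a non-NA value for both genes. For example, for gene1 and gene2, I want the result to be 4.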
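
To make the target quantity concrete, here is a minimal sketch that counts by hand the samples in which two genes both have a value. The vectors are the gene1 and gene2 rows of the example table; the names g1, g2, both_present and n_pairs are introduced here only for illustration.

```r
# values of gene1 and gene2 across P1..P5, copied from the example table
g1 <- c(P1 = 0.111, P2 = 0.319, P3 = 0.151, P4 = NA, P5 = -0.397)
g2 <- c(P1 = -1.766, P2 = -0.056, P3 = -0.369, P4 = 1.910, P5 = 0.981)

# a sample counts only when both genes have a value for it
both_present <- !is.na(g1) & !is.na(g2)
n_pairs <- sum(both_present)
n_pairs  # 4: P4 is dropped because gene1 is NA there
```

Repeating this for every pair of rows gives the full gene-by-gene count matrix.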

With base R, I do this:

calc_nonNA <- !is.na(df[, -1, with = FALSE])
Effectifs <- calc_nonNA %*% t(calc_nonNA)
# or, as suggested by @DavidArenburg and @Khashaa, more efficiently:
Effectifs <- tcrossprod(calc_nonNA)

However, when df is large, this takes hours...

The output I want, for the example provided, is:

       gene1 gene10 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
gene1      4      4     4     4     3     4     4     4     3     4
gene10     4      5     5     5     4     5     5     5     4     5
gene2      4      5     5     5     4     5     5     5     4     5
gene3      4      5     5     5     4     5     5     5     4     5
gene4      3      4     4     4     4     4     4     4     4     4
gene5      4      5     5     5     4     5     5     5     4     5
gene6      4      5     5     5     4     5     5     5     4     5
gene7      4      5     5     5     4     5     5     5     4     5
gene8      3      4     4     4     4     4     4     4     4     4
gene9      4      5     5     5     4     5     5     5     4     5
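
As a check on the table above, the following sketch rebuilds it from the example values using a plain numeric matrix rather than a data.table. The object names m, present and counts are hypothetical. Putting the gene ids in the row names is an assumption made here so that tcrossprod() labels both dimensions of the result.

```r
# example values as a numeric matrix, one row per gene
m <- rbind(
  gene1  = c( 0.111,  0.319,  0.151,     NA, -0.397),
  gene10 = c( 1.627,  2.252,  1.462, -1.339, -0.644),
  gene2  = c(-1.766, -0.056, -0.369,  1.910,  0.981),
  gene3  = c(-1.346,  1.283,  0.322, -0.465,  0.403),
  gene4  = c(-0.783,     NA, -0.005,  1.761,  0.066),
  gene5  = c( 0.386, -0.309, -0.886, -0.072,  0.161),
  gene6  = c( 0.547, -0.144, -0.725, -0.133,  1.059),
  gene7  = c( 0.785, -1.827,  0.986,  1.555, -0.798),
  gene8  = c(-0.186,     NA,  0.401,  0.900, -1.075),
  gene9  = c(-0.177,  1.497, -1.370, -1.628, -1.044)
)
colnames(m) <- paste0("P", 1:5)

# TRUE where a value is present; rows keep the gene names
present <- !is.na(m)

# entry [i, j] = number of samples where gene i and gene j are both non-NA
counts <- tcrossprod(present)
counts["gene1", "gene2"]  # 4
```

The diagonal holds the number of non-NA values per gene, and the matrix is symmetric, so only one triangle carries distinct information.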

Data

df <- structure(list(gene = c("gene1", "gene10", "gene2", "gene3", 
"gene4", "gene5", "gene6", "gene7", "gene8", "gene9"), P1 = c(0.111, 
1.627, -1.766, -1.346, -0.783, 0.386, 0.547, 0.785, -0.186, -0.177
), P2 = c(0.319, 2.252, -0.056, 1.283, NA, -0.309, -0.144, -1.827, 
NA, 1.497), P3 = c(0.151, 1.462, -0.369, 0.322, -0.005, -0.886, 
-0.725, 0.986, 0.401, -1.37), P4 = c(NA, -1.339, 1.91, -0.465, 
1.761, -0.072, -0.133, 1.555, 0.9, -1.628), P5 = c(-0.397, -0.644, 
0.981, 0.403, 0.066, 0.161, 1.059, -0.798, -1.075, -1.044)), .Names = c("gene", 
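"P1", "P2", "P3", "P4", "P5"), class = "data.frame", row.names = c(NA, -10L))
# the original dput() output ended with a non-parseable .internal.selfref pointer;
# setDT() converts the data.frame to a data.table by reference instead
setDT(df)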

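【Comments on the question】:

tcrossprod(x) is faster than x %*% t(x)
@Frank, the df I am currently working with is 25061 x 5862 (including the column with the names), but I may later work with dfs that have more rows and/or more columns. I am also running these computations on dfs with the same number of rows but fewer columns.
Oh, oops, I noticed you had already mentioned that in an earlier comment. You might want to put something about it in the question itself. I guess the frequency of NAs could also be relevant: sum(calc_nonNA)/length(calc_nonNA). I am curious to see what people come up with, but I have no ideas myself.
@Frank OK for the edit. I don't like the far end of your comment ;-). Regarding the NAs, they are rather "random": some rows have none, while others have a lot...
@CathG, alas, sparse methods do not do well with that kind of sparsity.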
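
For readers who want to experiment with the NA-frequency idea raised in these comments, here is a hedged sketch. It computes the share of non-NA cells and then obtains the pair counts from the missing cells through inclusion-exclusion: for p samples, count(i, j) = p - miss_i - miss_j + both_missing(i, j). The data are random stand-ins, and every object name is hypothetical. Whether this helps in practice would depend on storing the missing-value indicator in a sparse format, which is not tested here.

```r
set.seed(1)
# hypothetical stand-in for the real data: 50 genes x 20 samples, 50 cells set to NA
m <- matrix(rnorm(50 * 20), nrow = 50)
m[sample(length(m), 50)] <- NA

present <- !is.na(m)
na_share <- 1 - mean(present)  # 0.05 here by construction (50 of 1000 cells)

# complement identity: work with the (rarer) missing cells instead
miss <- is.na(m) * 1           # 1 where a value is missing, 0 otherwise
n_miss <- rowSums(miss)        # number of missing cells per gene
counts_via_na <- ncol(m) - outer(n_miss, n_miss, "+") + tcrossprod(miss)

# agrees with the direct computation on the non-NA indicator
all(counts_via_na == tcrossprod(present))
```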

Tags: r data.table

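【Solution 1】:

Using dplyr (with tidyr): reshape the data from wide to long, then join it to itself and summarise. I am not sure whether this is more efficient than your solution; would anyone care to benchmark it?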

library(dplyr)
library(tidyr)

# reshaping from wide to long
x <- df %>% gather(key = P, value = value, -c(1)) %>% 
  mutate(value=(!is.na(value)))

# result
left_join(x,x,by="P") %>% 
  group_by(gene.x,gene.y) %>% 
  summarise(N=sum(value.x & value.y)) %>% 
  spread(gene.y,N)

Edit: unfortunately, this dplyr solution fails on a larger 2600x600 dataset. The self-join cannot complete: "internal vecseq reached physical limit", i.e. about 2^31 rows...
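
The failure is consistent with simple row-count arithmetic, sketched below. It assumes that "2600x600" means 2600 genes (rows) and 600 samples (columns); the variable names are illustrative only. Joining the long table to itself on P pairs every gene with every gene within each sample.

```r
# assumed reading of "2600x600": 2600 genes (rows), 600 samples (columns)
n_genes <- 2600
n_samples <- 600

rows_long <- n_genes * n_samples        # long table after gather(): 1,560,000 rows
rows_joined <- n_genes^2 * n_samples    # self-join on P: n_genes^2 rows per sample

rows_joined                             # 4.056e9
rows_joined > .Machine$integer.max      # TRUE: more than 2^31 - 1 rows
```

Under that assumption the joined table would need about 4 billion rows, roughly twice the limit mentioned in the error message.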

By the way, here is a benchmark of t versus tcrossprod:

library(ggplot2)
library(microbenchmark)

op <- microbenchmark(
  BASE_t = {
    calc_nonNA <- !is.na(df[, -1, with = FALSE])
    calc_nonNA %*% t(calc_nonNA)
  },
  BASE_tcrossprod = {
    calc_nonNA <- !is.na(df[, -1, with = FALSE])
    tcrossprod(calc_nonNA)
  },
  times = 10
)

qplot(y=time, data=op, colour=expr) + scale_y_log10()
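
Before comparing timings, it is worth confirming that both expressions return the same matrix. The sketch below does so on random data, using only base R; the matrix size, NA rate and object names are arbitrary choices made for illustration.

```r
set.seed(42)
# hypothetical test matrix: 200 genes x 30 samples, 600 cells (10%) set to NA
m <- matrix(rnorm(200 * 30), nrow = 200)
m[sample(length(m), 600)] <- NA
present <- !is.na(m)

# explicit transpose followed by a general matrix product
res_t <- present %*% t(present)
# same product without materialising t(present)
res_tcp <- tcrossprod(present)

all.equal(res_t, res_tcp)  # TRUE
```

Since the results agree, the choice between the two is purely a matter of speed and memory.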

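【Comments】:

Thanks for the benchmark! So I will definitely at least switch to tcrossprod!

【Solution 2】:

I tried this with random data of size 25061x5862 and it quickly consumed 50 GB of memory (including swap space). It is therefore far less memory-efficient than the tcrossprod approach, but if you have a very large amount of memory, then maybe (though probably not) it could be faster.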

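#df1 is never defined in this answer; it is assumed here to be a copy of the question's df
df1<-copy(df)
#generate cross columns for all matches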
crossDT<-data.table(gene=rep(df1[,unique(gene)],length(df1[,unique(gene)])),gene2=rep(df1[,unique(gene)],each=length(df1[,unique(gene)])))
#create datatable with row for each combo
df2<-merge(df1,crossDT,by="gene")
setkey(df2,gene2)
setkey(df1,gene)
#make datatable with a set of P columns for each gene
df3<-df1[df2]
#find middle column and then make name vectors
pivotcol<-match("i.gene",names(df3))
names1<-names(df3)[2:(pivotcol-1)]
names2<-names(df3)[(pivotcol+1):ncol(df3)]
names3<-paste0("new",names1)
#make third set of P columns where the new value is False if either of the previous sets of P columns is NA
df3[,(names3):=lapply(1:length(names1),function(x) !any(is.na(c(get(names1[x]),get(names2[x]))))),by=c("gene","i.gene")]
#delete first sets of P columns
df3[,c(names1,names2):=NULL]
#sum up true columns
df3[,Sum:=rowSums(.SD),.SDcols=names3]
#delete set of P columns
df3[,(names3):=NULL]
#cast results to desired shape
dcast.data.table(df3,gene~i.gene,value.var='Sum')
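
A back-of-the-envelope estimate suggests why this approach exhausts memory at the question's scale. The sketch assumes 8-byte doubles and counts only the two sets of P columns in df3, ignoring the gene-name columns and the temporary new* columns, so it is a lower bound. The variable names are illustrative.

```r
n_genes <- 25061
n_samples <- 5861       # 5862 columns minus the gene-name column
bytes_double <- 8

# df3 has one row per gene pair and two sets of sample columns
bytes_df3 <- n_genes^2 * (2 * n_samples) * bytes_double
# the tcrossprod result is a single n_genes x n_genes double matrix
bytes_result <- n_genes^2 * bytes_double

bytes_df3 / 1e12        # roughly 59 (TB)
bytes_result / 1e9      # roughly 5 (GB)
```

On these assumptions the intermediate table is about four orders of magnitude larger than the final result, which matches the observation that 50 GB was consumed quickly.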

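【Comments】:

Thank you very much for your answer. I confirm that I do not have that much RAM (more like 12 GB ;-)), but this is still interesting. And thanks for the link to the other question.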