【问题标题】:Can I make this dplyr + data.table task faster?我可以让这个 dplyr + data.table 任务更快吗?
【发布时间】:2013-11-02 08:15:38
【问题描述】:

我想这更像是一个dplyr 而不是plyr 的问题。为了速度,我在我编写的一些代码中使用了data.table。在中间步骤中,我有一个包含大约 32,000 行的基因组数据的表:

> bedbin.dt
Source: local data table [32,138 x 4]
Groups: chr

   bin   start           site chr
1    2 3500000         ssCTCF   1
2    3 4000000 ssCTCF+Cohesin   1
3    3 4000000         ssCTCF   1
4    4 4500000         ucCTCF   1
5    4 4500000 ssCTCF+Cohesin   1
6    4 4500000 ssCTCF+Cohesin   1
7    4 4500000 ssCTCF+Cohesin   1
8    4 4500000         ssCTCF   1
9    4 4500000         ssCTCF   1
10   5 5000000         ssCTCF   1
.. ...     ...            ... ...

编辑

或者像这样的前一百行数据(感谢 Ricardo Saporta 的说明)

bedbin.dt <- data.table(structure(list(bin = c("2", "3", "3", "4", "4", "4", "4", "4","4", "5", "5", "7", "7", "7", "7", "7", "7", "8", "8", "9", "9","11", "12", "14", "14", "14", "14", "14", "14", "14", "14", "15","15", "15", "15", "15", "15", "15", "15", "15", "15", "16", "16","17", "17", "17", "18", "20", "20", "20", "21", "21", "21", "21","21", "21", "21", "21", "21", "21", "22", "22", "5057", "5057","5057", "5057", "5059", "5059", "5059", "5059", "5059", "5060","5060", "5060", "5060", "5060", "5060", "5061", "5063", "5063","5064", "5064", "5064", "5064", "5064", "5064", "5064", "5064","5064", "5064", "5064", "5064", "5064", "5064", "5064", "5064","5064", "5064", "5064", "5064"), start = c(3500000L, 4000000L,4000000L, 4500000L, 4500000L, 4500000L, 4500000L, 4500000L, 4500000L,5000000L, 5000000L, 6000000L, 6000000L, 6000000L, 6000000L, 6000000L,6000000L, 6500000L, 6500000L, 7000000L, 7000000L, 8000000L, 8500000L,9500000L, 9500000L, 9500000L, 9500000L, 9500000L, 9500000L, 9500000L,9500000L, 10000000L, 10000000L, 10000000L, 10000000L, 10000000L,10000000L, 10000000L, 10000000L, 10000000L, 10000000L, 10500000L,10500000L, 11000000L, 11000000L, 11000000L, 11500000L, 12500000L,12500000L, 12500000L, 13000000L, 13000000L, 13000000L, 13000000L,13000000L, 13000000L, 13000000L, 13000000L, 13000000L, 13000000L,13500000L, 13500000L, 162500000L, 162500000L, 162500000L, 162500000L,163500000L, 163500000L, 163500000L, 163500000L, 163500000L, 164000000L,164000000L, 164000000L, 164000000L, 164000000L, 164000000L, 164500000L,165500000L, 165500000L, 166000000L, 166000000L, 166000000L, 166000000L,166000000L, 166000000L, 166000000L, 166000000L, 166000000L, 166000000L,166000000L, 166000000L, 166000000L, 166000000L, 166000000L, 166000000L,166000000L, 166000000L, 166000000L, 166000000L), site = c("ssCTCF","ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ssCTCF+Cohesin", "ssCTCF+Cohesin","ssCTCF+Cohesin", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF+Cohesin","ssCTCF", "ssCTCF+Cohesin", "ssCTCF+Cohesin", "ssCTCF", "ucCTCF","ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF", "ssCTCF+Cohesin", "ssCTCF","ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF+Cohesin","ssCTCF", "ssCTCF+Cohesin", "ssCTCF+Cohesin", "ssCTCF+Cohesin","ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ssCTCF+Cohesin", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF+Cohesin", "ssCTCF", "ssCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF+Cohesin", "ucCTCF", "ssCTCF","ssCTCF+Cohesin", "ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ssCTCF","ssCTCF+Cohesin", "ssCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF","ucCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF", "ucCTCF","ucCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ucCTCF","ucCTCF", "ssCTCF+Cohesin", "ucCTCF", "ucCTCF", "ucCTCF"), chr = structure(c(1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 20L, 20L,20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L,20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L,20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L), .Label = c("1","10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "2","3", "4", "5", "6", "7", "8", "9", "X"), class = "factor")), .Names = c("bin","start", "site", "chr"), sorted = "chr", class = c("data.table","data.frame"), row.names = c(NA, -100L)), key='chr')

结束编辑

接下来我想创建每一行与其他行的所有可能组合(按 chr 分组)。这会在其他数据上形成一个查询(连接),所以我认为最好(也是最简单)预先计算:

# grouped by chr column
bedbin.dt = group_by(bedbin.dt, chr)

# an outer like function
outerFun= function(dt)
  {
   unique(data.table(
    x=dt[rep(1:nrow(dt),each =nrow(dt)),],
    y=dt[rep.int(1:nrow(dt),times=nrow(dt)),]))
  }

> system.time((outer.bedbin.dt = do(bedbin.dt, outerFun1)))
   user  system elapsed 
 90.607  13.993 105.536

在我看来,这太慢了www...虽然与使用data.frameby()lapply() 等基本函数相比,它要快得多。然而,这实际上是我正在测试的一个小型数据集。

所以...我想知道是否有人对更快版本的 outerFun 有任何想法???有没有比rep()rep.int() 更快的方法?

【问题讨论】:

  • 嗨,你能发布一个可重现的例子吗? -- 你可以使用reproduce(&lt;your data&gt;)。说明在这里:bit.ly/SORepro - How to make a great R reproducible example
  • @RicardoSaporta,您好,我不确定如何发布完全可重现的示例。让我考虑一下。也许我可以先编写代码来创建它?请稍等一下……
  • 看看我发布的链接。你可以简单地使用reproduce(bedbin.dt, rows=100, cols=c("bin", "start", ..etc))
  • 不清楚你在做什么,但看起来你可能只是在寻找CJ 函数?
  • 所以你想要一个交叉连接?你能解释一下为什么吗?通常,您不希望组合爆炸您拥有的数据量。

标签: r data.table dplyr


【解决方案1】:

正如 Ricardo 指出的那样,听起来您只是想要这个:

bedbin.dt[, CJ(1:.N, 1:.N), by = chr]

【讨论】:

  • 我认为您应该将其存储在一个变量中,例如 tt,然后执行:cbind(dt[tt$V1], dt[tt$V2]),但不确定。
  • 您好,在测试数据上,这对我自己的函数 (1936) 给出了不同的答案(5288 行) - 而不是我想要的所有列。然而,CJ 看起来确实是一个很有前途的功能......但我在 data.table 语法上真的很弱,我需要去谷歌搜索一些,直到我明白这里发生了什么。谢谢。
  • @StephenHenderson 抱歉,我无法安装 dplyr 并且不确定该代码的作用 - 这是我的最佳猜测 - CJ 大致相同 expand.grid 顺便说一句跨度>
  • @eddi 是的,我认为它接近我想要的并且更快一点..但我不认为它是 1:.N 因为例如原始数据中没有 bin = 1但在你的答案中有。所以我可以使用你所做的 - 不知道 CJ() - 但只需要生成语法来自己解决它。再次感谢
  • @StephenHenderson,如果您获取少量数据并向我们展示您想要的输出,这可能会有所帮助。
猜你喜欢
  • 2017-11-06
  • 2011-11-03
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2021-11-15
  • 2020-07-01
  • 2016-07-12
相关资源
最近更新 更多