【Question title】: R data.table fast sum of random samples by row
【Posted】: 2019-09-16 06:50:02
【Question】:

I have:

require(data.table)

dataDT <- data.table(ID = 1:4, Num_Times = c(7, 9, 10, 13))
dataDT              # the main data
   ID Num_Times
1:  1         7
2:  2         9
3:  3        10
4:  4        13


probabilityDT <- data.table(val = 1:3, prob = c(0.5, 0.3, 0.2))
probabilityDT       # the probability table
   val prob
1:   1  0.5
2:   2  0.3
3:   3  0.2

I want to do the following:

For each row, draw Num_Times samples and compute their sum:

valTemp <- c()
set.seed(999)
for (i in 1:nrow(dataDT)) {

  # sample size
  num_times <- dataDT[i, Num_Times]

  # get samples
  Temp1 <- sample(x = probabilityDT[["val"]], size = num_times, replace = TRUE, prob = probabilityDT[["prob"]])

  # get sum
  Temp1 <- sum(Temp1)

  valTemp <- c(valTemp, Temp1)
}

dataDT[, sample_sum := valTemp]
dataDT
   ID Num_Times sample_sum
1:  1         7         12
2:  2         9         14
3:  3        10         20
4:  4        13         25

How can I do this more efficiently? I have about 500k rows. Can this operation be fully vectorized?

【Comments】:

    Tags: r sum data.table row sampling


    【Solution 1】:

    See whether this is faster:

    set.seed(999)
    sample_all <- sample(probabilityDT[["val"]],
                         size = sum(dataDT[["Num_Times"]]),  # draw all values at once
                         replace = TRUE, prob = probabilityDT[["prob"]])
    
    res <- data.table(sample_all, ID = rep(dataDT[["ID"]], dataDT[["Num_Times"]]))
    res <- res[, .(sample_sum = sum(sample_all)), by = "ID"]
    
    dataDT[res, sample_sum := i.sample_sum, on = "ID"]
    #   ID Num_Times sample_sum
    #1:  1         7         12
    #2:  2         9         14
    #3:  3        10         20
    #4:  4        13         25
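
    The grouping step above can also be done without building an intermediate table. What follows is only a minimal sketch of one alternative, not the answer's method: the helper name `sum_samples_by_row` and its arguments are hypothetical. It assumes the per-row sample sizes are non-negative whole numbers and takes each row's sum as a difference of cumulative sums.

```r
# Hypothetical helper (not part of the original answer): per-row sums of
# weighted draws, using one sampling call and one cumulative sum.
sum_samples_by_row <- function(n_draws, values, probs) {
  # Draw every sample needed across all rows in a single call.
  # sample.int() on indices avoids sample()'s special case for length-1 input.
  idx <- sample.int(length(values), size = sum(n_draws),
                    replace = TRUE, prob = probs)
  draws <- values[idx]

  # Running total with a leading 0: each row's sum is the total at its
  # last draw minus the total just before its first draw.
  running <- c(0, cumsum(as.numeric(draws)))
  ends <- cumsum(n_draws)
  running[ends + 1] - running[ends - n_draws + 1]
}

set.seed(999)
sample_sum <- sum_samples_by_row(c(7, 9, 10, 13), 1:3, c(0.5, 0.3, 0.2))
```

    With the tables from the question, this could be used as `dataDT[, sample_sum := sum_samples_by_row(Num_Times, probabilityDT[["val"]], probabilityDT[["prob"]])]`. One difference from the join-based version: a row with Num_Times equal to 0 gets a sum of 0 here, whereas it would have no matching group in `res` and so would be left as NA by the join.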
    

    【Comments】:
