【Question title】: R: data.table compute weighted means of multiple variables with multiple weight variables each, by group
【Posted】: 2017-03-09 10:01:08
【Question】:

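I am still new to data.table. My question is similar to this one. The difference is that I want to compute weighted means of multiple variables by group, with each mean computed under multiple weights.

Consider the following data.table (the real one is much larger):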

library(data.table)

set.seed(123456)

mydata <- data.table(CLID = rep("CNK", 10),
                     ITNUM = rep(c("First", "Second", "First", "First", "Second"), 2),
                     SATS = rep(c("Always", "Amost always", "Sometimes", "Never", "Always"), 2),
                     ASSETS = rep(c("0-10", "11-25", "26-100", "101-200", "MORE THAN 200"), 2),
                     AVGVALUE1 = rnorm(10, 10, 2),
                     AVGVALUE2 = rnorm(10, 10, 2),
                     WGT1 = rnorm(10, 3, 1),
                     WGT2 = rnorm(10, 3, 1),
                     WGT3 = rnorm(10, 3, 1))

#I set the key of the table to the variables I want to group by,
#so the output is sorted
setkeyv(mydata, c("CLID", "ITNUM", "SATS", "ASSETS"))

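What I want to achieve is to compute the weighted means of AVGVALUE1 and AVGVALUE2 (and possibly more variables) within the groups defined by ITNUM, SATS, and ASSETS, using each of the weight variables WGT1, WGT2, and WGT3 (and possibly more). So for every variable I want to average, I will get three weighted means per group (or however many weights there are).

I can do this for each variable separately, for example: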

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.var <- "AVGVALUE1"
split.vars <- c("ITNUM", "SATS", "ASSETS")

mydata[ , Map(f = weighted.mean, x = .(get(avg.var)), w = mget(all.weights),
na.rm = TRUE), by = c(key(mydata)[1], split.vars)]

I added the first key variable to by, even though it is a constant, because I want it as a column in the output. I get:

   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

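However, in the actual data.table I have many more columns to average (and more weights to use), so going one by one would be quite tedious. What I have in mind is a function where the mean of each variable (AVGVALUE1, AVGVALUE2, etc.) is computed with each weight variable (WGT1, WGT2, WGT3, etc.), and the output for each averaged variable is added to a list. I think a list is the best option because, if all estimates were in a single output, the number of columns could become unmanageable. So something like this: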

[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

What I have tried so far:

  1. Using lapply

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars <- c("AVGVALUE1", "AVGVALUE2")
    split.vars <- c("ITNUM", "SATS", "ASSETS")
    
    lapply(mydata, function(i) {
    mydata[ , Map(f = weighted.mean, x = mget(avg.vars)[i], w = mget(all.weights),
    na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
    })
    
    Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]],  : 
     'x' and 'w' must have the same length
    
  2. Using mapply

    myfun <- function(data, spl.v, avg.v, wgts) {
      data[ , Map(f = weighted.mean, x = mget(avg.v), w = mget(all.weights),
      na.rm = TRUE), by = c(key(data)[1], spl.v)]
    }
    
    mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars,
    wgts = all.weights)
    
    Error: value for ‘AVGVALUE2’ not found
    

I tried wrapping mget(avg.v) in a list, .(mget(avg.v)), but then I got another error:

 Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) : 
  could not find function "." 

Can someone help?

【Discussion】:

    Tags: r list data.table weighted-average


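    【Solution 1】:

    We can use outer (which applies a function to all combinations of values from two input vectors) together with a vectorized weighted-mean function. By defining the function used by outer inside the data.table scope, we let get evaluate data.table columns: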

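    # avg.var must hold every column to average, e.g. c("AVGVALUE1", "AVGVALUE2"),
    # to reproduce the output shown further down
    wmeans = mydata[, {
      f  = function(X,Y) weighted.mean(get(X), get(Y));
      vf = Vectorize(f);
      outer(avg.var, all.weights, vf)},
      by = split.vars]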
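
To see what the outer/Vectorize step does in isolation, here is a minimal base-R sketch on toy vectors. The list `env` and its contents are made up for illustration and stand in for the data.table columns that get() would look up:

```r
# Toy data (hypothetical, not from the question)
env <- list(x1 = c(1, 2, 3), x2 = c(10, 20, 30),
            w1 = c(1, 1, 1), w2 = c(0, 0, 1))

# Works on one value name and one weight name at a time
f <- function(X, Y) weighted.mean(env[[X]], env[[Y]])

# outer() calls its function once with full-length vectors of names,
# so the scalar function has to be vectorised first
vf <- Vectorize(f)

res <- outer(c("x1", "x2"), c("w1", "w2"), vf)
dimnames(res) <- list(c("x1", "x2"), c("w1", "w2"))
res
```

Rows correspond to the value names and columns to the weight names. When such a matrix is returned from j, it is flattened column by column, which is why the first name varies fastest in the long output.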
    

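    The result, wmeans, holds all the means in a single column (i.e. "long" format). We can also add columns that identify which value/weight combination each row refers to: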

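    # rep() makes the recycling across groups explicit; recent data.table
    # versions refuse to recycle a shorter right-hand side implicitly
    wmeans[, mean.v := rep(expand.grid(avg.var, all.weights)[,1], length.out = .N)]
    wmeans[, mean.w := rep(expand.grid(avg.var, all.weights)[,2], length.out = .N)]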
    head(wmeans)
    #    ITNUM   SATS ASSETS        V1    mean.v mean.w
    # 1: First Always   0-10 11.668243 AVGVALUE1   WGT1
    # 2: First Always   0-10  9.132899 AVGVALUE2   WGT1
    # 3: First Always   0-10 11.668192 AVGVALUE1   WGT2
    # 4: First Always   0-10  9.060045 AVGVALUE2   WGT2
    # 5: First Always   0-10 11.668287 AVGVALUE1   WGT3
    # 6: First Always   0-10  9.197005 AVGVALUE2   WGT3
    

    We can use dcast to reshape this into a data.table that is long in avg.var but wide in all.weights:

    wide.wmeans = dcast(wmeans, mean.v+ITNUM+SATS+ASSETS ~ mean.w, value.var = "V1")  
    #       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
    # 1: AVGVALUE1  First       Always          0-10 11.668243 11.668192 11.668287
    # 2: AVGVALUE1  First        Never       101-200 11.373780 12.210083 11.601819
    # 3: AVGVALUE1  First    Sometimes        26-100 12.430039 13.134499 12.013299
    # 4: AVGVALUE1 Second       Always MORE THAN 200 12.322651 11.816135 12.567860
    # 5: AVGVALUE1 Second Amost always         11-25 10.765557 11.346688 10.524583
    # 6: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
    # 7: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
    # 8: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
    # 9: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
    #10: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
    

    If you need this as a list rather than a data.table, you can split it up using

    lapply(avg.var, function(x) wide.wmeans[mean.v == x])
    # [[1]]
    #       mean.v  ITNUM         SATS        ASSETS     WGT1     WGT2     WGT3
    # 1: AVGVALUE1  First       Always          0-10 11.66824 11.66819 11.66829
    # 2: AVGVALUE1  First        Never       101-200 11.37378 12.21008 11.60182
    # 3: AVGVALUE1  First    Sometimes        26-100 12.43004 13.13450 12.01330
    # 4: AVGVALUE1 Second       Always MORE THAN 200 12.32265 11.81613 12.56786
    # 5: AVGVALUE1 Second Amost always         11-25 10.76556 11.34669 10.52458
    # 
    # [[2]]
    #       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
    # 1: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
    # 2: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
    # 3: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
    # 4: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
    # 5: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
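
As a possible alternative to the lapply call, data.table's split method accepts a by argument and returns a named list. The sketch below uses a small made-up table (`toy`) in place of wide.wmeans, so the values are illustrative only:

```r
library(data.table)

# Hypothetical stand-in for wide.wmeans
toy <- data.table(mean.v = c("AVGVALUE1", "AVGVALUE1", "AVGVALUE2"),
                  ITNUM  = c("First", "Second", "First"),
                  WGT1   = c(1.5, 2.5, 3.5))

# One list element per distinct value of mean.v, named after that value
lst <- split(toy, by = "mean.v")
names(lst)
```

The named elements make it possible to write lst$AVGVALUE1 instead of relying on list position.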
    

    【Discussion】:

      【Solution 2】:

      I. lapply solution

      all.weights <- c("WGT1", "WGT2", "WGT3")
      avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
      split.vars  <- c("ITNUM", "SATS", "ASSETS")
      
      myfun <- function(avg.vars){
        tmp <-
          mydata[ , Map(f = weighted.mean, 
                      x = .(get(avg.vars)), 
                      w = mget(all.weights),
                      na.rm = TRUE), 
                by = c(key(mydata)[1], split.vars)]  
      
        return(tmp) # totally optional, a habit from using C and Java
      }
      
      lapply(avg.vars, myfun)
      

      Pros:

      • Uses *apply
      • Avoids loops
      • Much faster than doing it one by one

      Cons:

      • Returns a list
      [[1]]
         CLID  ITNUM         SATS        ASSETS       V1       V2       V3
      1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
      2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
      3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
      4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
      5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458
      
      [[2]]
         CLID  ITNUM         SATS        ASSETS        V1        V2        V3
      1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
      2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
      3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
      4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
      5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928
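
If a single table is preferred over the list, one option is to name the list elements and stack them with rbindlist, using idcol to record which variable each block belongs to. With the code above that would be something like rbindlist(setNames(lapply(avg.vars, myfun), avg.vars), idcol = "variable"). The sketch below shows the idea on a small made-up list (`res`) rather than the real output:

```r
library(data.table)

# Hypothetical stand-in for the list returned by lapply(avg.vars, myfun)
res <- list(
  AVGVALUE1 = data.table(ITNUM = c("First", "Second"), V1 = c(11.7, 12.3)),
  AVGVALUE2 = data.table(ITNUM = c("First", "Second"), V1 = c(9.1, 11.7))
)

# idcol adds a column holding the list names, so each row keeps its origin
combined <- rbindlist(res, idcol = "variable")
combined
```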
      

      II. for loop solution

      An example using a simple for loop, with avg.vars holding 2 values:

      all.weights <- c("WGT1", "WGT2", "WGT3")
      avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
      split.vars  <- c("ITNUM", "SATS", "ASSETS")
      
      result <- data.frame(matrix(nrow=0,ncol=7))
      for(i in avg.vars){
        tmp <- 
          mydata[ , Map(f = weighted.mean, 
                      x = .(get(i)), 
                      w = mget(all.weights),
                      na.rm = TRUE), 
                by = c(key(mydata)[1], split.vars)]  
      
        result <- rbind(result,tmp,use.names=F)
      }
      colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")
      result
      
          CLID  ITNUM         SATS        ASSETS        V1        V2        V3
       1:  CNK  First       Always          0-10 11.668243 11.668192 11.668287
       2:  CNK  First        Never       101-200 11.373780 12.210083 11.601819
       3:  CNK  First    Sometimes        26-100 12.430039 13.134499 12.013299
       4:  CNK Second       Always MORE THAN 200 12.322651 11.816135 12.567860
       5:  CNK Second Amost always         11-25 10.765557 11.346688 10.524583
       6:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
       7:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
       8:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
       9:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
      10:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928
      

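      Pros:

      • Finishes instantly on the example
      • Scales to any number of columns without extra data manipulation/coding
      • Saves a lot of time compared with doing it one by one
      • Returns a nice data.table
      • If you really want a list, you can get one by initializing return as a list (return <- list()), creating a counter variable (n <- 1), then replacing the rbind statement with return[[n]] <- tmp and incrementing the counter inside the loop (n <- n + 1)

      Cons:

      • If your data is very large (e.g. > 100,000 rows and dozens or more values in avg.vars), any loop, or function written with a loop, will perform poorly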
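
If repeated grouping ever becomes a concern, one thing to try is computing every value/weight pair inside a single grouped call, so the data is split into groups only once. This is an unbenchmarked sketch on made-up data; `dt`, `g`, `combos`, and the `value_weight` column naming are assumptions for illustration, not part of the question:

```r
library(data.table)

set.seed(1)
# Hypothetical toy data with one grouping column g
dt <- data.table(g = rep(c("a", "b"), each = 5),
                 AVGVALUE1 = rnorm(10, 10, 2),
                 AVGVALUE2 = rnorm(10, 10, 2),
                 WGT1 = runif(10, 1, 3),
                 WGT2 = runif(10, 1, 3))
avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
all.weights <- c("WGT1", "WGT2")

# Every value/weight pair, one row per pair
combos <- expand.grid(v = avg.vars, w = all.weights, stringsAsFactors = FALSE)

# One pass over the groups; j returns a named list, one column per pair
wide <- dt[, {
  vals <- lapply(seq_len(nrow(combos)), function(k)
    weighted.mean(.SD[[combos$v[k]]], .SD[[combos$w[k]]], na.rm = TRUE))
  setNames(vals, paste(combos$v, combos$w, sep = "_"))
}, by = g, .SDcols = c(avg.vars, all.weights)]
wide
```

The result is wide, with one column per pair, so it suits a moderate number of pairs. Whether it actually beats the loop on large data would need to be measured.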

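      【Discussion】:

      • Thanks, but I found a problem with both the lapply (which I prefer) and the for loop solutions. If I add one more column to average (say CRMVAR = rnorm(10, 10, 2)) to mydata and then add it to avg.vars (avg.vars <- c("AVGVALUE1", "AVGVALUE2", "CRMVAR")), the function returns a list with 3 components, as desired. But the values in the first 2 components differ from the output above. So the output seems to depend on how many columns you are averaging. It looks to me as if lapply messes something up internally in this case. How can this be fixed?
      • @panman That is odd. Could you update the question with the new example and the expected output so I can reproduce and fix the problem?
      • Oh, sorry, this was entirely my fault. I added the new variable (CRMVAR) to mydt using the original syntax from the beginning of the post, and although I used the same seed, the values of the other variables changed (I am using R 3.3.1 on Linux), yet I was comparing these values with those in the example output I had already posted. Everything is fine, sorry for the confusion.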