Passing a character vector as arguments to a function in plyr: answers

【Question title】: Passing a character vector as arguments to a function in plyr
【Posted】: 2013-02-12 18:50:42
【Question】:

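I suspect I'm going about this the wrong way, but I want to pass a character vector as arguments to a function inside ddply. There are plenty of Q&As about removing quotes and so on, but none of them seem to work for me (e.g. Remove quotes from a character vector in R and http://r.789695.n4.nabble.com/Pass-character-vector-to-function-argument-td3045226.html).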

library(plyr)  # provides ddply() and .()

# reproducible data
df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))

#make a list
list.1<-list(df1=df1,df2=df2,df3=df3)

# desired output
lapply(list.1, function(x)   ddply(x, .(d), function(x)  data.frame(am=mean(x$a), bm=mean(x$b), cm=mean(x$c))))

$df1
  d       am       bm       cm
1 a 31.00000 29.25000 18.50000
2 b 31.66667 24.33333 34.66667
3 c 18.50000  5.50000 24.50000
4 d 36.00000 39.00000 43.00000

$df2
  d       am       bm cm
1 e 18.25000 32.50000 18
2 f 27.66667 41.33333 24
3 g 25.00000  7.50000 42

$df3
  d       am       bm       cm
1 h 36.00000 25.00000 20.50000
2 i 25.33333 37.33333 24.33333
3 j 32.00000 32.00000 46.00000

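But my actual use case has many new columns and different kinds of calculations that I want to do inside the ddply function. So I'd like to do something like this: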

# here's a simple version of a function that I want to send to ddply    
func <- "am=mean(x$a), bm=mean(x$b), cm=mean(x$c)"

# here's how I imagine it might work
lapply(list.1, function(x)   ddply(x, .(d), function(x)  data.frame(func)) )

# not the desired outcome... 
$df1
  d                                     func
1 a am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 b am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 c am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
4 d am=mean(x$a), bm=mean(x$b), cm=mean(x$c)

$df2
  d                                     func
1 e am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 f am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 g am=mean(x$a), bm=mean(x$b), cm=mean(x$c)

$df3
  d                                     func
1 h am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 i am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 j am=mean(x$a), bm=mean(x$b), cm=mean(x$c)

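I have tried noquote, deparse, eval(as.symbol()), do.call(data.frame, ...), and some of the methods from https://github.com/hadley/devtools/wiki/Evaluation on func, to no avail. At this point the solution may be obvious (i.e. melt everything!), but if not, here's a longer example that is closer to my use case: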

# sample data
s <- 23 # number of samples
r <- 10 # number of runs per sample
el <- 17 # number of elements
mydata <- data.frame(ID = unlist(lapply(LETTERS[1:s], function(x) rep(x, r))),
                     run = rep(1:r, s))
# insert fake element data
mydata[letters[1:el]] <- lapply(1:el, function(i) rnorm(s*r, runif(1)*i^2))

# generate all combinations of 5 runs from  ten runs
su <- 5 # number of runs to sample from ten runs
idx <- combn(unique(mydata$run), su)

# RSE function
RSE <- function(x) {100*( (sd(x)/sqrt(length(x)))/mean(x) )}

# make a list of dfs for all samples for each combination of five runs
# to prepare to calculate RSEs
combys1 <- lapply(1:ncol(idx), function(i) mydata[mydata$run %in% idx[,i],] )

# make a list of dfs with RSE for each ID, for each combination of runs
combys2 <- lapply(1:length(combys1), function(i) ddply(combys1[[i]], "ID", summarise, RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c)))

I want to replace RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c) in the last line above with the object doRSE defined here, to avoid a lot of typing:

# prepare to calculate new columns with RSE and means
RSEs <- sapply(3:ncol(mydata), function(j) paste0("RSE",names(mydata[j])))
RSExs <- sapply(3:ncol(mydata), function(j) paste0("RSE(",names(mydata[j]),")"))
doRSE <- paste0(sapply(1:length(RSEs), function(x) paste0(RSEs[x],"=",RSExs[x])), collapse=",")
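
Since paste0 is vectorised, the same string can probably be built in a single call. The following is a minimal sketch, not part of the original question; small_data is a hypothetical stand-in for mydata, and it assumes the element columns start at column 3 as they do above.

```r
# Hypothetical stand-in for mydata: two ID columns, then element columns
small_data <- data.frame(ID = c("A", "B"), run = 1:2,
                         a = c(1.0, 2.0), b = c(3.0, 4.0), c = c(5.0, 6.0))

# Names of the element columns (everything after the first two columns)
cols <- names(small_data)[-(1:2)]

# paste0 recycles over cols, then collapse joins the pieces with commas
doRSE_short <- paste0("RSE", cols, "=RSE(", cols, ")", collapse = ",")

# The sapply-based construction from the question, for comparison
RSEs <- sapply(3:ncol(small_data), function(j) paste0("RSE", names(small_data[j])))
RSExs <- sapply(3:ncol(small_data), function(j) paste0("RSE(", names(small_data[j]), ")"))
doRSE_long <- paste0(sapply(seq_along(RSEs), function(x) paste0(RSEs[x], "=", RSExs[x])),
                     collapse = ",")

print(doRSE_short)
```

Both constructions should give the same comma-separated string, so the shorter one could be dropped in wherever doRSE is used.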

I'm open to solutions involving base, data.table, and dirty tricks. These seem close to what I want, but I can't quite translate them to my problem: Pass character argument and evaluate, Force evaluation of multiple variables using vector of character, Using a vector of characters that correspond to an expression as an argument to a function

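UPDATE: here's the catch. I want to be able to modify func in the simple example (or doRSE in my use case) to create a bunch of new columns from various calculations on the existing columns, in order to explore the data. I want a workflow that allows the resulting data frame to have new columns that were not in the original data frame. Sorry that this wasn't clearer in the original question. I can't see how to adapt @Marius's answer to do that, but @mnel's was helpful (see the update below).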

With @mnel's excellent dirty trick and a few small fixes, I can get the desired result in my use case:

# @mnel's solution, adapted (no period before eval)
combys2 <- lapply(combys1, function(x) do.call(ddply,c(.data = quote(x), 
                           .variables = quote(.(ID)), .fun = quote(summarize),
                           eval(parse(text = sprintf('.(%s)', doRSE ))))))
head(combys2)

[[1]]
   ID       RSEa      RSEb     RSEc      RSEd     RSEe      RSEf     RSEg      RSEh      RSEi
1   A  168.30658  21.68632 5.657228  5.048057 4.162017 2.9581874 1.849009 0.6925148 0.4393491
2   B   26.55071  26.20427 4.782578  4.385409 2.342764 2.1813874 2.719625 1.1576681 0.6427935
3   C   73.83165  14.47216 8.154435  6.273202 3.046978 1.2179457 2.811405 1.1401837 0.8167067
4   D   31.96170  57.89260 9.438220  7.388410 3.755772 0.8601780 3.724875 0.8358204 0.9939387
5   E   63.22537  60.35532 5.839690 11.691304 3.828430 0.9217787 4.204300 0.8217187 0.7876634
6   F   56.37635  65.37907 4.149568  5.496308 2.227544 2.1548455 2.847291 1.1956212 0.2506518
7   G   69.32232  23.63214 4.255847  7.979225 4.917660 1.6185960 3.156521 0.3265555 0.8133279
8   H   29.82015  40.74184 7.372100  7.464792 2.749862 0.6054420 4.061368 0.9973909 1.3807720
9   I   50.58114  19.53732 2.989920  9.767678 4.000249 1.7451322 1.175397 0.9952093 0.9095086
10  J   92.96462  39.77475 6.140688 10.295668 3.407726 2.4663758 3.030444 0.5743419 0.9296482
11  K   90.72381  42.25092 2.483069  6.781054 3.142082 1.8080633 2.891740 1.1996176 0.8525290
12  L -385.24547  40.81267 4.506087  8.148382 2.976488 0.8304432 2.234134 0.2108664 0.4979777
13  M   22.77743  33.98332 2.913926  8.764639 2.307293 0.8366635 3.229944 1.0003125 0.3878567
14  N   66.75163  34.16087 6.611326 13.865377 1.285522 1.3863958 4.165575 0.7379386 0.4515194
15  O   37.37188 100.57479 5.738877  5.724862 2.839638 1.1366610 3.186332 0.7383855 0.3954544
16  P   17.08913  26.62210 6.060130  4.110893 2.688908 2.6970727 1.609043 1.3860834 0.8780010
17  Q   13.96392  74.92279 5.469304  8.467638 2.974131 1.2135436 3.284564 0.6232778 1.0759226
18  R   42.59899  30.75952 4.842832  8.764158 1.874020 1.5791048 3.427342 1.4479638 0.2964455
19  S   26.03307  15.56352 6.968717  7.783876 4.439733 2.0764179 4.683080 0.7459654 1.1268772
20  T   71.57945  33.81362 7.147049 11.201551 2.128315 2.2051611 2.419805 0.2688807 1.1559635
21  U   73.93002  11.77155 7.738910  7.207041 1.478491 1.4409844 4.042419 0.5883490 0.5585716
22  V   67.93166  39.54994 5.701551  8.636122 2.472963 1.6514199 2.627965 1.0359048 0.8747136
23  W   11.23057  12.51272 7.003448  7.424559 4.102693 0.6614847 2.246305 1.3422405 0.2665246
        RSEj      RSEk      RSEl      RSEm      RSEn      RSEo      RSEp      RSEq
1  0.6366733 0.3713819 2.1993487 0.3865293 0.5436581 0.9187585 0.4344699 0.8915868
2  0.3445095 0.2932025 1.8563179 0.5397595 1.0433388 0.3533622 0.1942316 0.1941072
3  0.2720344 0.5507595 2.0305726 0.4377259 0.8589854 0.5690906 0.1397337 0.4043247
4  0.6606667 0.6769112 3.4737352 0.5674656 1.2519256 0.8718298 0.1162969 0.8287504
5  0.4620774 0.5598069 1.9236112 0.7990046 0.9832732 0.6847352 0.4070675 0.9005185
6  0.7981610 0.4005493 0.9721068 0.2770989 1.7054674 0.3110139 0.4521183 0.8740444
7  0.3969116 0.4717575 4.1341106 0.7510628 0.9998299 0.5342292 0.4319642 1.1861705
8  0.2963956 0.2652221 0.4775827 0.2617120 0.8261874 0.5266087 0.1900943 0.2350553
9  0.2609359 0.5431035 2.6478440 0.1606919 0.7407281 0.6802262 0.1802069 0.7438792
10 0.4239787 0.8753544 3.4218030 0.5467869 0.7404017 0.5581173 0.3682014 0.6361436
11 0.4188502 0.8629862 4.4181479 0.1623873 0.8018811 0.5873609 0.3592134 0.5357984
12 0.5790265 0.5009210 3.7534287 0.1933726 0.5809601 0.5777868 0.3400925 0.4783890
13 0.3562582 0.2552756 2.1393219 0.1849345 0.5796194 0.6129469 0.3363311 0.4382125
14 0.7921502 0.6147990 2.9054634 0.5852325 1.4954072 0.9983203 0.2937837 0.7654504
15 0.5840424 0.2757707 1.5695675 0.3305385 0.8712636 0.5816490 0.1985457 0.7213289
16 0.3301280 0.3008273 2.9014987 0.4540833 0.5966479 0.9042004 0.1631630 0.7262141
17 0.5882511 0.2820978 3.0652666 0.4518936 1.3168151 0.4749311 0.2244693 0.6583083
18 0.4048816 0.3708787 3.2207478 0.2603412 1.3168318 0.3318745 0.3120436 0.6210711
19 0.4425123 0.3602076 3.7609863 0.5399527 0.8302572 0.3246904 0.1952143 0.2915325
20 0.5877835 0.6339015 1.6908570 0.3223056 0.5239339 0.6607198 0.2808094 0.3697380
21 0.4454056 0.7733354 4.3433420 0.4391075 0.5503594 0.5893406 0.2262403 0.2361512
22 0.9583940 0.6365843 3.0033951 0.6507968 0.8610046 0.6363198 0.2866719 0.5736855
23 0.4969730 0.3895182 2.0021608 0.3354475 1.4398250 0.7386870 0.2458906 0.3414804
...
...

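【Comments on the question】:

I don't follow. Why not just write a function that computes all these new columns and use it in ddply?
Can you show me what that function would look like?
Wait, what? If you don't know what the function would look like, i.e. what it would do, how could I?
stackoverflow.com/q/14721592/1385941 may be of interest
@hadley, thanks for stopping by. I want ddply to return new columns produced by a set of colwise operations, for example RSE, mean, RSD and other custom functions on all numeric columns. Can colwise accept a list of functions?

Tags: r function vector plyr argument-passing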

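【Solution 1】:

You can do some ugly computing on the language using quote and plyr::.

Reading https://github.com/hadley/devtools/wiki/Computing-on-the-language may help you decide whether you really want to do this.

Anyway, here is one approach you could use.

Create your list of arguments using .(), relying on how summarise works, e.g.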

.(am=mean(a), bm=mean(b), cm=mean(c))

If you really want to use a string:

foo<- "am=mean(a), bm=mean(b), cm=mean(c)"
eval(parse(text = sprintf('.(%s)', foo )))

Then make heavy use of quote to create the list that is passed to do.call.

For example:

lapply(list.1, function(x) do.call(ddply,c(.data = quote(x), 
    .variables = quote(.(d)), .fun = quote(summarize),
      .(am=mean(a), bm=mean(b), cm=mean(c)))))

Oh boy, that is ugly.
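
To see what the eval(parse(text = ...)) step produces, here is a small sketch that avoids plyr altogether. It swaps in base R's alist() as a stand-in for plyr::. (both keep their arguments unevaluated, although plyr's version returns an object of its own class), and piece is a hypothetical data frame playing the role of one group.

```r
# A string holding the arguments, as in the answer above
foo <- "am=mean(a), bm=mean(b), cm=mean(c)"

# alist() keeps its arguments unevaluated; it is used here
# only as a base R stand-in for plyr::.
args <- eval(parse(text = sprintf("alist(%s)", foo)))

print(names(args))     # the names that will become new columns
print(class(args$am))  # each element is an unevaluated call

# Hypothetical data standing in for one group passed by ddply
piece <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# Evaluate every stored call with the columns of piece in scope
res <- as.data.frame(lapply(args, eval, envir = piece))
print(res)
```

The string only becomes code once parse() has turned it into an expression, which is why passing the raw string to data.frame() simply stores the text.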

Alternatively, you could use data.tables:

library(data.table)


listDT <- lapply(list.1, data.table)


lapply(listDT, function(x) x[,lapply(.SD, mean), by = 'd'])

or

mystuff <- sprintf('list(%s)', foo)
lapply(listDT, function(x) x[, eval(parse(text = mystuff)), by = 'd'])

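However, if all of your data.tables have the same columns, it will be more efficient to create one large data.table (with an identifier for each element of the list) and work on that.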
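
One way the single-table idea might look is sketched below. It is not from the original answer: it assumes a data.table version whose rbindlist() supports the idcol argument, uses two tiny hypothetical data frames in place of list.1, and the column name "source" is made up for illustration.

```r
library(data.table)

# Tiny hypothetical stand-ins for the data frames in list.1
df1 <- data.frame(a = c(1, 3), b = c(2, 4), d = c("x", "x"))
df2 <- data.frame(a = c(5, 7, 9), b = c(6, 8, 10), d = c("y", "y", "z"))
small_list <- list(df1 = df1, df2 = df2)

# Stack the list into one data.table; idcol records which
# list element each row came from ("source" is an assumed name)
big <- rbindlist(small_list, idcol = "source")

# Same string-to-expression trick as above, grouped by source and d
foo <- "am=mean(a), bm=mean(b)"
mystuff <- sprintf("list(%s)", foo)
res <- big[, eval(parse(text = mystuff)), by = .(source, d)]
print(res)
```

Grouping by both the identifier and d keeps the per-data-frame results apart while doing the work in a single call.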

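【Discussion】:

You'd better hope Hadley doesn't see this. ;)
+1 for the quick and dirty trick, thanks! That was very helpful in getting me on my way. There seem to be a few typos here, and I couldn't get the data.table bits to work.
@joran I did, and I killed a kitten.

【Solution 2】:

Here is a ddply call that computes the mean of every column in the data frame other than d: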

lapply(list.1,
       function(x) {
         ddply(
           x,
           .(d),
           function(df_part) {
             result_df <- data.frame(d=df_part$d[1])
             non_d_cols <- colnames(df_part)[! colnames(df_part) == "d"]
             for (col in non_d_cols) {
               col_mean <- mean(df_part[[col]])
               col_name <- paste0(col, "_mean")
               result_df[[col_name]] <- col_mean
             }
             return(result_df)
           })
       })

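To my mind this is the simplest approach, and it should generalise well to other calculations you might want to perform on those columns. You could perhaps pass in a character vector of the columns you want means for, and use it in place of non_d_cols.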
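
The suggestion of passing a character vector could be sketched roughly as follows. This is an illustration under assumptions rather than the answerer's code: summarise_cols() is a hypothetical helper (not part of plyr), it takes the column names plus a named list of functions, and the example data frame is invented.

```r
library(plyr)

# Hypothetical helper, not part of plyr: applies every function in
# funs to every column named in cols and returns a one-row data frame
summarise_cols <- function(df_part, cols, funs) {
  out <- list()
  for (col in cols) {
    for (fn in names(funs)) {
      out[[paste0(fn, "_", col)]] <- funs[[fn]](df_part[[col]])
    }
  }
  as.data.frame(out)
}

# RSE as defined in the question
RSE <- function(x) 100 * ((sd(x) / sqrt(length(x))) / mean(x))

# Small invented example data
df <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 50),
                 d = c("x", "x", "y", "y"))

# Extra arguments after the function are passed on to it by ddply
res <- ddply(df, "d", summarise_cols,
             cols = c("a", "b"), funs = list(mean = mean, RSE = RSE))
print(res)
```

Here the column names stay ordinary character data and the functions are passed as objects, so nothing needs to be parsed or evaluated from a string.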

【Discussion】:

Thanks, that's interesting and may come in handy.