【Question title】: Splitting df by factors, applying a function, and returning combined df in r
【Posted】: 2014-01-15 19:49:25
【Question description】:

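I know there are many questions on this topic, but I could not solve my problem by looking through the various answers. I have a df; an extract is below: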

ID = as.factor(c("1","1","1","1","1",
                 "2","2","2",
                 "3","3","3","3",
                 "4","4","4","4","4"))
AdDate = c("2010-03-04", "2010-04-05", "2011-01-23", "2011-03-20", "2012-07-08",
           "2010-12-02", "2011-05-17", "2011-09-11",
           "2010-04-11", "2010-05-15", "2011-02-22", "2011-09-23",
           "2009-10-04", "2010-02-15", "2010-08-17", "2011-06-20", "2012-04-08")
OpofInterest = c("FALSE", "FALSE", "TRUE", "FALSE", "FALSE",
                 "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE", "FALSE")
df = data.frame(ID, AdDate, OpofInterest)

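What I then want to do is split the df by ID into multiple data frames (4 in this example), and then apply the function below to label each of the other episodes (one per row) as before (Pre-op), the same as (Per-op), or after (Post-op) the operation of interest, based on AdDate for each individual (ID). I am new to R and to programming, and produced the function below. In reality I have thousands of IDs and episodes and about 80 columns, so I cannot subset each ID by hand and apply the function, which itself only started working after some tweaking.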

prepostassignment <- function (df) {

df_OpofInterest = subset(df,(df["OpofInterest"] == "TRUE"))  

for (i in 1:nrow(df)) {

if (df$AdDate[i] < df_OpofInterest$AdDate) {
    df$Pre_Post_Assignment[i] = "Pre"

} else if (df$AdDate[i] == df_OpofInterest$AdDate) {
  df$Pre_Post_Assignment[i] = "Per"

} else if (df$AdDate[i] > df_OpofInterest$AdDate) { 
  df$Pre_Post_Assignment[i] = "Post"

  }
 }
}

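I have played with by, tapply, aggregate, and ddply, but cannot come up with a solution. When I run the function on a manual subset, I also get the following error message:

missing value where TRUE/FALSE needed

I have also read this, but do not understand where my particular code goes wrong.

My desired result is as follows: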

ID = as.factor(c("1","1","1","1","1",
                 "2","2","2",
                 "3","3","3","3",
                 "4","4","4","4","4"))
AdDate = c("2010-03-04", "2010-04-05", "2011-01-23", "2011-03-20", "2012-07-08",
           "2010-12-02", "2011-05-17", "2011-09-11",
           "2010-04-11", "2010-05-15", "2011-02-22", "2011-09-23",
           "2009-10-04", "2010-02-15", "2010-08-17", "2011-06-20", "2012-04-08")
OpofInterest = c("FALSE", "FALSE", "TRUE", "FALSE", "FALSE",
                 "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE", "FALSE")
Pre_Post_Assignment = c("Pre", "Pre", "Per", "Post", "Post",
                        "Pre", "Per", "Post",
                        "Pre", "Pre", "Per", "Post",
                        "Pre", "Pre", "Per", "Post", "Post")
df_new = data.frame(ID, AdDate, OpofInterest, Pre_Post_Assignment)

Any help would be much appreciated.

Thanks.

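【Question discussion】:

What are df_OpofInterest and df_TAVI in the second code block?
Sorry, df_TAVI should be df_OpofInterest. I am subsetting on the "operation of interest" to get the AdDate used in the function.
Why start with columns of class character (and then convert to factor)? Shouldn't ID be integer, the dates Date, and OpofInterest logical?
I thought I might need factors at some point to split the df. But yes, you are right.

Tags: r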

【Solution 1】:

This is a classic split-apply-combine analysis. Here is an option using data.table:

df = data.frame(ID, AdDate, OpofInterest, stringsAsFactors=FALSE)
df$OpofInterest <- as.logical(df$OpofInterest)
library(data.table)
dt <- data.table(df)
dt[, 
  cbind(
    .SD,
    Pre_Post_Assignment=
      ifelse(
         AdDate < AdDate[OpofInterest], 
         "Pre",
         ifelse(AdDate == AdDate[OpofInterest], "Per", "Post"
    ) ) ), 
  by=ID]
#     ID     AdDate OpofInterest Pre_Post_Assignment
#  1:  1 2010-03-04        FALSE                 Pre
#  2:  1 2010-04-05        FALSE                 Pre
#  3:  1 2011-01-23         TRUE                 Per
#  4:  1 2011-03-20        FALSE                Post
#  5:  1 2012-07-08        FALSE                Post
#  6:  2 2010-12-02        FALSE                 Pre
#  7:  2 2011-05-17         TRUE                 Per
#  8:  2 2011-09-11        FALSE                Post
#  9:  3 2010-04-11        FALSE                 Pre
# 10:  3 2010-05-15        FALSE                 Pre
# 11:  3 2011-02-22         TRUE                 Per
# 12:  3 2011-09-23        FALSE                Post
# 13:  4 2009-10-04        FALSE                 Pre
# 14:  4 2010-02-15        FALSE                 Pre
# 15:  4 2010-08-17         TRUE                 Per
# 16:  4 2011-06-20        FALSE                Post
# 17:  4 2012-04-08        FALSE                Post

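You could also use ddply for this. The core of the actual computation is the two nested ifelse calls. The second argument to [.data.table (j) is evaluated once per group and here returns the columns other than the split/group column (here ID). .SD is a special data.table variable that holds all of the group's columns not referenced in the by argument (here AdDate and OpofInterest). We cbind our additional vector onto .SD to create the new result with the extra column.

A couple of other points worth noting: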

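I converted the dates to strings (rather than factors) for the comparison; yyyy-mm-dd strings sort in chronological order
I converted OpofInterest to logical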
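
These two conversions are probably also what the function in the question was missing. The sketch below assumes AdDate had ended up as a factor, which was the default for character columns in data.frame() before R 4.0.0; the variable names are illustrative only.

```r
# Unordered factors do not support `<`: R warns and returns NA
f <- factor(c("2010-03-04", "2011-01-23"))
cmp_factor <- suppressWarnings(f[1] < f[2])
is.na(cmp_factor)  # TRUE; if (cmp_factor) would stop with
                   # "missing value where TRUE/FALSE needed"

# Character strings in yyyy-mm-dd form compare in chronological order
cmp_char <- "2010-03-04" < "2011-01-23"

# Converting to Date makes the intent explicit and avoids relying on string order
cmp_date <- as.Date("2010-03-04") < as.Date("2011-01-23")

# "TRUE"/"FALSE" strings become real logicals that can index a vector
flags <- as.logical(c("FALSE", "TRUE", "FALSE"))
op_date <- c("2010-03-04", "2011-01-23", "2011-03-20")[flags]
```

With real logicals, AdDate[OpofInterest] picks out the date of the operation of interest directly, which is what the nested ifelse calls compare against.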

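Finally, a disclaimer: although the type of analysis performed here is split-apply-combine, the under-the-hood implementation in data.table is not split, apply, combine, but rather subset and iterate (I note this here so that Arun does not get mad at me).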
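
For comparison, the three steps can be written out literally in base R. This is only a sketch: the helper name assign_pre_post is hypothetical, the columns are typed as suggested in the question discussion, and it assumes exactly one operation of interest per ID.

```r
# Example data from the question, with typed columns
df <- data.frame(
  ID = rep(c("1", "2", "3", "4"), times = c(5, 3, 4, 5)),
  AdDate = as.Date(c(
    "2010-03-04", "2010-04-05", "2011-01-23", "2011-03-20", "2012-07-08",
    "2010-12-02", "2011-05-17", "2011-09-11",
    "2010-04-11", "2010-05-15", "2011-02-22", "2011-09-23",
    "2009-10-04", "2010-02-15", "2010-08-17", "2011-06-20", "2012-04-08")),
  OpofInterest = seq_len(17) %in% c(3, 7, 11, 15),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label each row of one ID relative to that ID's operation date
assign_pre_post <- function(d) {
  op_date <- d$AdDate[d$OpofInterest]
  stopifnot(length(op_date) == 1)  # assumption: one operation of interest per ID
  d$Pre_Post_Assignment <- ifelse(
    d$AdDate < op_date, "Pre",
    ifelse(d$AdDate == op_date, "Per", "Post")
  )
  d  # return the modified data frame (the function in the question never returns it)
}

pieces <- split(df, df$ID)                 # split: one data frame per ID
pieces <- lapply(pieces, assign_pre_post)  # apply: label each piece
df_new <- do.call(rbind, pieces)           # combine: stack the pieces again
rownames(df_new) <- NULL
```

All other columns travel along with each piece, so the same pattern should carry over to a data frame with 80 or so columns without listing them.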

Edit: here is @BlueMagister's suggestion:

dt[, 
  Pre_Post_Assignment:=
    ifelse(
      AdDate < AdDate[OpofInterest], 
      "Pre",
      ifelse(AdDate == AdDate[OpofInterest], "Per", "Post")
    ),
   by=ID
]

I think it is cleaner, and quite likely faster too.
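
Both versions appear to rely on every ID having exactly one row where OpofInterest is TRUE. As a hedged alternative, the base R sketch below looks up each ID's operation date by name and compares against it in one vectorised step, so an ID without an operation of interest gets NA. The small data frame is made up for illustration, and the sketch assumes at most one operation of interest per ID and dates stored as yyyy-mm-dd strings.

```r
# Made-up example: ID "3" has no operation of interest
df <- data.frame(
  ID = c("1", "1", "1", "2", "2", "3"),
  AdDate = c("2010-03-04", "2011-01-23", "2011-03-20",
             "2010-12-02", "2011-05-17", "2010-04-11"),
  OpofInterest = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Named lookup vector: operation date keyed by ID
op_dates <- setNames(df$AdDate[df$OpofInterest], df$ID[df$OpofInterest])

# Reference date for every row; IDs missing from the lookup yield NA
ref <- unname(op_dates[df$ID])

# yyyy-mm-dd strings compare chronologically, so plain string comparison is enough here
df$Pre_Post_Assignment <- ifelse(
  df$AdDate < ref, "Pre",
  ifelse(df$AdDate == ref, "Per", "Post")
)
```

Rows left as NA can then be inspected separately, rather than the whole grouped call failing or silently recycling.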

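【Discussion】:

Thanks for your help and suggestions, I appreciate it. What would you suggest doing when you have 80+ columns in the "list(cols... = ifelse" part of the code, for example? A named vector of column names?
@sgurwin, I modified the answer to address your question. Take a look at the .SD stuff.
Wouldn't defining a new column be better than cbind-ing onto .SD?
@BlueMagister, you mean :=? Then yes, probably. Hadn't thought of that.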