Data Table Solution To New Variables By Group: Answers

【Question title】: Data Table Solution To New Variables By Group
【Posted】: 2020-02-17 22:08:00
【Question】:

library(data.table)
library(dplyr)

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
                    "score"=c(0,8,8,7,9,4,9,2,7),
                    "time"=c(1,2,3,1,2,3,1,2,3))


dataWANT=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
                    "score"=c(0,8,8,7,9,4,9,2,7),
                    "time"=c(1,2,3,1,2,3,1,2,3),
                    "score3"=c(1,1,1,0,0,0,1,1,1),
                    "timescore3"=c(1,1,1,3,3,3,2,2,2),
                    "score7"=c(1,1,1,1,1,1,1,1,1),
                    "timescore7"=c(1,1,1,1,1,1,2,2,2))




dataHAVE[, score3 := ifelse(score<=3,
                               time[which.min(score<=3)],
                               time[which.max(time)]), by=student]

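I have "dataHAVE" and want to produce "dataWANT".

1) score3 equals 1 if any of the student's scores is less than or equal to 3; otherwise 0.

2) score7 equals 1 if any of the student's scores is less than or equal to 7; otherwise 0.

3) timescore3 equals the smallest time value at which the student scored 3 or lower; if the student never scored 3 or lower, timescore3 is that student's maximum time.

4) timescore7 equals the smallest time value at which the student scored 7 or lower; if the student never scored 7 or lower, timescore7 is that student's maximum time.

My attempt is shown above, but it does not work. Note that I tried base R and dplyr, but the dataset is too large and they take too long. A data.table solution would be ideal.

New data with missing values: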

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                    "score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA),
                    "time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))

Updated data with missing "time" values:

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7),
                    "score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA,6,9,3,NA,NA,NA),
                    "time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,NA,2,NA,NA,NA,NA))

【Comments】:

The second condition is unclear. library(data.table); setDT(dataHAVE)[, c("score3", "score7") := .(+(any(score <=3)), +(any(score <=7))), .(student)]; dataHAVE[, c("timescore3", "timescore7") := .(min(first(time[score <=3]), time[which.min(score)]), min(first(time[score <=7]), time[which.min(score)])), .(student)]
@akrun That's great! One last question: how can I do this while ignoring missing (NA) values?
With your example I don't get any NA, but if score contains NA elements, use na.rm = TRUE in min and max.
@akrun Thank you very much. When I try this on data with missing values and add na.rm=TRUE, as in min(time[sc3], na.rm=TRUE), I get: missing value where TRUE/FALSE needed
It may be that all elements are NA for that particular group; in that case you need if(all(is.na(time[sc3]))) NA else min(time[sc3], na.rm = TRUE)
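A minimal sketch of why the error in the comments above can appear, and of the all-NA guard suggested there. The helper name `safe_min` is hypothetical and not part of the thread; the sample vectors mirror student 4 in the updated data.

```r
# any() returns NA when no element is TRUE and at least one is NA,
# and if (NA) raises "missing value where TRUE/FALSE needed".
sc3 <- c(NA, 4, 7) <= 3          # NA FALSE FALSE
any(sc3)                         # NA, so if (any(sc3)) would fail
any(sc3, na.rm = TRUE)           # FALSE

# Hypothetical helper (an assumption, not from the thread): a minimum that
# returns NA instead of Inf plus a warning when no non-missing value is left.
safe_min <- function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
}

safe_min(c(NA, 2, 3))            # smallest non-missing value
safe_min(c(NA_real_, NA_real_))  # all missing
safe_min(numeric(0))             # empty selection
```

Dropping the NAs from the logical vector before calling `any()`, as the answer below does with `& !is.na(score)`, avoids the failing `if` in the same way.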

Tags: r dplyr data.table

【Solution 1】:

Based on the logic shown, this can be done with data.table:

library(data.table)
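setDT(dataHAVE)[, c("score3", "timescore3", "score7", "timescore7") := {
  # logical flags per row: score at or below each threshold
  sc3 <- score <= 3
  sc7 <- score <= 7
  # earliest qualifying time, or the student's latest time if none qualifies
  tsc3 <- if (any(sc3)) min(time[sc3]) else max(time)
  tsc7 <- if (any(sc7)) min(time[sc7]) else max(time)
  # unary + converts the logical any() result to 0/1
  .(+(any(sc3)), tsc3, +(any(sc7)), tsc7)
}, .(student)]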
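As a quick check, the following sketch runs the same grouped assignment on the original nine-row example and compares the four new columns with the values listed in dataWANT. The names `res`, `expected`, and `got` are introduced here for illustration only, and the code works on a copy so that dataHAVE itself is not modified.

```r
library(data.table)

dataHAVE <- data.frame(student = c(1,1,1,2,2,2,3,3,3),
                       score   = c(0,8,8,7,9,4,9,2,7),
                       time    = c(1,2,3,1,2,3,1,2,3))

# as.data.table() copies, unlike setDT(), which converts in place
res <- as.data.table(dataHAVE)
res[, c("score3", "timescore3", "score7", "timescore7") := {
  sc3 <- score <= 3
  sc7 <- score <= 7
  tsc3 <- if (any(sc3)) min(time[sc3]) else max(time)
  tsc7 <- if (any(sc7)) min(time[sc7]) else max(time)
  .(+(any(sc3)), tsc3, +(any(sc7)), tsc7)
}, by = student]

# Expected values copied from dataWANT in the question
expected <- data.frame(score3     = c(1,1,1,0,0,0,1,1,1),
                       timescore3 = c(1,1,1,3,3,3,2,2,2),
                       score7     = c(1,1,1,1,1,1,1,1,1),
                       timescore7 = c(1,1,1,1,1,1,2,2,2))

# Compare as numeric matrices so integer 0/1 flags match the double values
got <- as.matrix(res[, .(score3, timescore3, score7, timescore7)])
all(got == as.matrix(expected))
```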

If there are missing values, use

setDT(dataHAVE)[, c("score3", "timescore3", "score7", "timescore7") := {
  # treat a missing score as "not at or below the threshold"
  sc3 <- score <= 3 & !is.na(score)
  sc7 <- score <= 7 & !is.na(score)
  # ignore missing times when taking the minimum or maximum
  tsc3 <- if (any(sc3)) min(time[sc3], na.rm = TRUE) else max(time, na.rm = TRUE)
  tsc7 <- if (any(sc7)) min(time[sc7], na.rm = TRUE) else max(time, na.rm = TRUE)
  .(+(any(sc3)), tsc3, +(any(sc7)), tsc7)
}, .(student)]

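【Discussion】:

I added an updated data file that reproduces the issue I raised: some students have a few NAs and one student has only NAs.
@bvowe With your updated example data, it works for me.
Thank you very much. However, for student 5 all of the new variables should be NA, because there is no data.
I think I found the problem: some students have missing data in time! I tried min((time, na.rm=TRUE)[sc7], na.rm = TRUE) as a solution, but that did not work.
Thank you very much for your help, you have been extremely helpful! I added a new example.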
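The thread ends without a version for the last example, where both score and time contain NAs. The following is only a sketch under stated assumptions, not the answerer's code: a student with no non-missing score gets NA in all four new columns (as requested for student 5), and a time value that cannot be determined because every relevant time is missing is also returned as NA. The helpers `safe_min`, `safe_max`, and `flag_and_time` are hypothetical names introduced here.

```r
library(data.table)

dataHAVE <- data.frame(
  student = rep(1:7, each = 3),
  score   = c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA,6,9,3,NA,NA,NA),
  time    = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,NA,2,NA,NA,NA,NA))

# Hypothetical helpers: NA instead of Inf/-Inf when nothing non-missing remains
safe_min <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

# Hypothetical per-threshold rule returning list(flag, time) for one student
flag_and_time <- function(score, time, cutoff) {
  # assumption: no usable score at all means both outputs are NA
  if (all(is.na(score))) return(list(NA_integer_, NA_real_))
  hit <- !is.na(score) & score <= cutoff
  list(as.integer(any(hit)),
       if (any(hit)) safe_min(time[hit]) else safe_max(time))
}

res <- as.data.table(dataHAVE)   # copy, so dataHAVE stays a plain data.frame
res[, c("score3", "timescore3", "score7", "timescore7") :=
      c(flag_and_time(score, time, 3), flag_and_time(score, time, 7)),
    by = student]

# One row per student, since the new columns are constant within a group
out <- unique(res[, .(student, score3, timescore3, score7, timescore7)])
out
```

Under these assumptions, student 6 gets a flag of 1 for both thresholds but NA for both time columns, because every qualifying score has a missing time. Whether NA or the student's only known time is preferable there is a design choice the thread does not settle.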