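【Posted】: 2015-07-23 03:19:30
【Question】:
For each student in the dataset, a particular set of scores may have been collected. We want to compute a mean for each student, using only the scores in the columns that are relevant to that student.
The columns needed for the calculation differ from row to row. I have figured out how to write this in R with the usual tools, but I am trying to rewrite it with data.table, partly for fun and partly because success on this small project may lead to running the calculation on many, many rows.
Here is a small working example of the "choose a specific set of columns for each row" problem.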
set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
                  A = sample(c(1:5, 9), 10, replace = TRUE),
                  B = sample(c(1:5, 9), 10, replace = TRUE),
                  C = sample(c(1:5, 9), 10, replace = TRUE),
                  D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
    lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
    scores <- arow[getLookup(arow["grade"])]
    mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data has thousands upon thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is each student's mean over the relevant columns, as shown below:
> result
   id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5
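So what next? So far I have mostly produced a lot of errors...
I have not found any example in the data.table documentation where the columns used in each row's calculation are themselves a variable, so I would appreciate your suggestions.
I am not asking anyone to write the code for me; I am looking for advice on how to approach this problem.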
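One possible starting point, offered as a sketch rather than a definitive answer: every student in the same grade uses the same columns, so the per-row column choice can be turned into a per-group one. The example below groups by `grade`, looks up the column names through `.BY`, and takes row means over just those columns. The data are six rows copied from the printed result in the question (ids 1, 2, 3, 6, 7, 8), and `rowMeanSkip9` is a hypothetical helper written for this illustration, not a data.table function.

```r
library(data.table)

# Six rows copied from the printed result in the question; 9 marks a missing score
DT <- data.table(
  id    = c(1L, 2L, 3L, 6L, 7L, 8L),
  grade = c(3L, 4L, 5L, 3L, 4L, 5L),
  A = c(9, 5, 1, 3, 9, 3),
  B = c(9, 4, 3, 3, 2, 9),
  C = c(1, 1, 5, 4, 9, 2),
  D = c(4, 5, 9, 3, 2, 9)
)

# Grade-to-test linkage, trimmed to the grades present in this subset
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"))

# Hypothetical helper: row means over the chosen columns,
# treating both 9 and NA as missing
rowMeanSkip9 <- function(block, cols) {
  m <- as.matrix(block)[, cols, drop = FALSE]
  m[which(m == 9)] <- NA
  rowMeans(m, na.rm = TRUE)
}

# Within each grade, .BY$grade holds that group's grade value,
# so the column names are looked up once per group
DT[, stuscores := rowMeanSkip9(.SD, lookup[[as.character(.BY$grade)]]),
   by = grade, .SDcols = c("A", "B", "C", "D")]

print(DT)
```

Because the lookup happens once per grade rather than once per row, the work inside each group is a plain matrix operation. Rows whose relevant scores are all 9 come out as `NaN`, matching the behaviour of `getMean` in the question.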
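【Comments】:
-
It is not clear to me what you are asking. Please specify your desired output, your data, and what you have tried so far.
-
You should learn to use data.tables, try it out, and then post your questions or improvements here. Don't ask people to translate your code for you.
Tags: r data.table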