Creating compound/interacted dummy variables in data.table in R: answers

[Question title]: Creating compound/interacted dummy variables in data.table in R
[Posted]: 2015-08-20 14:46:39
[Question description]:

I'm still learning the awesome data.table package. I'm working with the following data.table:

demo <- data.table(id = c(1, 2, 3, 4, 5, 6), sex = c(1, 2, 1, 2, 2, 2), agef = c(43, 53, 63, 73, 83, 103))

demo: 
id sex agef
 1  1   43
 2  2   53
 3  1   63
 4  2   73
 5  2   83
 6  2  103

I'm trying to generate new columns (age_gender bands) named ("F0_34","F35_44","F45_54","F55_59", ..., "F95_GT") and ("M0_34","M35_44","M45_54","M55_59", ..., "M95_GT"), with both the names and the values derived from the sex and agef columns. I can do it the simple way:

demo <- demo[ ,F0_34:= {ifelse((sex==2) & (agef >= 0) & (agef <= 34), 1, 0)}]

But I'm looking for an elegant solution. I tried passing the age bands as a list to lapply, as follows:

i <- list("0_34","35_44","45_54","55_59","60_64","65_69","70_74","75_79","80_84","85_89","90_94","95_GT") 

demo[, paste0("F", i) := lapply(i, function(i)lapply(.SD, function(x){
l1 <- unlist(str_split(i, "_"))
if(l1[2] == "GT") l1[2] <- 1000
l1 <- as.numeric(l1)
score <- ifelse((sex==2) & (agef >= l1[1]) & (agef <= l1[2]), 1, 0)
return(score)  
})), .SDcols = c("sex", "agef"), by = id]

demo[, paste0("M", i) := lapply(i, function(i)lapply(.SD, function(x){
l1 <- unlist(str_split(i, "_"))
if(l1[2] == "GT") l1[2] <- 1000
l1 <- as.numeric(l1)
score <- ifelse((sex==1) & (agef >= l1[1]) & (agef <= l1[2]), 1, 0)
return(score)  
})), .SDcols = c("sex", "agef"), by = id]

I get the desired output:

id  sex agef    F0_34   F35_44  F45_54  F55_59  F60_64  F65_69  F70_74  F75_79  F80_84  F85_89  F90_94  F95_GT  M0_34   M35_44  M45_54  M55_59  M60_64  M65_69  M70_74  M75_79  M80_84  M85_89  M90_94  M95_GT
1   1   43      0       0       0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0
2   2   53      0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
3   1   63      0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0
4   2   73      0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
5   2   83      0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
6   2   103     0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0

But with some warnings:

Warning messages:
1: In `[.data.table`(demographic1, , `:=`(paste0("F", i),  ... :
RHS 1 is length 2 (greater than the size (1) of group 1). The last 1    element(s) will be discarded.

I can't make sense of them. Can someone point out what I'm doing wrong?
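One reading of the warning, offered as a sketch rather than a definitive diagnosis: the inner lapply(.SD, ...) runs once per column in .SDcols, so each band produces a list of two results (one for sex, one for agef), while by = id makes every group a single row. That matches "RHS 1 is length 2 (greater than the size (1) of group 1)". Under that assumption, dropping the inner lapply and the by = id grouping should avoid the warning. The helper name band_dummy and the variables bands, f_cols and m_cols are made up for this illustration, and base strsplit stands in for str_split.

```r
library(data.table)

demo <- data.table(id = c(1, 2, 3, 4, 5, 6),
                   sex = c(1, 2, 1, 2, 2, 2),
                   agef = c(43, 53, 63, 73, 83, 103))

bands <- c("0_34", "35_44", "45_54", "55_59", "60_64", "65_69",
           "70_74", "75_79", "80_84", "85_89", "90_94", "95_GT")

# Hypothetical helper: 0/1 indicator for one sex code and one "lo_hi" band.
band_dummy <- function(band, sex_code, sex, agef) {
  bounds <- strsplit(band, "_", fixed = TRUE)[[1]]
  if (bounds[2] == "GT") bounds[2] <- "Inf"   # open-ended top band
  bounds <- as.numeric(bounds)
  as.integer(sex == sex_code & agef >= bounds[1] & agef <= bounds[2])
}

f_cols <- paste0("F", bands)
m_cols <- paste0("M", bands)

# One vectorised pass per band over whole columns: no .SD, no by = id.
demo[, (f_cols) := lapply(bands, function(b) band_dummy(b, 2, sex, agef))]
demo[, (m_cols) := lapply(bands, function(b) band_dummy(b, 1, sex, agef))]

demo
```

Because each call to band_dummy returns one vector with a value per row, every item on the right-hand side has the same length as the table, which is what := expects when there is no grouping.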

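[Comments]:

I'll run the same code for the columns starting with "M". I'll edit the code.
OP, read Hadley's take on mixing sex and age in the same column: vita.had.co.nz/papers/tidy-data.pdf
I don't think mixing categorical variables is forbidden outright, but I'll take a look too. @nsDataSci I'd suggest a different title. Your column names have nothing to do with the data in the table; they are determined by cut points you chose yourself. How about "Creating compound/interacted dummy variables in data.table"? There is a simpler version of this question here: stackoverflow.com/questions/18881073/…
@plafort: Thanks, I'll take a look... but you can't avoid this when you have secondary values that depend on the combination, such as a relative coefficient for that band.
@frank: Yes, that makes complete sense. I'll change the title to make it clearer. Thanks.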

Tags: r data.table

[Solution 1]:

Is this what you're looking for:

# Age-range cut points; cut() uses right-closed intervals (a, b] by default
age.brackets <- c(0, seq(35, 55, by = 10), seq(60, 95, by = 5), Inf)
ranges <- cut(demo$agef, age.brackets)
# Cross-tabulate age against age range: one row per distinct age, in sorted order
spread <- table(demo$agef, ranges)
# Zero out the rows that belong to the other sex (sex 1 = male, sex 2 = female)
male.spread <- (demo$sex == 1) * as.matrix(spread)
female.spread <- (demo$sex == 2) * as.matrix(spread)

newdt <- data.table(
  cbind(
    demo,
    matrix(as.vector(male.spread), ncol = ncol(male.spread)),
    matrix(as.vector(female.spread), ncol = ncol(female.spread))
  )
)

# Column names, in the same order as the cbind above: male block first, then female
male.names <- gsub('.(\\d*),(\\d*|Inf).', 'M\\1_\\2', levels(ranges))
female.names <- gsub('.(\\d*),(\\d*|Inf).', 'F\\1_\\2', levels(ranges))
names(newdt) <- c(names(demo), male.names, female.names)

newdt

#    id sex agef M0_35 M35_45 M45_55 M55_60 M60_65 M65_70 M70_75 M75_80 M80_85 M85_90
# 1:  1   1   43     0      1      0      0      0      0      0      0      0      0
# 2:  2   2   53     0      0      0      0      0      0      0      0      0      0
# 3:  3   1   63     0      0      0      0      1      0      0      0      0      0
# 4:  4   2   73     0      0      0      0      0      0      0      0      0      0
# 5:  5   2   83     0      0      0      0      0      0      0      0      0      0
# 6:  6   2  103     0      0      0      0      0      0      0      0      0      0
#    M90_95 M95_Inf F0_35 F35_45 F45_55 F55_60 F60_65 F65_70 F70_75 F75_80 F80_85 F85_90
# 1:      0       0     0      0      0      0      0      0      0      0      0      0
# 2:      0       0     0      0      1      0      0      0      0      0      0      0
# 3:      0       0     0      0      0      0      0      0      0      0      0      0
# 4:      0       0     0      0      0      0      0      0      1      0      0      0
# 5:      0       0     0      0      0      0      0      0      0      0      1      0
# 6:      0       0     0      0      0      0      0      0      0      0      0      0
#    F90_95 F95_Inf
# 1:      0       0
# 2:      0       0
# 3:      0       0
# 4:      0       0
# 5:      0       0
# 6:      0       1

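[Discussion]:

I'm not sure I follow how this solution works. It seems to rely on (1) each band containing at most one person (since table would report counts above 1) and (2) people being sorted by age (so that cbind lines the ids up correctly)...?
@Frank There can be more than one person in each age band. If someone were two ages at once, a value greater than one would show up. And the order of spread is inherited from demo$agef, which has the same order as demo, so no sorting is needed.
My problem with the solution is that it combines the male and female classifications, while the OP wants males and females in two separate tables. I've been busy with a project, so I haven't been able to update it yet.
If I could create one spread for males and one for females, I think the solution would be fine. Quick and easy. Any suggestions? @Frank
OP, I figured out a way to split the columns out for males and females. Thanks to @Frank for the idea; it took only a slight edit. Instead of demo$agef, I multiply by demo$sex.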
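The two concerns raised in this discussion (duplicate ages and row order) can be checked on a small made-up table. The sketch below is an illustration, not part of the answer: demo2 and its values are invented, and outer() is swapped in for table() so that the indicator matrix has exactly one row per person, in the original row order.

```r
age.brackets <- c(0, seq(35, 55, by = 10), seq(60, 95, by = 5), Inf)

# Hypothetical data: unsorted ages, and two people sharing the age 43.
demo2 <- data.frame(id = 1:4, sex = c(1, 2, 2, 1), agef = c(63, 43, 43, 50))
ranges <- cut(demo2$agef, age.brackets)

# table() collapses to one row per distinct age, sorted: 3 rows for 4 people.
by.age <- table(demo2$agef, ranges)

# outer() compares each person's band code with every band: 4 rows, 12 columns.
ind <- +outer(as.integer(ranges), seq_len(nlevels(ranges)), "==")
colnames(ind) <- levels(ranges)

# Keep only the rows of the matching sex (1 = male, 2 = female).
male <- (demo2$sex == 1) * ind
female <- (demo2$sex == 2) * ind

nrow(by.age)   # number of distinct ages
nrow(ind)      # number of people
```

With this layout every row of ind sums to 1 and lines up with the same row of demo2, so cbind does not depend on the data being sorted or on ages being unique.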

[Solution 2]:

This should work, and it is more data.table-y:

cut_points <- c(0, seq(35, 55, by = 10), seq(60, 95, by = 5),Inf)
new_names_m <- paste0("M", cut_points[1:12], "_", c(cut_points[2:12], "GT"))
new_names_f <- paste0("F", cut_points[1:12], "_", c(cut_points[2:12], "GT"))
demo[sex == 1, ranges := cut(agef, cut_points, include.lowest = TRUE,
                        labels = new_names_m)]
demo[sex == 2, ranges := cut(agef, cut_points, include.lowest = TRUE,
                        labels = new_names_f)]
demo[ ,(c(new_names_m, new_names_f)) :=
       lapply(c(new_names_m, new_names_f), function(x) +(ranges == x))]
demo[ , ranges := NULL]

> demo
   id sex agef M0_35 M35_45 M45_55 M55_60 M60_65 M65_70 M70_75 M75_80 M80_85 M85_90 M90_95 M95_GT F0_35 F35_45 F45_55 F55_60 F60_65
1:  1   1   43     0      1      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
2:  2   2   53     0      0      0      0      0      0      0      0      0      0      0      0     0      0      1      0      0
3:  3   1   63     0      0      0      0      1      0      0      0      0      0      0      0     0      0      0      0      0
4:  4   2   73     0      0      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
5:  5   2   83     0      0      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
6:  6   2  103     0      0      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
   F65_70 F70_75 F75_80 F80_85 F85_90 F90_95 F95_GT
1:      0      0      0      0      0      0      0
2:      0      0      0      0      0      0      0
3:      0      0      0      0      0      0      0
4:      0      1      0      0      0      0      0
5:      0      0      0      1      0      0      0
6:      0      0      0      0      0      0      1

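Alternatively, instead of the lapply in the second-to-last line, you can initialize the dummies to zero and then assign 1 at the appropriate positions: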

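new_names = c(new_names_f, new_names_m)
demo[ , (new_names) := 0L]
is = which(demo$ranges != "")   # rows that fall into some band
# Look up each row's target column by name; a fixed offset such as 3L would be
# off by one while the ranges column is still in the table
js = match(demo$ranges[is], names(demo))
for (iter in seq_along(is)) set(demo, i = is[iter], j = js[iter], value = 1L)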
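A side note on the cut() calls in this answer, given as a sketch under stated assumptions: by default cut() builds right-closed intervals (a, b], so an age of exactly 35 falls in the 0 to 35 band, whereas the bands in the question ("0_34", "35_44", ...) put 35 in the second band. If the question's bands are what is wanted, right = FALSE gives left-closed intervals [a, b). The vector boundary_ages and the label vector band_labels are made up for this illustration.

```r
cut_points <- c(0, seq(35, 55, by = 10), seq(60, 95, by = 5), Inf)

# Labels in the question's style (assumed here, not taken from the answer).
band_labels <- c("0_34", "35_44", "45_54", "55_59", "60_64", "65_69",
                 "70_74", "75_79", "80_84", "85_89", "90_94", "95_GT")

# Hypothetical ages that sit on or next to a cut point.
boundary_ages <- c(34, 35, 44, 45)

# Default: intervals are (a, b], so 35 stays in the first band.
right_closed <- cut(boundary_ages, cut_points, include.lowest = TRUE,
                    labels = band_labels)

# right = FALSE: intervals are [a, b), so 35 moves to the second band.
left_closed <- cut(boundary_ages, cut_points, right = FALSE,
                   labels = band_labels)

data.frame(age = boundary_ages, right_closed, left_closed)
```

For the six ages in the question, none of which sits on a cut point, both settings assign the same bands, so the difference only matters for ages such as 35, 45 or 60.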

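[Discussion]:

I edited in an approach that I think may be better than all the == scans that lapply requires. Also, I don't think the first part of the answer (without the "M" columns) needs to stay, since the OP will edit, or has already edited, the question in a way that makes it obsolete.
Fair enough. Mine is more readable, though it will definitely be slower.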