How to generate grouped dummy variables? Answers

【Question title】: How to generate grouped dummy variables?
【Posted】: 2013-08-16 03:46:06
【Question】:

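I am looking for a way to generate dummy variables that split a given set of categories into every possible combination of groupings. For example, with three categories (say A, B, and C), there are five possible groupings: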

Three groups: A / B / C
Two groups: A&B / C
Two groups: A&C / B
Two groups: A / B&C
One group: A&B&C

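The dummy variable for each grouping would then be written to a separate column of a data frame. The final output I want looks like the following table: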

sample_num  category    grouping1   grouping2   grouping3   grouping4   grouping5
                        A; B; C     A&B; C      A&C; B      A; B&C      A&B&C
-----------+---------+------------+-----------+-----------+-----------+----------
      1         A           1           1           1           1           1
      2         A           1           1           1           1           1
      3         A           1           1           1           1           1
      4         A           1           1           1           1           1
      5         B           2           1           2           2           1
      6         B           2           1           2           2           1
      7         B           2           1           2           2           1
      8         C           3           2           1           2           1
      9         C           3           2           1           2           1
     10         C           3           2           1           2           1
     11         C           3           2           1           2           1
     12         C           3           2           1           2           1

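【Comments】:

Your final output is unclear - what belongs to which category?
I removed all the parts asking for package recommendations, since that is one of the reasons a question may get closed. If you don't like that, you can roll back the change.
Thanks. I'm new to this site and I somehow undid your edits. Trying to bring them back.
@mnel - The numbers refer to the index of the category letter within each grouping - see my edit.
@thelatemail -- I see. Perhaps A&B should be A|B.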

Tags: r

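【Solution 1】:

The model.matrix function in the stats package (loaded by default) will construct "dummy variables", although not the kind you describe. Its first argument is an R "formula":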

> dat <- read.table(text="sample_num  category 
+       1         A      
+       2         A      
+       3         A      
+       4         A      
+       5         B      
+       6         B      
+       7         B      
+       8         C      
+       9         C      
+      10         C      
+      11         C      
+      12         C", header=TRUE)
> model.matrix( ~category, data=dat)

   (Intercept) categoryB categoryC
1            1         0         0
2            1         0         0
3            1         0         0
4            1         0         0
5            1         1         0
6            1         1         0
7            1         1         0
8            1         0         1
9            1         0         1
10           1         0         1
11           1         0         1
12           1         0         1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$category
[1] "contr.treatment"

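I (strongly) suspect that your set of four dummy columns must be linearly dependent, and that one of them will be rejected by regression functions. Other contrast arguments are possible. You should study: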

?model.matrix
?contrasts
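
The linear-dependence concern can be checked numerically. The following is a minimal sketch, not part of the original answer: it assumes the two-group splits from the question are entered as 0/1 indicator columns next to an intercept (the column names are made up for illustration) and compares the matrix rank with the number of columns.

```r
# Rebuild the example categories inline so the sketch is self-contained.
category <- factor(rep(c("A", "B", "C"), times = c(4, 3, 5)))

# Hypothetical 0/1 encoding of the three two-group splits, plus an intercept.
X <- cbind(
  intercept = 1,
  AB_vs_C   = as.numeric(category == "C"),
  AC_vs_B   = as.numeric(category == "B"),
  A_vs_BC   = as.numeric(category == "A")
)

# The three indicators add up to the intercept column, so the rank
# comes out smaller than the number of columns.
c(columns = ncol(X), rank = qr(X)$rank)
```

A regression function such as lm() would report an NA coefficient for one of these columns for the same reason.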

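Here are sum contrasts with no intercept (with the intercept removed, R codes this factor with one indicator column per level, so the columns below are plain 0/1 indicators):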

> model.matrix(~category+0, data=dat, contrasts = list(category = "contr.sum"))
   categoryA categoryB categoryC
1          1         0         0
2          1         0         0
3          1         0         0
4          1         0         0
5          0         1         0
6          0         1         0
7          0         1         0
8          0         0         1
9          0         0         1
10         0         0         1
11         0         0         1
12         0         0         1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$category
[1] "contr.sum"
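
For comparison, keeping the intercept lets the requested sum-to-zero coding actually show up in the columns. This is a small sketch that rebuilds the example data inline; the object names `dat2` and `mm_sum` are introduced here only for illustration.

```r
# Same category layout as the example above: 4 x A, 3 x B, 5 x C.
dat2 <- data.frame(category = factor(rep(c("A", "B", "C"), times = c(4, 3, 5))))

# The sum-to-zero coding matrix itself: the last level is coded -1 in every column.
contr.sum(levels(dat2$category))

# With the intercept kept, model.matrix applies that coding to the factor.
mm_sum <- model.matrix(~ category, data = dat2,
                       contrasts.arg = list(category = "contr.sum"))

# One row per distinct category is enough to see the pattern.
unique(mm_sum)
```

Rows for the last level (C) carry -1 in both contrast columns, which is what distinguishes this coding from the indicator columns shown above.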

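If you want to see the automatic calculation of the different interaction levels, you will need three variables rather than one variable with three levels: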

> dat <- expand.grid(A=letters[1:3], B=letters[4:6], C=letters[7:9])
> str(model.matrix( ~ A*B*C))
Error in str(model.matrix(~A * B * C)) : 
  error in evaluating the argument 'object' in selecting a method for function 'str': Error in model.frame.default(object, data, xlev = xlev) : 
  invalid type (closure) for variable 'C'
> str(model.matrix( ~ A*B*C, data=dat))
 num [1:27, 1:27] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:27] "1" "2" "3" "4" ...
  ..$ : chr [1:27] "(Intercept)" "Ab" "Ac" "Be" ...
 - attr(*, "assign")= int [1:27] 0 1 1 2 2 3 3 4 4 4 ...
 - attr(*, "contrasts")=List of 3
  ..$ A: chr "contr.treatment"
  ..$ B: chr "contr.treatment"
  ..$ C: chr "contr.treatment"

> model.matrix( ~ A*B*C, data=dat)

(output omitted)
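
Rather than printing all 27 columns, the "assign" attribute can be used to count how many columns each term contributes. A short sketch under the same setup as above; `dat3`, `mm_int`, and `term_labels` are names introduced here for illustration.

```r
# Three factors with three levels each, as in the example above.
dat3 <- expand.grid(A = letters[1:3], B = letters[4:6], C = letters[7:9])
mm_int <- model.matrix(~ A * B * C, data = dat3)

# Labels of the main effects and interactions generated by A*B*C.
term_labels <- attr(terms(~ A * B * C), "term.labels")

# "assign" maps each column to a term (0 is the intercept, dropped here).
col_terms <- attr(mm_int, "assign")[-1]
table(factor(term_labels[col_terms], levels = term_labels))
```

Main effects contribute two columns each, two-way interactions four each, and the three-way interaction eight, which together with the intercept accounts for the 27 columns.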

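【Comments】:

Thanks DWin. But what if we have more than three categories? With four categories (such as A, B, C, and D), there will be groupings with two dummy values that each cover two categories (for example, dummy "0" for A and B, "1" for C and D, and so on).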
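
One possible way to handle any number of categories is to enumerate every grouping directly. The sketch below is not from the answer above: `all_groupings` is a hypothetical helper that lists all groupings as vectors of group indices, and the column order it produces differs from the order used in the question's table.

```r
# Enumerate every way to group n categories. Each grouping is a vector whose
# i-th element is the group index of category i; a new index may exceed the
# current maximum by at most one, so no grouping is listed twice.
all_groupings <- function(n) {
  out <- list(1L)                       # the first category always opens group 1
  if (n > 1) {
    for (i in 2:n) {
      out <- unlist(lapply(out, function(g) {
        lapply(seq_len(max(g) + 1L), function(k) c(g, k))
      }), recursive = FALSE)
    }
  }
  do.call(rbind, out)                   # one row per grouping, one column per category
}

# Apply it to the example data: 4 x A, 3 x B, 5 x C.
category <- factor(rep(c("A", "B", "C"), times = c(4, 3, 5)))
parts <- all_groupings(nlevels(category))

# Look up each sample's group index under every grouping.
dummies <- t(parts)[as.integer(category), , drop = FALSE]
colnames(dummies) <- paste0("grouping", seq_len(ncol(dummies)))

result <- data.frame(sample_num = seq_along(category), category, dummies)
head(result)
```

Calling `all_groupings(4)` covers the four-category case in the same way; the number of groupings grows quickly with the number of categories, so this is only practical for small category sets.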