Convert "select all that apply" answers to binary choices

【Question title】: Convert "select all that apply" to binary choices
【Posted】: 2014-02-19 23:10:08
【Question】:

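I have a data frame of survey responses in which some columns are questions where participants could select more than one answer ("select all that apply").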

> age <- c(24, 28, 44, 55, 53)
> ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi")
> ethnicity_other <- c(NA, NA, "luvale", NA, NA) 
> df <- data.frame(age, ethnicity, ethnicity_other)

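I would like to turn these questions into binary response items, so that each response option (in this example, the values found in ethnicity and ethnicity_other) becomes its own column vector of 0s and 1s.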

So far, I have written a script that splits the unique responses into a list (z):

> x <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity_other), " ")),    mode="list"))
> y <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity), " ")), mode="list"))
>
> combine <- c(x, y)
>
> z <- NULL
> for(i in combine){
> if(!is.na(i)){
> z <- append(z, i)
>   }   
> }

I then created new columns from that list and filled them with NA values.

> for(elm in z){
>   df[paste0("ethnicity_",elm)]  <- NA
> }

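So now I have 35 additional columns that I would like to fill with 1s and 0s, depending on whether the column name (or rather part of it, since I prefixed each one with ethnicity_) can be found in the corresponding cell under ethnicity or ethnicity_other. I have tried several approaches but have not found a good solution.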

【Question comments】:

Tags: r survey

【Solution 1】:

Here are a couple of approaches using plyr or data.table.

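# as.character() guards against factor columns; any NA entries drop out
# later because factor() excludes NA from its levels by default
all_ethnicities <- unique(c(
    unlist(strsplit(as.character(df$ethnicity), " ")),
    unlist(strsplit(as.character(df$ethnicity_other), " "))
    ))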

df$id <- 1:nrow(df)

library(plyr)

ddply(df, .(id), function(x)
      table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")),
                   levels = all_ethnicities)))

##    id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0

library(data.table)

DT <- data.table(df)

DT[, {
    as.list(
        table(
            factor(
                unlist(strsplit(paste(ethnicity, ethnicity_other), " ")),
                levels = all_ethnicities)
            )
        )
}, by = id]

##     id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
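
For comparison, the same idea can be sketched without any packages. This is only a rough base R outline, not part of the approaches above: the names tokens and indicators are made up for illustration, and the data frame is rebuilt with stringsAsFactors = FALSE so that both columns are plain character vectors.

```r
# Toy data from the question, kept as character rather than factor
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# One character vector of selected options per respondent, missing answers dropped
tokens <- Map(
  function(a, b) {
    parts <- unlist(strsplit(c(a, b), " "))
    parts[!is.na(parts)]
  },
  df$ethnicity, df$ethnicity_other
)

# Every distinct option, in order of first appearance
all_ethnicities <- unique(unlist(tokens))

# Rows are respondents, columns are options; 1L where the option was selected
indicators <- t(vapply(
  tokens,
  function(tk) as.integer(all_ethnicities %in% tk),
  integer(length(all_ethnicities))
))
dimnames(indicators) <- list(NULL, all_ethnicities)
```

The result is a plain integer matrix, so cbind(df["age"], indicators) would attach it to the other survey columns. Note that the column order follows first appearance across respondents, which differs slightly from the order in the outputs above.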

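【Comments】:

Wow, this is great. Thank you so much. I'm a little unclear on how the ddply function works (function(x)...?), but I'll play around with it some more. I'm also trying to get every column prefixed with "ethnicity_". In my own attempt I used paste when creating the column names, but I'm having trouble seeing where the column creation happens in your first example. Thanks again!!
@chrisnyoder ddply splits the data by the id variable (in this case, one piece per row) and then applies the function to each piece. So the input x to the function will be a 1-row data.frame. Try ddply(df, .(id), function(x) browser()) to explore what the function's environment looks like. To set the column names, the simplest solution is to do it after running this, i.e. out <- ddply(df, ...) and then names(out)[names(out) != "id"] <- paste0("ethnicity_", names(out)[names(out) != "id"]). I'll add more to this answer later today when I have time.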
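
The renaming step from the comment above can be tried in isolation. In this small sketch, out is a hand-built stand-in for the ddply() result (an assumption, so that the snippet runs without plyr), and is_option is just an illustrative helper name.

```r
# Stand-in for the ddply() result: an id column plus one column per option
out <- data.frame(id = 1:2, ngoni = c(1, 0), bemba = c(0, 1))

# Prefix every column except the id column with "ethnicity_"
is_option <- names(out) != "id"
names(out)[is_option] <- paste0("ethnicity_", names(out)[is_option])
```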

【Solution 2】:

I would do it like this:

First, you need something to store each participant's ethnicities. My approach is to build a list:

ethnicities <- strsplit(as.character(df$ethnicity), " ")

For your particular example, this gives:

> ethnicities
[[1]]
[1] "ngoni"

[[2]]
[1] "bemba"

[[3]]
[1] "lozi"  "tonga"

[[4]]
[1] "bemba" "tonga" "other"

[[5]]
[1] "bemba" "tongi"

Then, loop over these to fill in your data.frame df:

for (i in seq_along(ethnicities)) {
  for (eth in ethnicities[[i]]) {
    # mark respondent i as having selected option eth
    df[[paste0("ethnicity_", eth)]][i] <- 1
  }
}

The resulting value of df should be:

> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1

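There are other ways to do this. You could also wrap these two for loops in an sapply, but I don't think the resulting code would be any more efficient (and it would be harder to read!).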
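
For the curious, a loop-free variant might look roughly like the sketch below. It is only an illustration under a few assumptions: the names choices and flags are made up, only the ethnicity column is used to keep it short, and the data frame is rebuilt here so the snippet runs on its own.

```r
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  stringsAsFactors = FALSE
)

# One character vector of selected options per respondent
ethnicities <- strsplit(df$ethnicity, " ")
choices <- unique(unlist(ethnicities))

# sapply() over the options yields a respondents x options matrix of 0/1 values
flags <- sapply(choices, function(opt) {
  as.integer(vapply(ethnicities, function(resp) opt %in% resp, logical(1)))
})
colnames(flags) <- paste0("ethnicity_", choices)

# Attach the indicator columns to the original data
df <- cbind(df, flags)
```

This fills in 0 directly rather than NA, so no separate initialization of the columns is needed.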

Does this help?

Edit:

By the way, if you really want 0 rather than NA in the data.frame, it is as simple as changing the code that initializes the added columns:

> for(elm in z){
>   df[paste0("ethnicity_",elm)]  <- 0 # instead of NA
> }
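
If the columns have already been filled and still contain NA, they can also be zero-filled after the fact. The sketch below uses a tiny made-up data frame and assumes that every column whose name starts with ethnicity_ is an indicator column; in the real data, a free-text column such as ethnicity_other would need to be excluded first.

```r
# Small stand-in for the data frame after the loop has run
df <- data.frame(
  age = c(24, 28),
  ethnicity_ngoni = c(1, NA),
  ethnicity_bemba = c(NA, 1)
)

# Pick out the indicator columns by their shared prefix
indicator_cols <- grep("^ethnicity_", names(df), value = TRUE)

# Replace the remaining NA values in those columns with 0
df[indicator_cols][is.na(df[indicator_cols])] <- 0
```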

【Comments】:

【Solution 3】:

Here is an approach using concat.split.expanded from my "splitstackshape" package:

## Combine your "ethnicity" and "ethnicity_other" column
df$ethnicity <- paste(df$ethnicity, 
                      ifelse(is.na(df$ethnicity_other), "", 
                             as.character(df$ethnicity_other)))
## Drop the original "ethnicity_other" column
df$ethnicity_other <- NULL

## Split up the new "ethnicity" column
library(splitstackshape)
concat.split.expanded(df, "ethnicity", sep=" ", 
                      type="character", fill=0, drop=TRUE)
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1

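The fill argument can easily be set to any other value you like. It defaults to NA, but here I have set it to 0 since I think that is what you are looking for.

【Comments】: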