How to group comma-separated variables in the same column?

【Question title】: How to group comma-separated variables in the same column?
【Posted】: 2020-07-15 17:42:22
【Question description】:

Here is my fake data:

#> id   column                 
#> 1    blue, red, dog, cat
#> 2    red, blue, dog
#> 3    blue      
#> 4    red
#> 5    dog, cat   
#> 6    cat
#> 7    red, cat
#> 8    dog
#> 9    cat, red
#> 10   blue, cat

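For example, I want to tell R that dog and cat = animal and that red and blue = colour. Basically, I want to count how many rows are animal, colour, or both (and eventually get the percentages).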

#> id   column                 newcolumn
#> 1    blue, red, dog, cat    both
#> 2    red, blue, dog         both
#> 3    blue                   colour
#> 4    red                    colour
#> 5    dog, cat               animal
#> 6    cat                    animal
#> 7    red, cat               both
#> 8    dog                    animal
#> 9    cat, red               both
#> 10   blue, cat              both

So far, I have only managed to total the number of red, blue, dog and cat by doing the following:

column.string<-paste(df$column, collapse=",")
column.vector<-strsplit(column.string, ",")[[1]]
column.vector.clean<-gsub(" ", "", column.vector)
table(column.vector.clean)

Thank you very much for your help. Here is my fake data example:

df <- data.frame(id = c(1:10), 
                 column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"))

【Question comments】:

Tags: r

【Solution 1】:

You can define all possible animals and colours in vectors. Split column on commas and test each row:

animal <- c('dog', 'cat')
colour <- c('red', 'blue')

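# Split each row on commas (plus any following spaces) and classify the pieces
df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) {
                 # Drop missing values, whether a real NA or the string "NA"
                 x <- x[!is.na(x) & x != "NA"]
                 if(!length(x)) return(NA)
                 if(all(x %in% animal)) 'animal'
                 else if(all(x %in% colour)) 'colour'
                 else 'both'   # mixed rows, or any value found in neither vector
                 })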

df
#   id              column newcolumn
#1   1 blue, red, dog, cat      both
#2   2      red, blue, dog      both
#3   3                blue    colour
#4   4                 red    colour
#5   5            dog, cat    animal
#6   6                 cat    animal
#7   7            red, cat      both
#8   8                 dog    animal
#9   9            cat, red      both
#10 10           blue, cat      both

To calculate the proportions, you can use prop.table together with table:

prop.table(table(df$newcolumn, useNA = "ifany"))

#animal   both colour 
#   0.3    0.5    0.2
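
Since the question asks for counts as well as percentages, here is a minimal sketch that puts both in one table. It assumes the newcolumn values shown above (re-created by hand so the block runs on its own); the names counts, percent and summary_df are only illustrative.

```r
# Assumed input: the newcolumn values from the output above,
# typed in directly so this block does not depend on df.
newcolumn <- c("both", "both", "colour", "colour", "animal",
               "animal", "both", "animal", "both", "both")

# Counts per group (NA would get its own cell if present)
counts <- table(newcolumn, useNA = "ifany")

# Proportions converted to percentages, rounded to one decimal place
percent <- round(100 * prop.table(counts), 1)

# Combine counts and percentages into one data frame
summary_df <- data.frame(group   = names(counts),
                         n       = as.vector(counts),
                         percent = as.vector(percent))
summary_df
```

Keeping the counts next to the percentages makes it easy to check that the percentages are based on the number of rows you expect.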

Using dplyr, we can split the rows on commas, create a newcolumn for each id based on the conditions, and calculate the proportions.

library(dplyr)

df %>%
  tidyr::separate_rows(column, sep = ',\\s*') %>%
  group_by(id) %>%
  summarise(newcolumn = case_when(all(column %in% animal) ~ 'animal', 
                                  all(column %in% colour) ~ 'colour', 
                                  TRUE ~ 'both'),
            column = toString(column)) %>%
  count(newcolumn) %>%
  mutate(n = n/sum(n))

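【Comments】:

Thanks for your answer! I tried the first solution but ended up with the error Error in strsplit(df$column, ",\\s*") : non-character argument. Am I missing something? Do I have to install any packages?
No, that is because in your version of R column is of class "factor". First run df$column <- as.character(df$column) to convert it to character, then try the answer.
I have one more question: what happens if there are 'NA' values? Can I exclude them from "both", or are they excluded automatically?
If they are actual NA values, you can add the line x <- na.omit(x) at the start of the sapply call, before the if, to remove them.
Sorry, just to be sure, like this? df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) { x <- na.omit(x) if(all(x %in% animal)) 'animal' else if(all(x %in% colour)) 'colour' else 'both' })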
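
For reference, a minimal sketch of how the two points from these comments (the factor column and the NA values) might be handled together. The helper name classify and the small data frame df2 are hypothetical and used only for illustration. The sketch assumes missing entries can appear either as real NA or as the string "NA", and that rows with nothing left after removing them should stay NA rather than count as "both".

```r
animal <- c("dog", "cat")
colour <- c("red", "blue")

# Hypothetical helper: classify the pieces of one split row
classify <- function(x) {
  # Remove real NA values and the literal string "NA"
  x <- x[!is.na(x) & x != "NA"]
  # Nothing left to classify: keep the result missing
  if (!length(x)) return(NA_character_)
  if (all(x %in% animal)) "animal"
  else if (all(x %in% colour)) "colour"
  else "both"
}

# Illustrative data with a factor column, a real NA and a string "NA"
df2 <- data.frame(id = 1:4,
                  column = c("blue, cat", NA, "dog", "NA"),
                  stringsAsFactors = TRUE)

# as.character() avoids the "non-character argument" error for factors
parts <- strsplit(as.character(df2$column), ",\\s*")
df2$newcolumn <- vapply(parts, classify, character(1))
df2
```

Calling prop.table(table(df2$newcolumn)) afterwards leaves the NA rows out of the proportions, while adding useNA = "ifany" to table reports them as a separate group.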