How can I create a single column from multiple columns combined? Answers

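【Question title】: How can I create a single column from multiple columns combined?
【Posted】: 2022-02-04 12:39:34
【Question】:

The dataset I am working with records respondents' ethnicity. Responses are recorded across multiple variables, and respondents were allowed to select more than one. Example: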

ethnicity1     ethnicity2     ethnicity3    ethnicity4     ethnicity5     ethnicity6
         1              0              0             0              0              0        
         0              2              0             0              0              0    
         0              0              3             4              0              0

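Each ethnicity has its own column. I have used the recode command to recode each column so that a different number represents a different ethnicity (i.e., Black is 1, White is 2, etc.) in an attempt to build a single ethnicity variable. I want to:

A) Create one column from the multiple columns combined

B) Have it so that anyone who reports more than one column is designated as "multiple".

My expected output looks like this: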

Ethnicity
      1
      2
     999

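(I am not sure whether it is better to represent multiple ethnicities with a numeric value for coding purposes, or to make it a character value such as "multiple".)

Initially I tried the following, but it did not work the way I hoped.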

Ethnicity <- df %>% dplyr::na_if(0)
## create column for  ethnicity
Ethnicity %>% unite("RaceEthnicity", ethnicity1:ethnicity5, na.rm = TRUE, remove = FALSE)


Tags: r dplyr multiple-columns

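【Solution 1】:

Here is a tidyverse solution. I assume your data has a column identifying the respondent. I have added one and named it ID.

To see what is happening, run the code incrementally: add one line at a time, up to but not including the trailing pipe (%>%), and inspect the output.

The columns passed to pivot_longer will depend on what your real data looks like: here the ethnicity columns are 1-6 and ID is column 7.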

library(dplyr)
library(tidyr)

mydata %>% 
  # add IDs for respondent
  mutate(ID = LETTERS[1:3]) %>%
  # convert to 'long' format 
  pivot_longer(1:6) %>% 
  # remove zero value rows
  filter(value != 0) %>% 
  # group by person
  group_by(ID) %>% 
  # use value where there is one row per person, otherwise use 999
  # we need doubles for both values (existing are int)
  summarise(ethnicity = case_when(n() == 1 ~ as.double(value), 
                                  TRUE ~ 999)) %>% 
  ungroup() %>% 
  # discard duplicate rows
  distinct()

Result:

ID    ethnicity
  <chr>     <dbl>
1 A             1
2 B             2
3 C           999
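
If the positions of the ethnicity columns are not known in advance, the same pipeline can be sketched with name-based selection. This is a variant under stated assumptions, not the answer's original code: it assumes every relevant column name starts with "ethnicity", and it swaps case_when for a plain if/else so that summarise returns exactly one row per respondent, which makes the distinct() step unnecessary. Respondents with no selection at all are dropped by the filter step.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (correctly spelled column names)
mydata <- data.frame(
  ethnicity1 = c(1L, 0L, 0L), ethnicity2 = c(0L, 2L, 0L),
  ethnicity3 = c(0L, 0L, 3L), ethnicity4 = c(0L, 0L, 4L),
  ethnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)
)

result <- mydata %>%
  # add respondent IDs, as in the answer
  mutate(ID = LETTERS[1:3]) %>%
  # assumption: all ethnicity columns share the "ethnicity" prefix
  pivot_longer(starts_with("ethnicity")) %>%
  # keep only the selected (non-zero) codes
  filter(value != 0) %>%
  group_by(ID) %>%
  # one code -> keep it; several codes -> 999
  summarise(ethnicity = if (n() == 1) as.double(value) else 999)

print(result)
```

Selecting by name keeps the code working if the ID column or other variables are added before the ethnicity columns.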

Sample data with corrected column names:

mydata <- structure(list(ethnicity1 = c(1L, 0L, 0L), 
                         ethnicity2 = c(0L, 2L, 0L), 
                         ethnicity3 = c(0L, 0L, 3L), 
                         ethnicity4 = c(0L, 0L, 4L), 
                         ethnicity5 = c(0L, 0L, 0L), 
                         ethnicity6 = c(0L, 0L, 0L)), 
                    class = "data.frame", 
                    row.names = c(NA, -3L))


【Solution 2】:

Here is an approach using dplyr and purrr:

library(dplyr);library(purrr)
df %>%
  mutate(RaceEthnicity = select(cur_data(), enthnicity1:ethnicity6) %>%
                                  {case_when(pmap_lgl(., ~ all(is.na(.x))) ~ NA_real_,
                                             rowSums(.,na.rm = TRUE) == 0 ~ 0,
                                             rowSums(.,na.rm = TRUE) != pmap_int(.,pmax,na.rm = TRUE) ~ 999,
                                             TRUE ~ rowSums(.,na.rm = TRUE))})

Output:

  enthnicity1 enthnicity2 ethnicity3 enthnicity4 enthnicity5 ethnicity6 RaceEthnicity
1           1           0          0           0           0          0             1
2           0           2          0           0           0          0             2
3           0           0          3           4           0          0           999

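This may not be the most beginner-friendly approach, but it lets you define the columns in the select call. After selecting, we pipe the data into a set of {} so that the data can be referred to with the . symbol. From there, we use dplyr::case_when to test several conditions:

If all columns are NA, return NA
If rowSums = 0, return 0
If rowSums is not equal to the row maximum, return 999
Otherwise, return the rowSum (since only one value is non-zero, the sum is the ethnicity of interest).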
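
The four rules above can be checked on plain vectors. The following base R sketch is an illustration rather than the answer's code: it assumes correctly spelled column names, uses hypothetical variable names (row_sum, row_max, all_na), and adds two made-up rows (all zeros, all NA) purely to exercise the first two rules.

```r
# Rows 1-3 are the question's sample; rows 4-5 are illustrative additions
df <- data.frame(
  ethnicity1 = c(1L, 0L, 0L, 0L, NA), ethnicity2 = c(0L, 2L, 0L, 0L, NA),
  ethnicity3 = c(0L, 0L, 3L, 0L, NA), ethnicity4 = c(0L, 0L, 4L, 0L, NA),
  ethnicity5 = c(0L, 0L, 0L, 0L, NA), ethnicity6 = c(0L, 0L, 0L, 0L, NA)
)

m <- as.matrix(df)
row_sum <- unname(rowSums(m, na.rm = TRUE))                # sum of codes per row
row_max <- do.call(pmax, c(as.list(df), na.rm = TRUE))     # largest code per row
all_na  <- unname(rowSums(!is.na(m)) == 0)                 # TRUE when nothing was recorded

# With a single non-zero code the sum equals the maximum;
# with several codes the sum is strictly larger.
result <- ifelse(all_na, NA_real_,
          ifelse(row_sum == 0, 0,
          ifelse(row_sum != row_max, 999, row_sum)))

print(result)
```

The comparison of sum and maximum relies on all codes being positive, which holds for the recoding described in the question.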

Note that you misspelled some of the column names.

Data:

df <- structure(list(enthnicity1 = c(1L, 0L, 0L), enthnicity2 = c(0L, 
2L, 0L), ethnicity3 = c(0L, 0L, 3L), enthnicity4 = c(0L, 0L, 
4L), enthnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -3L))


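【Solution 3】:

Here is another tidyverse solution. I create a new column (with mutate) and use pmap over all the columns whose names start with "ethnicity". For each row, I combine the values into a vector, drop all the 0s, replace the contents with 999 if more than one value remains, and keep only the single unique value.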

library(tidyverse)

df %>%
  mutate(Ethnicity = pmap(
    select(., starts_with("ethnicity")),
    ~ c(...) %>%
      keep(~ all(. != 0)) %>%
      replace(length(.) > 1, 999) %>%
      unique
  ))

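If you need to select the columns explicitly (because your real data may not have the word "ethnicity" in every column name), you can supply the column indices (e.g., c(1:6)) or the column names (as shown below).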

df %>%
  mutate(Ethnicity = pmap(
    select(., c("ethnicity1", "ethnicity2", "ethnicity3", "ethnicity4", "ethnicity5", "ethnicity6")),
    ~ c(...) %>%
      keep(~ all(. != 0)) %>%
      replace(length(.) > 1, 999) %>%
      unique
  ))

Another option is to use mutate with ifelse and change any row that has more than one value to 999.

library(tidyverse)

df %>%
  mutate(Ethnicity = pmap(select(., starts_with("ethnicity")),  ~ c(...) %>%
                            keep( ~ all(. != 0)))) %>%
  rowwise %>%
  mutate(Ethnicity = ifelse(length(Ethnicity) > 1, 999, Ethnicity)) %>%
  select(Ethnicity)

Output

# A tibble: 3 × 1
# Rowwise: 
  Ethnicity
      <dbl>
1         1
2         2
3       999
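
Because pmap returns a list, the first two snippets above produce a list column. If a plain numeric column is preferred, one possible variant is pmap_dbl with a small helper. This is a sketch, not the answer's code: collapse_ethnicity is a hypothetical name, and returning NA for a respondent with no selection is an assumption about the desired behaviour.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  ethnicity1 = c(1L, 0L, 0L), ethnicity2 = c(0L, 2L, 0L),
  ethnicity3 = c(0L, 0L, 3L), ethnicity4 = c(0L, 0L, 4L),
  ethnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)
)

# Hypothetical helper: collapse one row of codes into a single number
collapse_ethnicity <- function(...) {
  x <- c(...)          # all values of the current row
  x <- x[x != 0]       # drop the "not selected" zeros
  if (length(x) == 0) {
    NA_real_           # assumption: nothing selected -> missing
  } else if (length(x) > 1) {
    999                # more than one code -> "multiple"
  } else {
    as.double(x[[1]])  # exactly one code -> keep it
  }
}

result <- df %>%
  mutate(Ethnicity = pmap_dbl(select(., starts_with("ethnicity")),
                              collapse_ethnicity))

print(result$Ethnicity)
```

pmap_dbl raises an error if the helper ever returns something other than a single number, which guards against silently creating a list column.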

Data

df <-
  structure(
    list(
      ethnicity1 = c(1L, 0L, 0L),
      ethnicity2 = c(0L, 2L, 0L),
      ethnicity3 = c(0L, 0L, 3L),
      ethnicity4 = c(0L, 0L, 4L),
      ethnicity5 = c(0L, 0L, 0L),
      ethnicity6 = c(0L, 0L, 0L)
    ),
    class = "data.frame",
    row.names = c(NA,-3L)
  )


【Solution 4】:

In base R you could do:

aggregate(.~row, data.frame(which(df>0, TRUE)), \(x) if(length(x)>1)999 else x)

  row col
1   1   1
2   2   2
3   3 999
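
To see what the one-liner operates on, it can help to look at the intermediate index matrix. Note that the `col` column in the output above is the column position of the non-zero entry; it matches the ethnicity code here only because column k was recoded to the value k. The sketch below is an illustrative variant (the names idx, values, and per_row are hypothetical) that looks up the stored values instead of the positions.

```r
df <- data.frame(
  ethnicity1 = c(1L, 0L, 0L), ethnicity2 = c(0L, 2L, 0L),
  ethnicity3 = c(0L, 0L, 3L), ethnicity4 = c(0L, 0L, 4L),
  ethnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)
)

# One (row, col) pair for every non-zero cell
idx <- which(df > 0, arr.ind = TRUE)

# Look up the stored codes at those positions
values <- as.matrix(df)[idx]

# Collapse per respondent: several codes -> 999, otherwise the code itself
per_row <- tapply(values, idx[, "row"],
                  function(x) if (length(x) > 1) 999 else x)

print(per_row)
```

As with the original one-liner, respondents with no non-zero entry do not appear in the result at all.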


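【Solution 5】:

I would suggest another strategy to consider. If the number of ethnicityN columns is limited (fewer than 32 in the simple case), a better approach might be to use a bitmask. This technique is used in many languages for similar purposes, e.g. MySQL SET columns, Pascal/Delphi sets, etc. In this case, the result column would contain the following values: c(1L, 2L, 12L)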
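
The answer gives no code, so the following is only one way the bitmask idea could be sketched in base R. It assumes that ethnicityN is mapped to bit N-1 (weights 1, 2, 4, 8, ...) and that any non-zero entry means "selected"; the variable names are hypothetical.

```r
df <- data.frame(
  ethnicity1 = c(1L, 0L, 0L), ethnicity2 = c(0L, 2L, 0L),
  ethnicity3 = c(0L, 0L, 3L), ethnicity4 = c(0L, 0L, 4L),
  ethnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)
)

# 1 where the respondent selected the column, 0 otherwise
present <- (as.matrix(df) != 0) * 1L

# Assumed mapping: column k contributes 2^(k - 1)
weights <- 2^(seq_len(ncol(present)) - 1)

# One integer per respondent encoding the full set of selections
mask <- as.integer(present %*% weights)
print(mask)

# Decoding: did the respondent select ethnicity3 (bit value 4)?
has_eth3 <- bitwAnd(mask, 4L) != 0

# More than one bit set <=> mask is not a power of two
is_multiple <- bitwAnd(mask, mask - 1L) != 0
```

Unlike the 999 code, the mask keeps track of which ethnicities were combined, so the "multiple" flag can be derived later without losing information.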


【Solution 6】:

Maybe it is as simple as this? Or am I overlooking something?

library(dplyr)
df %>% 
  mutate(Ethnicity = rowSums(select(., contains("ethnicity"))),
         Ethnicity = ifelse(Ethnicity > 2, 999, Ethnicity))

  ethnicity1 ethnicity2 ethnicity3 ethnicity4 ethnicity5 ethnicity6 Ethnicity
1          1          0          0          0          0          0         1
2          0          2          0          0          0          0         2
3          0          0          3          4          0          0       999
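
One caveat with the threshold `Ethnicity > 2`: it reproduces the sample output, but a respondent who selected only ethnicity3 (code 3) also has a row sum above 2 and would be labelled 999. A sketch that counts the non-zero entries instead is shown here. It assumes, as in the question, that 0 means "not selected"; the fourth row and the n_selected column are illustrative additions.

```r
library(dplyr)

# Rows 1-3 are the question's sample; row 4 (only code 3) is an illustrative addition
df <- data.frame(
  ethnicity1 = c(1L, 0L, 0L, 0L), ethnicity2 = c(0L, 2L, 0L, 0L),
  ethnicity3 = c(0L, 0L, 3L, 3L), ethnicity4 = c(0L, 0L, 4L, 0L),
  ethnicity5 = c(0L, 0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L, 0L)
)

result <- df %>%
  mutate(
    # number of ethnicity columns the respondent selected
    n_selected = rowSums(select(., starts_with("ethnicity")) != 0),
    # several selections -> 999; otherwise the row sum is the single code
    Ethnicity = ifelse(n_selected > 1, 999,
                       rowSums(select(., starts_with("ethnicity"))))
  )

print(result$Ethnicity)
```

With this rule the single-code respondent in row 4 keeps the value 3, while row 3 is still marked as 999.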
