Split column variables into new columns in combination with original column name: answers

【Question title】: Split column variables into new columns in combination with original column name
【Posted】: 2018-05-29 08:29:37
【Question description】:

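I have a data frame in which many cells hold multiple entries. There are three kinds of columns: columns with only 1/0, columns with 1/0 and some other entries, and columns without any 1/0.

What I want to do is split every column that contains other values (usually two or more entries) into x new columns, one per unique value in the column. Each new column is named column name + cell value and holds 1/0 for presence/absence. Columns that contain only 1/0 stay as they are.

Note: my original data frame is larger and has many more columns. The cell contents can also vary from data frame to data frame, so I want this to work no matter what, or how many, entries the cells hold. Also note that some columns should not be split, either because they contain only 1/0 (e.g. emrY) or because they hold other data (e.g. T_CIP).

The data frame: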

structure(list(id = 1:10, emrA = c("I219V, T286A", "I219V", "I219V", 
"I219V", "I219V", "R164H, I219V", "R164H, I219V", "R164H, I219V", 
"R164H, I219V", "R164H, I219V"), gyrA_8 = c("S83L,678E", "D87N", 
"S83L,252G", "S83L,678E", "S83L,678E", "S83L,828T", "S83L,828T", 
"S83L,828T", "S83L,828T", "S83L,828T"), emrY = c("0", "1", "1", 
"1", "1", "1", "1", "1", "1", "1"), T_CIP = c(0.25, 0.12, 0.12, 
0.25, 0.25, 0.5, 2, 1, 1, 2)), .Names = c("id", "emrA", "gyrA_8", 
"emrY", "T_CIP"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L))

What it looks like:

     id emrA         gyrA_8    emrY  T_CIP
      1 I219V, T286A S83L,678E 0     0.25
      2 I219V        D87N      1     0.12
      3 I219V        S83L,252G 1     0.12
      4 I219V        S83L,678E 1     0.25
      5 I219V        S83L,678E 1     0.25
      6 R164H, I219V S83L,828T 1     0.5
      7 R164H, I219V S83L,828T 1     2
      8 R164H, I219V S83L,828T 1     1
      9 R164H, I219V S83L,828T 1     1
     10 R164H, I219V S83L,828T 1     2

What I want to get:

id   emrA_I219V    emrA_T286A   emrA_R164H   gyrA_8_S83L   gyrA_8_678E   gyrA_8_D87N   gyrA_8_252G   gyrA_8_828T   emrY   T_CIP
 1   1             1            0            1             1             0             0             0             0      0.25
 2   1             0            0            0             0             1             0             0             1      0.12
 3   1             0            0            1             0             0             1             0             1      0.12
 4   1             0            0            1             1             0             0             0             1      0.25
 5   1             0            0            1             1             0             0             0             1      0.25
 6   1             0            1            1             0             0             0             1             1      0.5
 7   1             0            1            1             0             0             0             1             1      2
 8   1             0            1            1             0             0             0             1             1      1
 9   1             0            1            1             0             0             0             1             1      1
10   1             0            1            1             0             0             0             1             1      2

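The emrY column is not split because it contains only 1/0. T_CIP (and other columns like it) is not split because it holds other data.

Is there a way to do this with the tidyverse packages?

Edit:

I don't think the question flagged as a duplicate answers mine: it does not involve multiple columns with differing content, and it is about dummy variables directly, which does not seem to cover what I want to do here.

【Question discussion】:

Possible duplicate of Generate a dummy-variable

Tags: r tidyverse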

【Solution 1】:

I would first set the names of the columns to process:

names_to_proc <- c("emrA", "gyrA_8")
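
Since the question asks for something that works regardless of which columns are present, the hard-coded names could perhaps be derived from the data instead. The following is only a sketch under an assumed rule: a column is split when it is of type character and holds anything other than "0"/"1". The helper name is_splittable and the small example data frame are hypothetical, not part of the original answer.

```r
# Hypothetical subset of the question's data (base R only)
df_small <- data.frame(
  id     = 1:3,
  emrA   = c("I219V, T286A", "I219V", "R164H, I219V"),
  gyrA_8 = c("S83L,678E", "D87N", "S83L,828T"),
  emrY   = c("0", "1", "1"),
  T_CIP  = c(0.25, 0.12, 0.5),
  stringsAsFactors = FALSE
)

# Assumed rule: split character columns that contain something besides "0"/"1".
# Missing values are ignored when checking the column content.
is_splittable <- function(x) {
  is.character(x) && !all(x[!is.na(x)] %in% c("0", "1"))
}

# Names of the columns that satisfy the rule, in their original order
names_to_proc <- names(df_small)[vapply(df_small, is_splittable, logical(1))]
print(names_to_proc)
```

Numeric columns such as id and T_CIP are skipped by the is.character() check, and emrY is skipped because it holds only "0" and "1". A different data set may need a different rule.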

Let's build a function that generates a set of new 1/0 columns for a given column:

# Requires dplyr (select, %>%) and stringr (str_split, regex)
library(dplyr)
library(stringr)

# @ col_name is one of the names_to_proc
AddCol <- function(df, col_name) {
    # split each cell on a comma, with or without a following space
    string_to_proc <- df %>% select(!!col_name) %>%
       unlist() %>% str_split(regex("\\, |\\,"))
    # find unique entries
    unique_strings <- string_to_proc %>%
       unlist() %>% unique()
    # construct names of the new columns
    cols_names <- paste(col_name, unique_strings, sep = "_")
    # construct 0/1-content columns: one column per unique entry,
    # 1 where the row's split cell contains that entry
    cols_content <- sapply(seq_along(unique_strings), function(i) {
        as.integer(unlist(lapply(string_to_proc,
            function(Z) any(Z %in% unique_strings[i]))))
    })
    res <- data.frame(cols_content)
    names(res) <- cols_names
    return(res)
}
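
To make the three steps inside the function easier to follow (split, collect unique entries, build indicators), here is a small stand-alone sketch. It uses base R's strsplit() in place of str_split() so that it runs without any packages, and the input vector cells is a hypothetical example rather than the real column.

```r
# Hypothetical input: three cells from a column such as gyrA_8
cells <- c("S83L,678E", "D87N", "S83L, 828T")

# Step 1: split each cell on a comma followed by optional whitespace
pieces <- strsplit(cells, ",\\s*")

# Step 2: collect the distinct entries in order of first appearance
entries <- unique(unlist(pieces))

# Step 3: build one integer 0/1 column per entry (rows = cells)
indicators <- vapply(
  entries,
  function(entry) {
    as.integer(vapply(pieces, function(p) entry %in% p, logical(1)))
  },
  integer(length(cells))
)

# Prefix the entry names with the (assumed) source column name
colnames(indicators) <- paste("gyrA_8", entries, sep = "_")
print(indicators)
```

The result is a matrix with one row per cell and one column per distinct entry, which is the shape the function then wraps in a data frame.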

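Finally, apply the function to get the set of columns that should replace the processed ones. The 1/0 data frames computed for each value of names_to_proc are bound together with bind_cols():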

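# @ df_test is the initial data frame
# lapply() always returns a list of data frames; sapply() could simplify
# them into a list-matrix when every column yields the same number of new columns
cols_to_add <- lapply(names_to_proc, function(nm) AddCol(df = df_test, col_name = nm)) %>%
    bind_cols()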

Add the resulting block to the initial data frame, then apply a few extra transformations to get the data into the desired format:

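df_test %>% bind_cols(cols_to_add) %>%
    # drop the original split columns by name rather than by position
    select(-all_of(names_to_proc)) %>%
    # move emrY and T_CIP to the end
    select(-(emrY:T_CIP), everything())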
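
For comparison, the same reshaping can probably be written more compactly with tidyr. This is a different technique from the function above, not the answerer's method, and it is only a sketch. It assumes that id uniquely identifies each row, that dplyr 1.0 or later and tidyr 1.0 or later are installed, and it uses a four-row subset of the question's data. The helper name split_to_dummies is hypothetical.

```r
library(dplyr)
library(tidyr)

# Four-row subset of the question's data
df_test <- tibble(
  id     = 1:4,
  emrA   = c("I219V, T286A", "I219V", "R164H, I219V", "I219V"),
  gyrA_8 = c("S83L,678E", "D87N", "S83L,828T", "S83L,252G"),
  emrY   = c("0", "1", "1", "1"),
  T_CIP  = c(0.25, 0.12, 0.5, 0.12)
)
names_to_proc <- c("emrA", "gyrA_8")

# Hypothetical helper: one row per (id, entry), then spread entries to 0/1 columns
split_to_dummies <- function(df, col_name) {
  df %>%
    select(id, all_of(col_name)) %>%
    separate_rows(all_of(col_name), sep = ",\\s*") %>%
    distinct() %>%                      # guard against repeated entries in a cell
    mutate(present = 1L) %>%
    pivot_wider(
      names_from   = all_of(col_name),
      values_from  = present,
      names_prefix = paste0(col_name, "_"),
      values_fill  = list(present = 0L)
    )
}

# Build one indicator table per column, then join each back onto the rest by id
dummies <- lapply(names_to_proc, function(nm) split_to_dummies(df_test, nm))
result <- Reduce(
  function(acc, d) left_join(acc, d, by = "id"),
  dummies,
  select(df_test, -all_of(names_to_proc))
) %>%
  relocate(emrY, T_CIP, .after = last_col())

print(result)
```

Joining on id rather than binding columns by position means the result does not depend on the row order of the intermediate tables, at the cost of requiring a unique key.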

Hope this helps.

【Discussion】:

Awesome! Thanks!