【Question title】: R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx
【Posted】: 2020-04-18 04:34:27
【Question】:

I have the following data frame

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three"))

which looks like this

                x
1             one
2        one, two
3      two, three
4 one, two, three

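I would like to split this x column into several columns, one for each distinct word that appears in x. Basically, I want the end result to look like this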

    one  two  three
1    1    0     0
2    1    1     0
3    0    1     1
4    1    1     1

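I think that to get this data frame I probably need the separate function provided by tidyr and documented here. However, that requires knowing regular expressions, and I'm not good with them. Can anyone help me get this data frame?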

Important: I know neither the number of distinct words nor their spelling in advance.

Important example

It should also work for empty strings. For example, if we have

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))

then it should still work.

【Comments】:

Try library(splitstackshape); cSplit_e(df, split.col = "x", fixed = TRUE, type = "character", drop = TRUE, fill = 0L)
Possible duplicate: R: Split Variable Column into multiple (unbalanced) columns by comma
@markus I'll take a look at that question

Tags: r regex tidyverse tidyr regex-lookarounds

【Solution 1】:

With the tidyverse, we can split the 'x' column with separate_rows, create a sequence column, and then use pivot_wider from tidyr

library(dplyr)
library(tidyr)
df %>% 
   filter(!(is.na(x)|x==""))%>% 
   mutate(rn = row_number()) %>% 
   separate_rows(x) %>%
   mutate(i1 = 1) %>% 
   pivot_wider(names_from = x, values_from = i1, values_fill = list(i1 = 0)) %>%
   select(-rn)
# A tibble: 4 x 3
#    one   two three
#  <dbl> <dbl> <dbl>
#1     1     0     0
#2     1     1     0
#3     0     1     1
#4     1     1     1

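In the code above, the rn column is added so that every original row keeps a distinct identifier after separate_rows expands the rows; otherwise pivot_wider may produce list columns when there are duplicate elements. The 'i1' column with value 1 is added to be used in values_from. Another option is to specify values_fn = length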
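For readers who get lost after the first mutate, here is a minimal sketch of the same pipeline split into two steps, using the values_fn = length variant mentioned above so that no helper column of 1s is needed. It assumes dplyr and a tidyr version of at least 1.1.0 (where values_fill accepts a scalar); the names long and wide are only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"))

# Step 1: tag each original row, then expand to one row per (rn, word) pair
long <- df %>%
  mutate(rn = row_number()) %>%
  separate_rows(x, sep = "\\s*,\\s*")
# long has 8 rows: rn = 1, 2, 2, 3, 3, 4, 4, 4

# Step 2: spread the words into columns; length() counts how often each word
# occurs within a row, and absent combinations are filled with 0
wide <- long %>%
  pivot_wider(names_from = x, values_from = x,
              values_fn = length, values_fill = 0)
```

Dropping rn afterwards with select(-rn) gives the same layout as the answer's output.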

Or, in base R, we can use table after splitting the 'x' column

table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])
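The one-liner can be hard to read, so here is a hedged step-by-step sketch of the same idea that uses only base R and the five-row example with the empty string. It replaces stack with an explicit row index (a swapped-in technique, not the answer's exact code) so that the empty row is kept as a row of zeros; the names words, row_id, and tab are illustrative.

```r
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three", ""))

# Split each string on commas; the empty string yields a zero-length vector
words <- strsplit(as.character(df$x), ",\\s*")

# One row index per word; the factor keeps levels for rows without any word
row_id <- factor(rep(seq_along(words), lengths(words)),
                 levels = seq_along(words))
word <- unlist(words)

# Cross-tabulate row index against word, keeping first-appearance order
tab <- table(row = row_id, word = factor(word, levels = unique(word)))

as.data.frame.matrix(tab)
#   one two three
# 1   1   0     0
# 2   1   1     0
# 3   0   1     1
# 4   1   1     1
# 5   0   0     0
```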

【Comments】:

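I like the tidyverse solution. Could you add more explanation to it? I can see that it works, but I'm not 100% sure why. The first mutate creates a column with the row number. Then separate_rows somehow splits the comma-separated words into one row per word. Then you create a column of 1s with mutate... and that's where I get lost
When I try it on a different data set (one containing NA values), the tidyverse solution throws an error.
@Euler_Salter OK, in that case just use the earlier sep="\\s*,\\s*"
Thank you so much for helping me! I asked you for a lot of things and you solved them all, great job!!
Yes, I'm almost done writing it up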

【Solution 2】:

Here is a base R solution

# split each string on ", " and save the pieces in a list `lst`
lst <- apply(df, 1, function(x) unlist(strsplit(x,", ")))

# a common set including all distinct words
common <- Reduce(union,lst)

# build a 0/1 matrix: for each element of `lst`, mark which words of `common` it contains
dfout <- `names<-`(data.frame(Reduce(rbind,lapply(lst, function(x) +(common %in% x))),row.names = NULL),common)

which gives

> dfout
  one two three
1   1   0     0
2   1   1     0
3   0   1     1
4   1   1     1
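Since the one-liner packs several less common idioms together, the following small sketch isolates each of them in base R: Reduce with union, the unary + that turns logicals into 0/1, and the function-call form of names<-. The toy list and the names step_by_step, indicator, a, and b are illustrative and not part of the answer.

```r
lst <- list("one", c("one", "two"), c("two", "three"))

# Reduce(union, lst) folds union over the list from left to right
common <- Reduce(union, lst)
step_by_step <- union(union(lst[[1]], lst[[2]]), lst[[3]])
identical(common, step_by_step)  # TRUE

# Unary + coerces TRUE/FALSE to 1/0: which words of `common` occur in element 2?
indicator <- +(common %in% lst[[2]])

# `names<-`(x, value) is the function-call form of names(x) <- value
a <- `names<-`(indicator, common)
b <- indicator
names(b) <- common
identical(a, b)  # TRUE
```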

【Comments】:

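Thanks! It seems to work. Could you comment the code?
@Euler_Salter Yes, I added some comments, please check my update
Can you explain names<-? I'm also having a hard time understanding Reduce from ?Reduce. Thanks!
@its.me.adam names<-(x,val) assigns names to an object and is equivalent to names(x) <- val. For Reduce, you may find more information at blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother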

【Solution 3】:

You can build a pattern from your columns and use it with tidyr::extract():

library(tidyverse)
cols <- c("one","two","three")
pattern <- paste0("(",cols,")*", collapse= "(?:, )*")
df %>% 
  extract(x, into = c("one","two","three"), regex = pattern) %>%
  mutate_all(~as.numeric(!is.na(.)))
#>   one two three
#> 1   1   0     0
#> 2   1   1     0
#> 3   0   1     1
#> 4   1   1     1
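Because the question says the words are not known in advance, one possible way to avoid hard-coding cols is to derive them from the data first. The base R sketch below does that and shows the regular expression that results. It rests on two assumptions: the words contain no regex metacharacters, and they always appear in the same relative order within each string, since the pattern matches them positionally.

```r
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"))

# Distinct words in order of first appearance (assumed to be the order
# in which they occur inside every string)
cols <- unique(unlist(strsplit(as.character(df$x), ",\\s*")))

# One optional capture group per word, joined by an optional ", " separator
pattern <- paste0("(", cols, ")*", collapse = "(?:, )*")
pattern
# "(one)*(?:, )*(two)*(?:, )*(three)*"
```

The resulting cols and pattern can then be passed to extract as into and regex, exactly as in the code above.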

【Comments】: