R: Splitting a string to different variables and assign 1 if string contains this word [duplicate]

【Question title】: R: Splitting a string to different variables and assign 1 if string contains this word [duplicate]
【Posted】: 2020-02-25 22:13:02
【Question description】:

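I need some help setting up code in R to solve a problem:

I want to give R some string data as input, where each string contains several words (phrases, tweets, whatever you like). The strings can also contain several " " or "," as separators.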

Sample input data (screenshot in the original post)

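I want R to create a variable for every unique word across all input strings, and to set it to 1 (or TRUE, or whatever) when a string contains that word.

So my desired output looks like this: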

Sample output (screenshot in the original post)

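The blank cells in the columns should contain 0; I left them out for readability.

To be honest, I am no expert in loops, and I suspect there is an easier solution using a package. I would be grateful for any support on this topic, as I have several different projects where a solution would save me a lot of time.

Edit: I want to keep the original ID and string for further processing.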

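【Question discussion】:

Can you explain how you would approach this in R? Are you familiar with data frames?
I am familiar with data frames in general, but not with how to apply a simple data frame here. I think there could be a two-step solution: first identify all unique values and make them columns, then fill the columns by testing whether each word is present in the string. But I was hoping for a ready-made solution that would also save run time.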

Tags: r string

【Solution 1】:

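First of all, for future posts please provide sample data in a reproducible, copy-and-paste-able format. Screenshots are not a good idea because we cannot easily extract data from an image. For more details, see how to provide a minimal reproducible example/attempt.

That aside, here is a tidyverse solution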

library(tidyverse)
df %>%
    separate_rows(Text, sep = " ") %>%
    mutate(n = 1) %>%
    pivot_wider(names_from = "Text", values_from = "n", values_fill = list(n = 0))
## A tibble: 5 x 6
#  ID      Peanut Butter Jelly Storm  Wind
#  <fct>    <dbl>  <dbl> <dbl> <dbl> <dbl>
#1 ID-0001      1      1     1     0     0
#2 ID-0002      1      0     0     0     0
#3 ID-0003      0      1     0     0     0
#4 ID-0004      0      0     0     1     0
#5 ID-0005      0      1     0     1     1

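Explanation: We use separate_rows to split the entries in Text on spaces, which reshapes the data into long format (one row per word); we then add a count column; finally, we reshape the data from long to wide with pivot_wider and fill missing values with 0.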
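The question's edit asks to keep the original string, which the pipeline above drops because Text itself is split. A minimal sketch of one way around that, assuming tidyr >= 1.0.0 and dplyr are installed: split a copy of the column instead. The column name Word is only an illustrative choice and does not appear in the original answer.

```r
library(dplyr)
library(tidyr)

# Same sample values as in the answer, built without factors
df <- data.frame(
  ID   = c("ID-0001", "ID-0002", "ID-0003", "ID-0004", "ID-0005"),
  Text = c("Peanut Butter Jelly", "Peanut", "Butter", "Storm", "Storm Wind Butter"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(Word = Text) %>%              # "Word" is a hypothetical helper column; Text stays intact
  separate_rows(Word, sep = " ") %>%   # long format: one row per (ID, Text, Word)
  mutate(n = 1) %>%                    # indicator value for every word that occurs
  pivot_wider(names_from = Word, values_from = n, values_fill = list(n = 0))

wide
```

Because ID and Text are not named in names_from or values_from, pivot_wider treats both as identifying columns and carries them through to the wide result.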

Or in base R using xtabs

df2 <- transform(df, Text = strsplit(as.character(Text), " "))
xtabs(n ~ ., data.frame(
    ID = with(df2, rep(ID, vapply(Text, length, 1L))),
    Text = unlist(df2$Text),
    n = 1))
#ID        Butter Jelly Peanut Storm Wind
#  ID-0001      1     1      1     0    0
#  ID-0002      0     0      1     0    0
#  ID-0003      1     0      0     0    0
#  ID-0004      0     0      0     1    0
#  ID-0005      1     0      0     1    1
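The question also mentions strings with several " " or "," separators, and xtabs counts occurrences rather than flagging presence. The following is a rough base R sketch of how both points might be handled. The sample values and the names long, flags and result are made up for illustration and are not part of the original answer.

```r
# Hypothetical data with mixed separators and a repeated word (not from the original post)
df <- data.frame(
  ID   = c("ID-0001", "ID-0002", "ID-0003"),
  Text = c("Peanut, Butter  Jelly", "Peanut", "Storm,Wind Storm"),
  stringsAsFactors = FALSE
)

# Split each string on runs of spaces and/or commas
words <- strsplit(df$Text, "[ ,]+")

# Long format: one row per (ID, word) pair
long <- data.frame(
  ID   = rep(df$ID, lengths(words)),
  Word = unlist(words),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$Word), ]  # drop empty tokens caused by a leading separator

# Cross-tabulate, then cap counts at 1 so a repeated word still yields a 0/1 flag
flags <- as.data.frame.matrix(xtabs(~ ID + Word, data = long))
flags[] <- lapply(flags, function(x) as.integer(x > 0))

# Re-attach the original ID and Text; rows of flags are looked up by ID
result <- cbind(df, flags[df$ID, , drop = FALSE])
rownames(result) <- NULL
result
```

Looking rows up by ID assumes every ID is unique and has at least one word; an ID whose string is empty would come back as a row of NA.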

Sample data

df <- read.table(text =
"ID Text
ID-0001   'Peanut Butter Jelly'
ID-0002   Peanut
ID-0003   Butter
ID-0004   Storm
ID-0005   'Storm Wind Butter'", header = TRUE)

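【Discussion】:

The xtabs solution worked best for me, as I got an error message when using "pivot_wider".
Hi @AMWiedl; you probably need to update tidyr; pivot_wider was introduced in tidyr_1.0.0 in September 2019 to replace spread (in the same way that pivot_longer replaces gather).

【Solution 2】:

In base R, the two-step solution you had in mind looks like this: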

# Extract all words, keep only unique words, sort in alphabetic order:
all_words <- sort(unique(unlist(strsplit(df$strings, "\\W"))))

# Fill columns with 1 or 0 depending on whether the word is present in each string
cbind(df, sapply(all_words, function(x) 1 * grepl(x, df$strings)))
#>       ID             strings Butter Jelly Peanut Storm Wind
#> 1 ID0001 Peanut Butter Jelly      1     1      1     0    0
#> 2 ID0002              Peanut      0     0      1     0    0
#> 3 ID0003              Butter      1     0      0     0    0
#> 4 ID0004               Storm      0     0      0     1    0
#> 5 ID0005   Storm Wind Butter      1     0      0     1    1

Data used:

df <- structure(list(ID = c("ID0001", "ID0002", "ID0003", "ID0004", 
      "ID0005"), strings = c("Peanut Butter Jelly", "Peanut", "Butter", 
      "Storm", "Storm Wind Butter")), class = "data.frame", row.names = c(NA, -5L))

Created on 2020-02-25 by the reprex package (v0.3.0)

【Discussion】:
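Two caveats about the two-step approach may be worth keeping in mind. Splitting on "\\W" yields empty strings when separators are adjacent (for example ", "), and grepl matches substrings, so a word such as "Pea" would also flag "Peanut". A possible variant, sketched here on made-up data, splits on "\\W+" and compares whole tokens instead. The names tokens, flags and result are illustrative only.

```r
# Hypothetical data: adjacent separators, and "Pea" is a prefix of "Peanut"
df <- data.frame(
  ID      = c("ID0001", "ID0002", "ID0003"),
  strings = c("Peanut, Butter", "Pea", "Butter  Jelly"),
  stringsAsFactors = FALSE
)

# Step 1: tokenise on runs of non-word characters and collect the unique words
tokens    <- strsplit(df$strings, "\\W+")
all_words <- sort(unique(unlist(tokens)))
all_words <- all_words[nzchar(all_words)]  # guard against an empty token from a leading separator

# Step 2: one column per word, 1 if the word is among the string's tokens, else 0
flags <- sapply(all_words, function(w) {
  vapply(tokens, function(tk) as.integer(w %in% tk), 1L)
})

result <- cbind(df, flags)
result
```

Comparing tokens with %in% also sidesteps regex metacharacters in the words themselves. Note that sapply only returns a matrix here because df has more than one row.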