Replace words/phrases within longer strings if they are found in lookup table

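【Question title】: Replace words/phrases within longer strings if they are found in lookup table
【Posted】: 2020-07-04 10:11:05
【Question description】:

I have a data frame of sentences and a data frame of keywords and their synonyms. I want to go through each row of sentences and replace any synonyms found with the appropriate keyword. I have been struggling with this for the past few days without much luck, so any advice you can offer would be greatly appreciated!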

Sample data:

sentences <- data.frame( ID = c( "1", "2", "3", "4"),
                         text = c("the kitten in the hat",
                                  "a dog with a bone",
                                  "this is a category",
                                  "their cat has no hat"),
                         stringsAsFactors=FALSE)

lookup <- data.frame( key = c("cat", "a", "has"),
                       synonym = c("kitten", "the", "with"),
                       stringsAsFactors=FALSE)

I would like to get the data back as a data frame, just like the original "sentences", only with the synonyms replaced. For example:

ID        text
1        a cat in a hat
2        a dog has a bone
3        this is a category
4        their cat has no hat

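The actual data consists of 2016 sentences, each between 200 and 500 words long. The lookup table contains about 200,000 rows of words and phrases. I have figured out how to replace individual words and phrases easily, but I can't work out how to do it using a lookup table.

One more caveat that is causing me grief: I need to match exact words/phrases, including special characters. For example, "adison's disease" should match "adison's disease" but not "adisons disease". "cotton-roll" should match "cotton-roll", but it should not match "cottonroll" or "cotton roll".

I am using R version 3.6.2 (2019-12-12), Platform: x86_64-w64-mingw32/x64 (64-bit), Running under: Windows 10 x64 (build 18362)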

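【Question discussion】:

Tags: r text replace

【Solution 1】:

Basically the same as @akrun's answer, but I personally prefer the stringi version of stringr's str_replace_all, which doesn't do the weird named-vector thing. So here is an alternative: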

sentences$text <- stringi::stri_replace_all_regex(
  str = sentences$text,
  pattern = paste0("\\b", lookup$synonym, "\\b"),                # match synonyms as whole words
  replacement = lookup$key,                                      # replace each synonym with its key
  vectorize_all = FALSE,                                         # apply every pattern to every sentence, in order
  opts_regex = stringi::stri_opts_regex(case_insensitive = TRUE) # set additional options
)
sentences
#>   ID                 text
#> 1  1       a cat in a hat
#> 2  2     a dog has a bone
#> 3  3   this is a category
#> 4  4 their cat has no hat
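
With vectorize_all = FALSE the patterns are applied one after another to every sentence. A rough base R sketch of that sequential behaviour follows, replacing each synonym with its key as the question asks. The helper name replace_sequential and the two-row chain table are made up for illustration and are not part of the answer. The second call shows one consequence of sequential passes: the output of an earlier replacement can be matched again by a later pattern.

```r
# Sample data from the question.
sentences <- data.frame(ID = c("1", "2", "3", "4"),
                        text = c("the kitten in the hat",
                                 "a dog with a bone",
                                 "this is a category",
                                 "their cat has no hat"),
                        stringsAsFactors = FALSE)
lookup <- data.frame(key = c("cat", "a", "has"),
                     synonym = c("kitten", "the", "with"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: one gsub() pass per lookup row, in row order.
replace_sequential <- function(text, lookup) {
  for (i in seq_len(nrow(lookup))) {
    pattern <- paste0("\\b", lookup$synonym[i], "\\b")  # whole-word match
    text <- gsub(pattern, lookup$key[i], text,
                 perl = TRUE, ignore.case = TRUE)
  }
  text
}

result <- replace_sequential(sentences$text, lookup)
print(result)

# Chaining: "kitten" becomes "cat" in pass 1, then "feline" in pass 2.
chain <- data.frame(key = c("cat", "feline"),
                    synonym = c("kitten", "cat"),
                    stringsAsFactors = FALSE)
chained <- replace_sequential("the kitten", chain)
print(chained)
```

If chained replacements are not wanted, the lookup table would need to be checked for keys that also appear as synonyms before running the loop.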

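【Discussion】:

【Solution 2】:

Using gsubfn, create a translation list trans, then for each word (as defined by the regular expression, where \y denotes a word boundary and \w a word character) replace it using trans if it has a match among the names of trans: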

library(gsubfn)

trans <- with(lookup, setNames(as.list(key), synonym))
transform(sentences, text = gsubfn("\\y\\w+\\y", trans, text))

giving:

  ID                 text
1  1       a cat in a hat
2  2     a dog has a bone
3  3   this is a category
4  4 their cat has no hat
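
The same per-word lookup idea can be sketched in base R without gsubfn. This is only an illustration: the helper name replace_words is invented here, and like the \w+ pattern above it handles single-word synonyms only, so multi-word phrases or terms such as "cotton-roll" would need a different tokenisation. Each word is looked up once with match() instead of running one regex per lookup row, which may matter with a lookup table of roughly 200,000 rows.

```r
# Sample data from the question.
sentences <- data.frame(ID = c("1", "2", "3", "4"),
                        text = c("the kitten in the hat",
                                 "a dog with a bone",
                                 "this is a category",
                                 "their cat has no hat"),
                        stringsAsFactors = FALSE)
lookup <- data.frame(key = c("cat", "a", "has"),
                     synonym = c("kitten", "the", "with"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: swap whole words that appear in lookup$synonym.
replace_words <- function(text, lookup) {
  m <- gregexpr("\\b\\w+\\b", text, perl = TRUE)  # locate every word
  words <- regmatches(text, m)                     # one character vector per sentence
  regmatches(text, m) <- lapply(words, function(w) {
    idx <- match(w, lookup$synonym)                # NA where the word is not a synonym
    hit <- !is.na(idx)
    w[hit] <- lookup$key[idx[hit]]                 # replace only the matched words
    w
  })
  text
}

sentences$text <- replace_words(sentences$text, lookup)
print(sentences)
```

Because every word is replaced at most once, this version does not chain replacements.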

【Discussion】:

【Solution 3】:

Here is an option with str_replace_all

library(stringr)
str_replace_all(sentences$text, setNames(lookup$key,
        str_c("\\b(", lookup$synonym, ")\\b")))
#[1] "a cat in a hat"       "a dog has a bone"     "this is a category"   "their cat has no hat"

Or use it together with dplyr

library(dplyr)
sentences %>%
   mutate(text = str_replace_all(text, 
         setNames(lookup$key,
        str_c("\\b(", lookup$synonym, ")\\b"))))
#  ID                 text
#1  1       a cat in a hat
#2  2     a dog has a bone
#3  3   this is a category
#4  4 their cat has no hat

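【Discussion】:

This is great, thank you. With the dplyr option, how would I escape special characters? The lookup data is a list of biomedical phrases (genes, chemical symbols, etc.), some of which contain one or more of the following characters: '-+/!%&()*,:[]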
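
One possible way to handle such characters, shown here as a base R sketch rather than an answer from the thread, is to wrap each term in \Q...\E so the regex engine reads it literally, and to use lookarounds instead of \b so that terms beginning or ending with punctuation still match as whole units. The helper name replace_exact and the example keys are invented for illustration. The sketch assumes no term contains the sequence \E and no key contains a backslash.

```r
# Hypothetical lookup rows; the keys are invented, the synonyms follow the question.
lookup <- data.frame(
  key     = c("addison_disease", "cotton_roll", "ascorbate"),
  synonym = c("adison's disease", "cotton-roll", "(+)-ascorbic acid"),
  stringsAsFactors = FALSE
)
text <- c("adison's disease but not adisons disease",
          "a cotton-roll, not a cottonroll or cotton roll",
          "contains (+)-ascorbic acid")

# Hypothetical helper: literal, whole-term replacement, one lookup row at a time.
replace_exact <- function(text, synonym, key) {
  for (i in seq_along(synonym)) {
    # \Q...\E quotes the term; the lookarounds require that no word
    # character touches the term on either side.
    pattern <- paste0("(?<!\\w)\\Q", synonym[i], "\\E(?!\\w)")
    text <- gsub(pattern, key[i], text, perl = TRUE)
  }
  text
}

out <- replace_exact(text, lookup$synonym, lookup$key)
print(out)
```

Whether the same pattern can be passed unchanged to str_replace_all is an assumption to check against the stringr and ICU regex documentation.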