Remove both English and non-English names from a dataframe

【Question title】: Remove both English and Non-English names from a dataframe
【Posted】: 2021-08-27 03:30:18
【Question】:

I am working with hundreds of rows of messy data. A dummy example looks like this:

   foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
                  "Rawiri Herewini is my name", "Ajibade Smith is my man", NA)

I need to remove all names (English and non-English first names and surnames), so that my desired output would be:

[1] "is not here"         " is not a nice person" " is my name"  
[4] "is my man"           NA

However, using the textclean package I can only remove the English names, which leaves the non-English ones behind:

library(textclean)
textclean::replace_names(foo_data)

[1] "  is not here"     "Wiremu  is not a nice person"    "Rawiri Herewini is my name"  
[4] "Ajibade  is my man"           NA

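Any help would be greatly appreciated.

【Comments】:

Flip it around: what you want is to extract the English words. stackoverflow.com/questions/26715380/…
Hi @Roland, I followed stackoverflow.com/questions/26715380/…, but the result was not what we wanted.
The point was not for you to copy that answer. The point is that you need a dictionary, and that answer mentions one.

Tags: r string replace text-mining data-cleaning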

【Solution 1】:

You can do it like this:

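# Strip the English names that textclean recognises
s <- textclean::replace_names(foo_data)
# Remove every word hunspell flags as misspelled (here, the remaining non-English names)
trimws(gsub(sprintf('\\b(%s)\\b', 
      paste0(unlist(hunspell::hunspell(s)), collapse = '|')), '', s))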

[1] "is not here"          "is not a nice person" "is my name"           "is my man"            NA

【Comments】:

Thanks @Onyambu. When I run your code on a small dataset it looks fine, but on a large dataset it throws this error: Error in gsub(sprintf("\\b(%s)\\b", paste0(unlist(hunspell::hunspell(foo_data)), : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
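The error above comes from R's default (TRE) regex engine. A likely cause, though not confirmed in this thread, is that the alternation pattern grows with the number of flagged words until the engine can no longer compile it. One possible workaround is to avoid the single large pattern and compare whole tokens against the word list instead. The following is only a minimal base R sketch under that assumption: `remove_words` is a hypothetical helper, and `bad` is a hand-written stand-in for `unlist(hunspell::hunspell(s))`.

```r
# Hypothetical helper (not part of textclean or hunspell): drop every token of
# `x` that appears in `drop_words`, using exact matching instead of a regex.
remove_words <- function(x, drop_words) {
  drop_words <- unique(drop_words)
  vapply(x, function(txt) {
    if (is.na(txt)) return(NA_character_)        # keep missing values as NA
    tokens <- strsplit(txt, "\\s+")[[1]]          # split on whitespace
    tokens <- tokens[nzchar(tokens)]              # discard empty tokens
    # Compare tokens with leading/trailing punctuation stripped
    bare <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
    paste(tokens[!bare %in% drop_words], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in for the result of textclean::replace_names(foo_data)
s <- c("  is not here", "Wiremu  is not a nice person",
       "Rawiri Herewini is my name", "Ajibade  is my man", NA)

# Assumed values; in practice this would be unlist(hunspell::hunspell(s))
bad <- c("Wiremu", "Rawiri", "Herewini", "Ajibade")

result <- remove_words(s, bad)
print(result)
```

Because each row is rebuilt from the tokens that remain, the extra whitespace left behind by `replace_names` is collapsed as a side effect, so no separate `trimws` call is needed. Matching is case-sensitive and works only on whole tokens, so it may behave differently from the `\\b`-based pattern on hyphenated or otherwise unusual words.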