【Question title】: Remove both English and non-English names from a dataframe
【Posted】: 2021-08-27 03:30:18
【Question】:

I am working with hundreds of rows of junk data. A dummy example looks like this:

   foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
                  "Rawiri Herewini is my name", "Ajibade Smith is my man", NA)

I need to remove all names (English and non-English first names and surnames), so that my desired output is:

[1] "is not here"         " is not a nice person" " is my name"  
[4] "is my man"           NA  

However, with the textclean package I can only remove the English names; the non-English names are left behind:

library(textclean)
textclean::replace_names(foo_data)

[1] "  is not here"     "Wiremu  is not a nice person"    "Rawiri Herewini is my name"  
[4] "Ajibade  is my man"           NA

Any help would be greatly appreciated.

【Comments】:

Tags: r string replace text-mining data-cleaning


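【Solution 1】:

You can do this: first strip the English names with textclean, then remove every remaining word that hunspell does not recognise as a dictionary word: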

s <- textclean::replace_names(foo_data)
trimws(gsub(sprintf('\\b(%s)\\b', 
      paste0(unlist(hunspell::hunspell(s)), collapse = '|')), '', s))

[1] "is not here"          "is not a nice person" "is my name"           "is my man"            NA  

【Comments】:

  • Thanks @Onyambu. When I run your code on a small dataset it works fine, but on a large dataset it throws this error: Error in gsub(sprintf("\\b(%s)\\b", paste0(unlist(hunspell::hunspell(foo_data)), : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
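  • A possible workaround, offered as a sketch and not as a confirmed fix: the assertion comes from TRE, R's default regex engine, and it appears when the single alternation built from every unrecognised word gets very large. Assuming that is the cause, one option is to deduplicate the words, split them into smaller batches, and run each batch through gsub() with perl = TRUE so that TRE is not used at all. The helper name remove_words_chunked and the chunk_size default are made up for illustration and do not belong to any package.

```r
# Hypothetical helper (not from textclean or hunspell): remove every word in
# `words` from `x`, using several small patterns instead of one huge alternation.
remove_words_chunked <- function(x, words, chunk_size = 500) {
  # Drop NA/empty entries and duplicates to keep the patterns short
  words <- unique(words[!is.na(words) & nzchar(words)])
  if (length(words) == 0) return(trimws(x))

  # Split the words into batches of at most `chunk_size`
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))

  for (chunk in chunks) {
    # \Q...\E makes PCRE treat each word literally, so no manual escaping
    pattern <- sprintf("\\b(%s)\\b",
                       paste0("\\Q", chunk, "\\E", collapse = "|"))
    x <- gsub(pattern, "", x, perl = TRUE)  # perl = TRUE bypasses TRE
  }
  trimws(x)
}

# Small demonstration with a hard-coded word list
s <- c("Wiremu  is not a nice person", "Rawiri Herewini is my name",
       "Ajibade  is my man", NA)
remove_words_chunked(s, c("Wiremu", "Rawiri", "Herewini", "Ajibade"),
                     chunk_size = 2)

# With the packages from the answer, the call would look like this:
# s <- textclean::replace_names(foo_data)
# remove_words_chunked(s, unlist(hunspell::hunspell(s)))
```

    As with the original answer, this removes every word hunspell flags as unknown, so misspelled ordinary words are dropped along with the names.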