R: How to improve performance of grepl in apply function within dataframe

【Question title】: R: How to improve performance of grepl in apply function within dataframe
【Posted】: 2020-02-13 07:47:34
【Question】:

I have a data frame with the following columns:

country <- c("CA", "IN", "US")
text    <- c("paint red green", "painting red", "painting blue")
word    <- c("green, red, blue", "red", "red, blue")
df      <- data.frame(country, text, word)

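For each row, I want to find which of the words in the word column occur in the text column and put the ones that were found in a new column, separated by commas. So the new column should be: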

df$new_col   <- c("green,red","red","blue")

I am using the lines of code below, but they take a very long time to run and sometimes even crash.

setDT(df)[, new_col:= paste(df$word[unlist(lapply(df$word,function(x) grepl(x, df$text,
     ignore.case = T)))], collapse = ","), by = 1:nrow(df)]

Is there a way to change the code to make it more efficient?

Thank you very much!

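【Question comments】:

Your reproducible example is great, but the performance on your actual use case may depend heavily on the details of your data. For example, if the word column has relatively few distinct entries, you may want to apply unique. There may also be other data structures worth exploring: your current layout is nice for presenting in a document, but may be less useful when analyzing a large data set.
Although you have a nice reprex, I don't see how the setDT... code produces your desired result.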
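
The suggestion about unique in the first comment could look roughly like the sketch below. It is only an illustration, assuming that many rows share the same word entry so that splitting each distinct entry once pays off. The names split_words, uniq_entries and word_cache are hypothetical and do not come from the original post.

```r
# Example data from the question (stringsAsFactors = FALSE keeps plain strings).
df <- data.frame(
  country = c("CA", "IN", "US"),
  text    = c("paint red green", "painting red", "painting blue"),
  word    = c("green, red, blue", "red", "red, blue"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a comma- or space-separated string into words.
split_words <- function(s) strsplit(s, ",\\s*|\\s+")[[1]]

# Split each distinct 'word' entry only once, then look rows up by name.
uniq_entries <- unique(df$word)
word_cache   <- setNames(lapply(uniq_entries, split_words), uniq_entries)

# For each row, keep the listed words that also appear as whole words in the text.
df$new_col <- mapply(
  function(txt, w) paste(intersect(word_cache[[w]], split_words(txt)), collapse = ","),
  df$text, df$word,
  USE.NAMES = FALSE
)
df$new_col
```

Whether this helps depends on how repetitive the real word column is; with mostly distinct entries the cache adds little.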

Tags: r performance dataframe lapply grepl

【Solution 1】:

Try this:

mapply(function(x,y){paste(intersect(x,y),collapse=", ")},
       strsplit(as.character(df$text),"\\, | "),
       strsplit(as.character(df$word),"\\, | "))

[1] "red, green" "red"        "blue"
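
To see what the one-liner above does, here is a step-by-step sketch for a single row of the example data. The variable names text_words, word_words, found and result are only illustrative.

```r
# One row from the question's example.
text <- "paint red green"
word <- "green, red, blue"

# Split on ", " or a single space, using the same pattern as the answer.
text_words <- strsplit(text, "\\, | ")[[1]]
word_words <- strsplit(word, "\\, | ")[[1]]

# intersect() keeps the elements of its first argument that also occur in the
# second, so the result follows the order of the words in the text.
found  <- intersect(text_words, word_words)
result <- paste(found, collapse = ", ")
result
```

Because whole tokens are compared, "paint" in the word list would not match "painting" in the text, which differs from the substring matching that grepl performs.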

【Comments】:

【Solution 2】:

Another base R solution using mapply + grep + regmatches, i.e.,

df <- within(df, newcol <- mapply(function(x,y) toString(grep(x,y,value = TRUE)), 
                                  gsub("\\W+","|",word), 
                                  regmatches(text,gregexpr("\\w+",text))))

so that

> df
  country            text             word     newcol
1      CA paint red green green, red, blue red, green
2      IN    painting red              red        red
3      US   painting blue        red, blue       blue
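
One thing to keep in mind with this approach is that grep matches substrings, so a listed word also matches longer tokens that contain it. The sketch below uses made-up tokens to show the difference, and an anchored pattern as one possible way to restrict matches to whole tokens. Whether whole-token matching is wanted is an assumption, and the names tokens, pattern, anchored, partial and whole are hypothetical.

```r
# Made-up tokens, as they would come out of regmatches(text, gregexpr("\\w+", text)).
tokens  <- c("bored", "red", "greenish")
pattern <- gsub("\\W+", "|", "green, red, blue")  # same conversion as in the answer

# Unanchored: every token that contains one of the words is returned.
partial <- grep(pattern, tokens, value = TRUE)

# Anchored: only tokens that are exactly one of the words are returned.
anchored <- paste0("^(", pattern, ")$")
whole    <- grep(anchored, tokens, value = TRUE)

partial
whole
```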

【Comments】:

【Solution 3】:

library(tidyverse)
df %>%
  mutate(newcol = stringr::str_extract_all(text, gsub(", +", "|", word)))

  country            text             word     newcol
1      CA paint red green green, red, blue red, green
2      IN    painting red              red        red
3      US   painting blue        red, blue       blue

In this case, newcol is a list. To turn it into a string, we can do:

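df %>%
  mutate(newcol = text %>%
           str_extract_all(gsub(", +", "|", word)) %>%
           # collapse each row's matches separately; invoke(toString, .)
           # would pass the matches of all rows to a single toString() call
           map_chr(toString))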

With data.table, you can do:

# assumes df is a data.table (e.g. after setDT(df)) and stringr is loaded
df[, id := .I][, newcol := do.call(toString, str_extract_all(text, gsub(", +", "|", word))),
   by = id][, id := NULL][]
   country            text             word     newcol
1:      CA paint red green green, red, blue red, green
2:      IN    painting red              red        red
3:      US   painting blue        red, blue       blue

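【Comments】:

Hi, thanks. But when I run it on my data frame I get an error: Error in if (missing(width) || is.null(width) || width == 0) return(string) : missing value where TRUE/FALSE needed
@MatanRetzer Does the code work on the example you gave?
You are right, my data is a bit different from this example, mainly in capitalization.
You can convert everything to lowercase, or use ignore.case = TRUE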
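
The lowercase suggestion in the last comment could be sketched as follows in base R, reusing the splitting approach from Solution 1. The mixed-case data here is invented for illustration and is not the asker's real data. Note that the matches come back in lowercase.

```r
# Invented mixed-case variant of the example data.
df <- data.frame(
  text = c("Paint RED green", "painting Red"),
  word = c("Green, red, blue", "RED"),
  stringsAsFactors = FALSE
)

# Lowercase both columns before splitting, so that "RED" and "red" compare equal.
df$newcol <- mapply(
  function(x, y) paste(intersect(x, y), collapse = ", "),
  strsplit(tolower(df$text), "\\, | "),
  strsplit(tolower(df$word), "\\, | ")
)
df$newcol
```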