【Question title】: R grepl with dynamic search pattern
【Posted】: 2021-10-23 13:33:18
【Question description】:

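I have a data frame df with a column of different names. I also have a varying number of data frames, e.g. search_df and search_df1, that contain the search words I want to look for in the name column via regex. If a word is found, it should be written to a new column, e.g. df_final$which_word_search_df. If several words are found, I want to paste the results together. The result should look like df_final.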

# load packages
pacman::p_load(tidyverse)

# words I would like to search for
search_df <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))

# data frame which is the basis for my search
df <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"))

# how I expect the final result to look
df_final <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"),
                       which_word_search_df = c("apple", "apple; peach", "peach", "peach", NA, NA),
                       which_word_search_df1 = c(NA, NA, "peach", "peach", "banana", "banana"))

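This is my current solution, but as you can see it is not dynamic: I type in every search word by hand instead of looping over all search words automatically.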

df_trial <- df %>% 
  mutate(which_search_word_trial = ifelse(grepl("apple", name, ignore.case = T), "apple", ""),
         which_search_word_trial = ifelse(grepl("peach", name, ignore.case = T), 
                                          paste(which_search_word_trial, "peach", sep = ";"), which_search_word_trial)
  )

The example I shared is only a minimal one. In the real use case, df will have ~200k rows and my search_df will have ~1k rows.
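One way to make the manual attempt above dynamic is to loop over the search words and accumulate the hits per row. The following is only a sketch in base R: `find_words` and the `searches` list are hypothetical names introduced here, not part of the original post. It assumes the search words are literal strings (hence `fixed = TRUE`), which also means matching is case-sensitive, unlike the `ignore.case = T` used in the attempt above; lower-casing both sides with `tolower()` first would be one way around that.

```r
# Hypothetical helper (not from the original post): for every element of
# `names`, collect all search words it contains, joined by `sep`.
# Rows without any match stay NA.
find_words <- function(names, words, sep = "; ") {
  out <- rep(NA_character_, length(names))
  for (w in words) {
    # fixed = TRUE treats the word as a literal string, not as a regex
    hit <- grepl(w, names, fixed = TRUE)
    # first hit for a row: store the word; later hits: append it
    out[hit] <- ifelse(is.na(out[hit]), w, paste(out[hit], w, sep = sep))
  }
  out
}

search_df  <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))

# A named list of search data frames; one result column is added per entry.
searches <- list(search_df = search_df, search_df1 = search_df1)
for (nm in names(searches)) {
  df[[paste0("which_word_", nm)]] <- find_words(df$name, searches[[nm]]$search_words)
}
```

The loop makes one vectorised `grepl()` pass over all names per search word, so with ~1k words and ~200k rows it keeps only a single character vector per result column in memory. Note that "applepeach" contains "peach", so this sketch reports "peach" for that row in the search_df1 column.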

【Comments】:

    Tags: r dplyr grepl


    【Solution 1】:

    We can do the following.

    library(dplyr)
    library(stringr)
    
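    # Build one alternation pattern per search data frame (e.g. "apple|peach")
    # and extract every match; each new column is a list column.
    df %>%
      mutate(which_word_search_df = str_extract_all(name, str_c(search_df$search_words, collapse = '|')),
             which_word_search_df1 = str_extract_all(name, str_c(search_df1$search_words, collapse = '|')))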
    
    #         name which_word_search_df which_word_search_df1
    # 1   apple123                apple                      
    # 2 applepeach         apple, peach                 peach
    # 3  peachtime                peach                 peach
    # 4    peachab                peach                 peach
    # 5  bananarrr                                     banana
    # 6   bananaxy                                     banana
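`str_extract_all()` returns a list, so the new columns above are list columns, and rows without a match hold an empty character vector rather than NA. If a plain character column in the style of df_final is wanted, the list can be collapsed afterwards. This is a sketch, assuming the search words contain no regex metacharacters and do not overlap each other inside a name; the intermediate names `pattern` and `hits` are introduced here for illustration.

```r
library(stringr)

search_df <- data.frame(search_words = c("apple", "peach"))
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))

# One alternation pattern for all search words: "apple|peach"
pattern <- str_c(search_df$search_words, collapse = "|")

# List with one character vector of matches per name
hits <- str_extract_all(df$name, pattern)

# Collapse each vector into a single string; no match becomes NA.
# unique() avoids repeating a word that occurs twice in one name.
df$which_word_search_df <- vapply(
  hits,
  function(x) if (length(x) == 0) NA_character_ else paste(unique(x), collapse = "; "),
  character(1)
)
```

With this approach the words are listed in the order they appear in each name, not in the order of search_df.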
    

    【Discussion】:

      【Solution 2】:

      Using your df as input (not df_final), here is an "automatic" way that works by supplying the names of the search data frames:

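      # names of the search data frames to loop over
      n = c('search_df','search_df1')
      
      for(i in n){
        words = get(i)$search_words
        # for each search word, the row indices of df$name that contain it
        a = lapply(words, function(j){grep(j, df$name)})
        # long format: values = row index, ind = matching word
        a = stack(setNames(a, words))
        df[,paste0('which_word_',i)] = NA
        # a row that matches several words ends up with only one of them
        df[a$values,paste0('which_word_',i)] = as.character(a$ind)
      }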
      

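      The output is stored directly in df, but you can easily change that by copying df to final_df and using final_df in the two assignment lines at the end of the loop.

      Output (note that row 2 reads applebum here, not applepeach as in the question's df):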

             name which_word_search_df which_word_search_df1
      1  apple123                apple                  <NA>
      2  applebum                apple                  <NA>
      3 peachtime                peach                 peach
      4   peachab                peach                 peach
      5 bananarrr                 <NA>                banana
      6  bananaxy                 <NA>                banana
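The loop above keeps only one word per row, whereas the question asks for multiple hits to be pasted together. One possible adjustment, shown as a sketch for a single search data frame, is to join the words per row index before assigning. The names `hits`, `long` and `joined` are introduced here for illustration, and `fixed = TRUE` assumes the search words are literal strings.

```r
search_df <- data.frame(search_words = c("apple", "peach"))
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))

# Row indices of df$name that contain each search word
hits <- lapply(search_df$search_words, function(w) grep(w, df$name, fixed = TRUE))

# Long format: `values` = row index, `ind` = matching word
long <- stack(setNames(hits, search_df$search_words))

# Join all words that hit the same row with "; "
joined <- tapply(as.character(long$ind), long$values, paste, collapse = "; ")

# Rows without any hit stay NA
df$which_word_search_df <- NA_character_
df$which_word_search_df[as.integer(names(joined))] <- as.vector(joined)
```

Wrapping these steps in the same `for(i in n)` loop would extend it to several search data frames.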
      

      Let me know if it works for you.

      【Discussion】:
