【Question Title】: Search for the inclusion of specific text across multiple dataframes, and return those values in a new column (with multiple occurrences)
【Posted】: 2021-05-05 12:12:04
【Question Description】:

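I'm looking for help searching for multiple specific words from one data frame within a column (a body of text) of another data frame, and then extracting those values into a new column.

To explain further:

  • First, I have a data frame containing a large list of text summaries spanning 14 countries.
  • Second, I have a second data frame containing all the administrative level 2 (adm2) names, e.g. provinces, villages, etc.
  • I basically want to extract any mention of these specific adm2 province/village names from the large summaries, create a new column holding each of those words, and pivot longer.

Here is some sample data you can use to recreate my problem, with two data frames: (1) test_admin, the list of admin levels I want to search for, and (2) test_dataset$Summary, the column I want to search in. (You can ignore the Other_Variable values; in the real dataset these are populated with many values.)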

test_admin <- data.frame(adm1_name = c("Sindh"),
                   adm2_name = c("Central Karachi", "Dadu", "East Karachi", "Ghotki", "Sujawal", "Sukkur"))
                   
test_dataset <- data.frame(Summary = c("In Cox's Bazar, this and that happened.",
                                       "In Yangon, something else happened",
                                       "In Central Karachi, this happened",
                                       "In Sindh, this happened",
                                       "In Dadu AND East Karachi, this happened"),
                           Other_Variable_1 = 1:5,
                           Other_Variable_2 = 1:5)

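To make things more complicated, I would also like to be able to search for values from both columns of the test_admin data frame. For example, if the value "Sindh" from the adm1_name column appears in a summary, it would be really cool to also return all the results under adm2_name.

But if you can solve it at a more basic level (searching only one column), I would be happy with that too.

The output I'm looking for is similar to the data frame below, which also returns multiple rows where multiple values occur.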

                                   Summary Other_Variable_1 Other_Variable_2       Locations
1  In Cox's Bazar, this and that happened.                1                1            <NA>
2       In Yangon, something else happened                2                2            <NA>
3        In Central Karachi, this happened                3                3 Central Karachi
4                  In Sindh, this happened                4                4 Central Karachi
5                  In Sindh, this happened                4                4            Dadu
6                  In Sindh, this happened                4                4    East Karachi
7                  In Sindh, this happened                4                4          Ghotki
8                  In Sindh, this happened                4                4         Sujawal
9                  In Sindh, this happened                4                4          Sukkur
10 In Dadu AND East Karachi, this happened                5                5            Dadu
11 In Dadu AND East Karachi, this happened                5                5    East Karachi

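I tried a few mutate and grepl functions, but they all failed. The other examples I found seem to work only for exact values or a single search term. Thanks for your help!

#tidyverse solution preferred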

【Question Comments】:

    Tags: r search match tidyverse grepl


    【Solution 1】:

    Here is one way to do it:

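    library(tidyverse)
    
    # For each row of test_dataset, flag the admin rows whose adm1 or adm2 name occurs in Summary
    map_df(seq_len(nrow(test_dataset)), function(i) {
      inds <- str_detect(test_dataset$Summary[i], test_admin$adm1_name) | 
                 str_detect(test_dataset$Summary[i], test_admin$adm2_name)
      # One output row per matched adm2 name; a single NA row when nothing matches
      if(any(inds)) tibble(test_dataset[i, ], Locations = test_admin$adm2_name[inds])
        else tibble(test_dataset[i, ], Locations = NA_character_)
    })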
    
    #  Summary                                 Other_Variable_1 Other_Variable_2 Locations      
    #   <chr>                                              <int>            <int> <chr>          
    # 1 In Cox's Bazar, this and that happened.                1                1 NA             
    # 2 In Yangon, something else happened                     2                2 NA             
    # 3 In Central Karachi, this happened                      3                3 Central Karachi
    # 4 In Sindh, this happened                                4                4 Central Karachi
    # 5 In Sindh, this happened                                4                4 Dadu           
    # 6 In Sindh, this happened                                4                4 East Karachi   
    # 7 In Sindh, this happened                                4                4 Ghotki         
    # 8 In Sindh, this happened                                4                4 Sujawal        
    # 9 In Sindh, this happened                                4                4 Sukkur         
    #10 In Dadu AND East Karachi, this happened                5                5 Dadu           
    #11 In Dadu AND East Karachi, this happened                5                5 East Karachi   
    

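    For each value in Summary, we check whether it matches adm1_name or adm2_name. If any row of test_admin matches, we include the corresponding adm2_name values in the output as Locations; otherwise we return NA.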
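The same idea can also be sketched with mutate and unnest, which may be closer to what the question originally attempted. This is only an illustrative variant, not the answer author's method: the helper name find_locations is hypothetical, and the use of fixed() (so that names are matched literally rather than as regular expressions) is an assumption about what is wanted.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

# Sample data from the question
test_admin <- data.frame(
  adm1_name = "Sindh",
  adm2_name = c("Central Karachi", "Dadu", "East Karachi",
                "Ghotki", "Sujawal", "Sukkur")
)

test_dataset <- data.frame(
  Summary = c("In Cox's Bazar, this and that happened.",
              "In Yangon, something else happened",
              "In Central Karachi, this happened",
              "In Sindh, this happened",
              "In Dadu AND East Karachi, this happened"),
  Other_Variable_1 = 1:5,
  Other_Variable_2 = 1:5
)

# Hypothetical helper: returns the adm2 names whose adm1 or adm2 name
# occurs literally in a single piece of text (character(0) if none do)
find_locations <- function(text, admin) {
  hit <- str_detect(text, fixed(admin$adm1_name)) |
    str_detect(text, fixed(admin$adm2_name))
  admin$adm2_name[hit]
}

result <- test_dataset %>%
  # Build a list-column holding the matched names for each summary
  mutate(Locations = map(Summary, find_locations, admin = test_admin)) %>%
  # One row per matched name; keep_empty = TRUE keeps unmatched summaries as NA
  unnest(Locations, keep_empty = TRUE)

print(result)
```

Note that this is plain substring matching, so a short name would also match inside a longer word; if that matters for the real data, a word-boundary pattern may be a better fit than fixed().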

    【Comments】:

    • Great, thank you very much! It works perfectly.