【Question title】: Lookup string with substring key in R
【Posted】: 2017-11-10 22:14:18
【Question】:

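I have a long list of strings that share substrings. The list comes from event stream data, so it has tens of thousands of rows, but I'll simplify it for this example, using pets: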

+--------------------------------+
|              Pets              |
+--------------------------------+
| "one calico cat that's smart"  |
| "German Shepard dog"           |
| "A Chameleon that is a Lizard" |
| "a cute tabby cat"             |
| "the fish guppy"               |
| "Lizard Gecko"                 |
| "German Shepard dog"           |
| "Budgie Bird"                  |
| "Canary Bird in a coal mine"   |
| "a chihuahua dog"              |
+--------------------------------+
dput output: structure(list(Pets = structure(c(8L, 6L, 1L, 3L, 9L, 7L, 6L, 4L, 5L, 2L),.Label = c("A Chameleon that is a Lizard", "a chihuahua dog", "a cute tabby cat", "Budgie Bird", "Canary Bird in a coal mine", "German Shepard dog", "Lizard Gecko", "one calico cat that's smart", "the fish guppy"), class = "factor")), .Names = "Pets", row.names = c(NA,  -10L), class = "data.frame")

I want to add information based on the general type of pet (dog, cat, etc.), and I have a key table that holds this information:

+----------+----------------+
|   key    | classification |
+----------+----------------+
| "dog"    | "canine"       |
| "cat"    | "feline"       |
| "lizard" | "reptile"      |
| "bird"   | "avian"        |
| "fish"   | "fish"         |
+----------+----------------+
dput output: structure(list(key = structure(c(3L, 2L, 5L, 1L, 4L), .Label = c("bird", "cat", "dog", "fish", "lizard"), class = "factor"), classification = structure(c(2L, 3L, 5L, 1L, 4L), .Label = c("avian", "canine", "feline", "fish", "reptile"), class = "factor")), .Names = c("key", "classification"), row.names = c(NA, -5L), class = "data.frame")

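How can I use the "long strings" in the Pets table to find the relevant classification in the key table? The difficulty is that my lookup strings contain the substrings found in the key table, rather than the other way around.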

I started out with grepl like this:

key[grepl(pets[1,1], key[ , 2]), ]

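But this doesn't work, because "calico cat" is not in the key list, although "cat" is. The result I'm looking for is "feline".

(Note: I can't simply switch the two around, because in my own code this sits inside an apply function that loops over every row of the data. So instead of pets[1,1] it is pets[n,1], and in the end I plan to cbind the result onto the event stream data for further analysis.)

I'm having trouble figuring out how to do this. Any suggestions?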

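【Comments】:

  • It looks like the key is always the second word of each "long string". Is that a reasonable assumption?
  • Unfortunately, no. The strings have anywhere from a few to many different words. All I know is that the key word is in there somewhere.
  • Then you should provide a sample of long strings that doesn't fit that assumption. Also, please provide your data set by copying and pasting the output of dput(my_data) into your question instead of the current format.
  • But is it safe to assume that two different keys will never appear in the same "long string"?
  • Yes, that's right, two keys will not appear in the same long string.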
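
The assumption confirmed in the comments (exactly one key per long string) can be checked before relying on it. The following is a minimal base R sketch on a hypothetical five-row sample, not the real event stream data; the names `pets`, `keys`, `hits`, and `n_keys` are illustrative only.

```r
# Hypothetical sample mirroring the question's tables (assumption: plain character vectors)
pets <- c("one calico cat that's smart", "German Shepard dog",
          "A Chameleon that is a Lizard", "the fish guppy", "Budgie Bird")
keys <- c("dog", "cat", "lizard", "bird", "fish")

# Logical matrix: one row per string, one column per key,
# TRUE where the key occurs in the string (case-insensitive)
hits <- sapply(keys, function(k) grepl(k, pets, ignore.case = TRUE))

# Number of distinct keys found in each string
n_keys <- rowSums(hits)

# Strings that break the "exactly one key" assumption (zero keys or several)
violations <- pets[n_keys != 1]
violations
```

Any string listed in `violations` would need special handling in whichever lookup approach is chosen.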

Tags: r substring


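【Solution 1】:

You can do this kind of thing very easily with the fuzzyjoin package.

Here you can use regex_left_join, which works like an ordinary left join (e.g. dplyr::left_join), except that whether two rows match is decided by a regular-expression match, as with stringr::str_detect. The values in the key column are treated as patterns and searched for in the pets column.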

library(tibble)
library(fuzzyjoin)

pets <- tribble(
                            ~pets,
   "one calico cat that\'s smart",
             "German Shepard dog",
   "A Chameleon that is a Lizard",
               "a cute tabby cat",
                 "the fish guppy",
                   "Lizard Gecko",
             "German Shepard dog",
                    "Budgie Bird",
     "Canary Bird in a coal mine",
                "a chihuahua dog"
)

key <- tribble(
       ~key, ~classification,
      "dog",        "canine",
      "cat",        "feline",
   "lizard",       "reptile",
     "bird",         "avian",
     "fish",          "fish"
)

regex_left_join(pets, key, by = c("pets" = "key"), ignore_case = TRUE)

#> # A tibble: 10 x 3
#>                            pets    key classification
#>                           <chr>  <chr>          <chr>
#>  1  one calico cat that's smart    cat         feline
#>  2           German Shepard dog    dog         canine
#>  3 A Chameleon that is a Lizard lizard        reptile
#>  4             a cute tabby cat    cat         feline
#>  5               the fish guppy   fish           fish
#>  6                 Lizard Gecko lizard        reptile
#>  7           German Shepard dog    dog         canine
#>  8                  Budgie Bird   bird          avian
#>  9   Canary Bird in a coal mine   bird          avian
#> 10              a chihuahua dog    dog         canine
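
If adding a dependency is not an option, a similar result can be sketched in base R by testing each key against every string. This is only an approximation of the join above, under the stated assumption that each string contains at most one key: it keeps the first matching key per row, whereas a real left join would return one row per match. The data frames are a shortened, hypothetical sample, and `match_idx` and `result` are illustrative names.

```r
# Shortened sample data (assumption: character columns, not factors)
pets <- data.frame(pets = c("one calico cat that's smart", "German Shepard dog",
                            "Lizard Gecko", "a hamster"),
                   stringsAsFactors = FALSE)
key <- data.frame(key = c("dog", "cat", "lizard", "bird", "fish"),
                  classification = c("canine", "feline", "reptile", "avian", "fish"),
                  stringsAsFactors = FALSE)

# For every row of pets, the row index of the first key that matches (NA if none)
match_idx <- rep(NA_integer_, nrow(pets))
for (i in seq_len(nrow(key))) {
  hit <- grepl(key$key[i], pets$pets, ignore.case = TRUE)
  # Only fill rows that have not been matched by an earlier key
  match_idx[hit & is.na(match_idx)] <- i
}

# Attach the matched key and classification; unmatched rows get NA
result <- cbind(pets, key[match_idx, , drop = FALSE])
rownames(result) <- NULL
result
```

The loop runs once per key rather than once per string, so it stays vectorised over the long column, which matters when the event stream has tens of thousands of rows.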

    【Solution 2】:

    You can extract the key from each pet string with a regular expression and then look the keys up in the table.

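    # Combine all keys into one alternation: "dog|cat|lizard|bird|fish"
    Pattern = paste(KeyTable$key, collapse="|")
    # Match the whole string and capture the key as group 1
    Pattern = paste0(".*(", Pattern, ").*")
    # Replace each string by its captured key, lower-cased to match the key table
    Type = tolower(sub(Pattern, "\\1", Pets, ignore.case=TRUE))
    # Look up the classification of each extracted key
    KeyTable$classification[match(Type, KeyTable$key)]
     [1] "feline"  "canine"  "reptile" "feline"  "feline"  "canine"  "fish"   
     [8] "reptile" "canine"  "avian"   "avian"   "canine"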
    

    Data

    KeyTable = read.table(text="key classification 
    dog  canine
    cat  feline   
    lizard reptile
    bird  avian    
    fish  fish", 
    header=TRUE, stringsAsFactors=FALSE)
    
    Pets  = c("calico cat",
    "Shepard dog"  ,
    "Chameleon Lizard", 
    "calico cat",
    "tabby cat",
    "chihuahua dog",
    "guppy fish",
    "Gecko Lizard",
    "Shepard dog",
    "Budgie Bird",
    "Canary Bird" ,
    "chihuahua dog")
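
    With the sub() approach, a string that contains no key is returned unchanged, so the lookup yields NA only because the whole string fails to match any key. A variation that makes the no-match case explicit is sketched below with regexpr() and regmatches(). It is an alternative sketch, not the answer's method; the three-element `Pets` vector, including the keyless "pet rock", is hypothetical.

```r
# Key table as in the answer, plus a hypothetical string without any key
KeyTable <- data.frame(key = c("dog", "cat", "lizard", "bird", "fish"),
                       classification = c("canine", "feline", "reptile", "avian", "fish"),
                       stringsAsFactors = FALSE)
Pets <- c("calico cat", "Gecko Lizard", "pet rock")

# One alternation pattern covering every key
Pattern <- paste(KeyTable$key, collapse = "|")

# Position of the first (leftmost) key in each string, -1 when there is none
m <- regexpr(Pattern, Pets, ignore.case = TRUE)
has_key <- as.integer(m) > 0

# Extract the matched key where there is one; leave NA otherwise
Type <- rep(NA_character_, length(Pets))
Type[has_key] <- tolower(regmatches(Pets, m))

# Look up the classification; strings without a key stay NA
Result <- KeyTable$classification[match(Type, KeyTable$key)]
Result
```

    Note that regexpr() reports the leftmost key, which only makes a difference if a string ever contains more than one key.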
    

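      【Solution 3】:

      Here is another approach, using hashmap. Each string is split into one row per word, every word is looked up in the hash table, and only the words that are keys are kept: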

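      library(dplyr)    # %>%, mutate, select, bind_cols
      library(tidyr)    # separate_rows
      library(hashmap)
      
      # Map each key to its classification
      hash_table = hashmap(Lookup$key, Lookup$classification)
      
      Pets %>%
        separate_rows(Pets, sep = " ") %>%                   # one row per word
        mutate(class = hash_table[[tolower(Pets)]]) %>%      # NA for words that are not keys
        na.omit() %>%                                        # keep only the key words
        select(Key = Pets, class) %>%
        bind_cols(Pets, .)                                   # re-attach the original strings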
      

      Result:

      > hash_table
      ## (character) => (character)
      ##      [fish] => [fish]     
      ##      [bird] => [avian]    
      ##    [lizard] => [reptile]  
      ##       [cat] => [feline]   
      ##       [dog] => [canine] 
      
                                 Pets    Key   class
      1   one calico cat that's smart    cat  feline
      2            German Shepard dog    dog  canine
      3  A Chameleon that is a Lizard Lizard reptile
      4              a cute tabby cat    cat  feline
      5                the fish guppy   fish    fish
      6                  Lizard Gecko Lizard reptile
      7            German Shepard dog    dog  canine
      8                   Budgie Bird   Bird   avian
      9    Canary Bird in a coal mine   Bird   avian
      10              a chihuahua dog    dog  canine
      

      Data:

      Pets = structure(list(Pets = c("one calico cat that's smart", "German Shepard dog", 
                                     "A Chameleon that is a Lizard", "a cute tabby cat", "the fish guppy", 
                                     "Lizard Gecko", "German Shepard dog", "Budgie Bird", "Canary Bird in a coal mine", 
                                     "a chihuahua dog")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame")
      
      
      Lookup = structure(list(key = c("dog", "cat", "lizard", "bird", "fish"), 
                              classification = c("canine", "feline", "reptile", "avian", 
                            "fish")), class = "data.frame", .Names = c("key", "classification"
                            ), row.names = c(NA, -5L))
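
      The same word-by-word lookup can be sketched without extra packages, using a named character vector in place of the hash table. This is a base R variation under stated assumptions, not the answer's code: `lookup`, `classify`, and the four sample strings (including the keyless "a hamster") are hypothetical, and words are assumed to be separated by single spaces.

```r
# Named vector as a simple lookup table: names are keys, values are classifications
lookup <- c(dog = "canine", cat = "feline", lizard = "reptile",
            bird = "avian", fish = "fish")

pets <- c("one calico cat that's smart", "Lizard Gecko", "Budgie Bird", "a hamster")

# Classify one string: split into lower-cased words and return the
# classification of the first word that is a key, or NA if no word is a key
classify <- function(s) {
  words <- tolower(strsplit(s, " ", fixed = TRUE)[[1]])
  hit <- words[words %in% names(lookup)]
  if (length(hit) == 0) NA_character_ else unname(lookup[hit[1]])
}

classes <- vapply(pets, classify, character(1), USE.NAMES = FALSE)
classes
```

      Because whole words are compared, a word such as "cats" would not match the key "cat", unlike the regular-expression approaches, so this variant suits data where the key always appears as a separate word.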
      
