【Question Title】: Count Unique Word Matches in Column
【Posted】: 2022-06-21 07:39:50
【Question Description】:

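I would like to count, for each row, how many unique words in a column match a list of words. The count should go into a new column of the data frame so that every row has its own count.

For example: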

person_id <- c("001", "002", "003")
grocery_list <- c("apple orange orange kiwi", "eggs milk apple apple", "apple orange banana")

df <- data.frame(person_id, grocery_list)

fruit_list <- c("apple", "orange", "banana") 

The output would be:

person_id grocery_list                   fruit_count
001       apple orange orange kiwi       2
002       eggs milk apple apple          1
003       apple orange banana            3

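【Question Comments】:

  • Could grocery_list also contain orangeade? If so, should orange match it?
  • Are there also fruits longer than one word, like Coffee Bean, that need to be matched as a unit?
  • Yes, there are items of more than one word that need to match. For example, I want "blood orange" in fruit_list to be treated as different from "orange".
  • Do any of the answers meet the additional requirement of matching blood orange without also counting it as orange?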
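
The extra requirement raised in the comments above (multi-word items such as "blood orange" counted separately from "orange", and no match inside "orangeade") can be sketched in base R as follows. This is only one possible approach, not taken from any of the answers: the sample strings are hypothetical, and it assumes the entries of fruit_list contain no regex metacharacters.

```r
# Hypothetical data: some lists contain the multi-word item "blood orange"
grocery_list <- c("apple orange orange kiwi",
                  "blood orange apple orangeade",
                  "orange blood orange banana")
fruit_list <- c("apple", "orange", "banana", "blood orange")

# Put longer items first so that, where one item is a prefix of another,
# the longer one is tried first by the ordered alternation
ordered <- fruit_list[order(nchar(fruit_list), decreasing = TRUE)]
pattern <- paste0("\\b(", paste(ordered, collapse = "|"), ")\\b")

# Extract every whole-word hit per row, then count the distinct ones
hits <- regmatches(grocery_list, gregexpr(pattern, grocery_list, perl = TRUE))
fruit_count <- lengths(lapply(hits, unique))
fruit_count
```

Because the matches are extracted first and deduplicated afterwards, a row without any fruit yields an empty match vector and therefore a count of 0.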

Tags: r


【Solution 1】:

This should do it:

library(tidyverse)
person_id <- c("001", "002", "003")
grocery_list <- c("apple orange orange kiwi", "eggs milk apple apple", "apple orange banana")

df <- data.frame(person_id, grocery_list)

fruit_list <- c("apple", "orange", "banana") 


df %>% 
  rowwise() %>% 
  mutate(fruit_count = sum(str_detect(grocery_list, fruit_list)))
#> # A tibble: 3 × 3
#> # Rowwise: 
#>   person_id grocery_list             fruit_count
#>   <chr>     <chr>                          <int>
#> 1 001       apple orange orange kiwi           2
#> 2 002       eggs milk apple apple              1
#> 3 003       apple orange banana                3

Created on 2022-06-03 by the reprex package (v2.0.1)

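【Discussion】:

  • Depending on the size of the data set, the solution proposed by @onyambu may be faster. Row-wise operations can be time-consuming on large data sets.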
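
As a rough illustration of both points (avoiding rowwise() and the orangeade question from the comments), the sketch below computes one logical column per fruit for all rows at once, first with plain substring matching and then with word boundaries. The fourth row and the column names count_substring and count_word are made up for this example.

```r
# Hypothetical extra row containing "orangeade" to show the difference
df <- data.frame(
  person_id    = c("001", "002", "003", "004"),
  grocery_list = c("apple orange orange kiwi", "eggs milk apple apple",
                   "apple orange banana", "orangeade milk")
)
fruit_list <- c("apple", "orange", "banana")

# One logical column per fruit, computed for all rows at once (no rowwise())
substring_hits <- sapply(fruit_list, function(f) {
  grepl(f, df$grocery_list, fixed = TRUE)
})
word_hits <- sapply(fruit_list, function(f) {
  grepl(paste0("\\b", f, "\\b"), df$grocery_list, perl = TRUE)
})

df$count_substring <- rowSums(substring_hits)  # "orangeade" counts as orange
df$count_word      <- rowSums(word_hits)       # "orangeade" does not
df
```

Which of the two behaviours is wanted depends on the data; the substring version corresponds to what str_detect and grepl do with a bare word as the pattern.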
【Solution 2】:

You could do this:

df["fruit_count"] = sapply(df$grocery_list, \(s) sum(fruit_list %in% strsplit(s," ")[[1]]))

Output:

  person_id             grocery_list fruit_count
1       001 apple orange orange kiwi           2
2       002    eggs milk apple apple           1
3       003      apple orange banana           3

【Discussion】:
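
To make the behaviour of the strsplit / %in% approach concrete, here is a small worked example on a single hypothetical string. It suggests that exact token matching ignores duplicates and does not match "orangeade", but also that a multi-word entry such as "blood orange" can never match, because the string is split on spaces first.

```r
fruit_list <- c("apple", "orange", "banana", "blood orange")

# Split one (hypothetical) grocery string into its space-separated tokens
tokens <- strsplit("blood orange orangeade apple apple", " ")[[1]]
tokens

# For each fruit: is it exactly equal to one of the tokens?
matched <- fruit_list %in% tokens
names(matched) <- fruit_list
matched

# Each fruit is tested once, so repeated tokens cannot inflate the count
sum(matched)
```

Here "orange" is counted because the token "orange" (from "blood orange") is present, while "blood orange" itself is not, which is the opposite of what the question's comments ask for.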

    【Solution 3】:

    In base R, you would use a Vectorized version of grepl and then rowSums:

    df$fruit_count <- rowSums(Vectorize(grepl, 'pattern')(fruit_list, df$grocery_list))
    df
      person_id             grocery_list fruit_count
    1       001 apple orange orange kiwi           2
    2       002    eggs milk apple apple           1
    3       003      apple orange banana           3
    

    【Discussion】:

      【Solution 4】:

      Try

      transform(
          df,
          fruit_count = rowSums(sapply(fruit_list, grepl, grocery_list))
      )
      

      which gives

        person_id             grocery_list fruit_count
      1       001 apple orange orange kiwi           2
      2       002    eggs milk apple apple           1
      3       003      apple orange banana           3
      

      【Discussion】:

        【Solution 5】:

        With help from @akrun, here is a solution using str_count (see: Shortest way to remove duplicate words from string):

        library(dplyr)
        library(stringr)
        
        df %>% 
          rowwise() %>% 
          mutate(count = str_count(paste(unique(unlist(strsplit(grocery_list, " "))), collapse = " ") , paste(fruit_list, collapse = "|")))
        
        
          person_id grocery_list             count
          <chr>     <chr>                    <int>
        1 001       apple orange orange kiwi     2
        2 002       eggs milk apple apple        1
        3 003       apple orange banana          3
        

        【Discussion】:

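          【Solution 6】:

          You can take the lengths of the gregexpr hits. (?!.*\\b\\1\\b) is a negative lookahead that checks that the word just captured by apple|orange|banana does not occur again later in the string, so only the last occurrence of each fruit is counted.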

          df$fruit_count <- lengths(gregexpr(paste0("\\b(", paste(fruit_list
           , collapse="|"), ")\\b\\s*(?!.*\\b\\1\\b)"), df$grocery_list, perl=TRUE))
          
          df
          #  person_id             grocery_list fruit_count
          #1       001 apple orange orange kiwi           2
          #2       002    eggs milk apple apple           1
          #3       003      apple orange banana           3
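
One caveat worth checking before relying on lengths(gregexpr(...)): when a string has no match at all, gregexpr returns a single -1 for it, which lengths still counts as one element. The sketch below, using a hypothetical second string without any fruit, shows the matches that are kept and a possible way to count only real match positions.

```r
fruit_list <- c("apple", "orange", "banana")
# Hypothetical input: the second string contains no fruit at all
x <- c("apple orange orange kiwi", "eggs milk")

pattern <- paste0("\\b(", paste(fruit_list, collapse = "|"),
                  ")\\b\\s*(?!.*\\b\\1\\b)")
m <- gregexpr(pattern, x, perl = TRUE)

# Which occurrences are kept? Only the last occurrence of each fruit.
regmatches(x, m)

# Counts the -1 "no match" marker as if it were a hit
naive_count <- lengths(m)
# Counts only real (positive) match positions
fixed_count <- vapply(m, function(p) sum(p > 0), 0L)

naive_count
fixed_count
```

On the three example rows from the question both counts agree, since every row contains at least one fruit.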
          

          A benchmark, just for fun!

          person_id <- c("001", "002", "003")
          grocery_list <- c("apple orange orange kiwi", "eggs milk apple apple", "apple orange banana")
          df <- data.frame(person_id, grocery_list)
          fruit_list <- c("apple", "orange", "banana") 
          
          library(magrittr)
          
          bench::mark(check = FALSE,
           DaveArmstrong = (\(df) {df %>% 
                         dplyr::rowwise() %>% 
                         dplyr::mutate(fruit_count = sum(stringr::str_detect(grocery_list, fruit_list)))})(df),
           onyambu = (\(df) {df$fruit_count <- rowSums(Vectorize(grepl, 'pattern')(fruit_list, df$grocery_list))
             df})(df),
           langtang = (\(df) {df["fruit_count"] = sapply(df$grocery_list, \(s) sum(fruit_list %in% strsplit(s," ")[[1]]))})(df),
           ThomasIsCoding = (\(df) {transform(
              df,
              fruit_count = rowSums(sapply(fruit_list, grepl, grocery_list))
              )})(df),
           TarJae = (\(df) {df %>% 
            dplyr::rowwise() %>% 
            dplyr::mutate(count = stringr::str_count(paste(unique(unlist(strsplit(grocery_list, " "))), collapse = " ") , paste(fruit_list, collapse = "|")))
          })(df),
           GKi = (\(df) {df$fruit_count <- lengths(gregexpr(paste0("\\b(", paste(fruit_list
           , collapse="|"), ")\\b\\s*(?!.*\\b\\1\\b)"), df$grocery_list, perl=TRUE))
          df})(df)
          )
          

          Result

            expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
            <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
          1 DaveArmstrong       2ms   2.08ms      417.     6.9KB     17.0   172     7
          2 onyambu         73.18µs  77.33µs    11668.        0B     25.1  5571    12
          3 langtang        44.89µs  48.59µs    19261.        0B     23.2  9146    11
          4 ThomasIsCoding 102.82µs 112.02µs     7055.        0B     18.7  3391     9
          5 TarJae           2.06ms   2.12ms      412.     6.9KB     17.2   192     8
          6 GKi             17.97µs  19.64µs    47069.    48.6KB     42.4  9991     9
          

          In this case, GKi is about 2 times faster than langtang, followed by onyambu and ThomasIsCoding. DaveArmstrong and TarJae are about 100 times slower than the fastest.
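
The benchmark above runs on only three rows, so fixed overhead probably dominates the timings. A sketch for repeating the comparison on larger synthetic data is given below, using only base R and system.time; the vocabulary, the seed and the row count are arbitrary assumptions, and the timings will differ from machine to machine.

```r
set.seed(1)  # hypothetical synthetic data, not from the original post
vocab <- c("apple", "orange", "banana", "kiwi", "eggs", "milk")
fruit_list <- c("apple", "orange", "banana")

# 10,000 grocery lists of four random words each
big <- data.frame(
  grocery_list = replicate(1e4, paste(sample(vocab, 4, replace = TRUE),
                                      collapse = " "))
)

# grepl-based count (as in ThomasIsCoding's answer)
t_grepl <- system.time(
  count_grepl <- rowSums(sapply(fruit_list, grepl, big$grocery_list))
)
# strsplit / %in% count (as in langtang's answer)
t_split <- system.time(
  count_split <- sapply(big$grocery_list,
                        function(s) sum(fruit_list %in% strsplit(s, " ")[[1]]),
                        USE.NAMES = FALSE)
)

# The two approaches should agree on this vocabulary (no word contains another)
stopifnot(all(count_grepl == count_split))
c(grepl = t_grepl[["elapsed"]], strsplit = t_split[["elapsed"]])
```

Checking that the results agree before comparing timings matters here, because the original benchmark was run with check = FALSE.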

          【Discussion】:
