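【Question title】: Count number of exactly matching words in a string
【Posted】: 2020-12-18 00:45:22
【Question】:

I have a tibble with an id column and a column that captures some text_entry that people typed.
Goal: compare each person's text_entry to a key and count the number of words typed perfectly.
For example, if my input is: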

library(tibble)  # provides tribble()

df <- tribble(~id, ~text_entry,
              1, "It was a Saturday night in December.",
              2, " It was a Saturday night",
              3, "It wuz a Sturday nite in",
              4, "IT WAS A SATURDAY",
              5, "was a Saturday"); df

key <- "It was a Saturday night in December."

then I'd like the following:

df2 <- tribble(~id, ~text_entry, ~words_correct, 
               1, "It was a Saturday night in December.", 7, # whole string perfect
               2, " It was a Saturday night", 5,             # first 5 words perfect
               3, "It wuz a Sturday nite in", 3,             # misspelled "was", "Saturday" and "night"
               4, "IT WAS A SATURDAY", 0,                    # case-sensitive
               5, "was a Saturday", 3); df2                  # ok to start several words into the key

I'd be quite happy with a stringr/stringi solution. tidyverse is always preferred, but I'm desperate for any solution.

Thanks so much in advance for your help and insights!

【Comments】:

    Tags: r string tidyverse stringr stringi


    【Solution 1】:

    You can extract the non-whitespace tokens and pass them to str_detect():

    library(tidyverse)
    
    df %>%
      mutate(words_correct = map_dbl(str_extract_all(text_entry, "[^\\s]+"),
                                     ~ sum(str_detect(key, .))))
    
    # # A tibble: 5 x 3
    #      id text_entry                             words_correct
    #   <dbl> <chr>                                          <dbl>
    # 1     1 "It was a Saturday night in December."             7
    # 2     2 " It was a Saturday night"                         5
    # 3     3 "It wuz a Sturday nite in"                         3
    # 4     4 "IT WAS A SATURDAY"                                0
    # 5     5 "was a Saturday"                                   3
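
To see why this works, it may help to trace a single row by hand. The sketch below assumes only stringr and the key from the question; the fragment "Satur" and the name key_words are made up for illustration. One thing it suggests is that str_detect() treats each token as a regular expression and searches for it anywhere in key, so a partial word would also be counted.

```r
library(stringr)

key <- "It was a Saturday night in December."

# Split one entry (row 3 of the question) into its non-whitespace tokens.
tokens <- str_extract_all("It wuz a Sturday nite in", "[^\\s]+")[[1]]

# Each token is used as a regex pattern and searched for anywhere in `key`.
hits <- str_detect(key, tokens)
hits       # TRUE FALSE TRUE FALSE FALSE TRUE
sum(hits)  # 3, the words_correct value for that row

# Caveat: matching is by substring, so a word fragment is detected as well.
str_detect(key, "Satur")  # TRUE

# Hypothetical stricter check: compare against the whole words of the key.
key_words <- str_split(key, "\\s+")[[1]]
"Satur" %in% key_words    # FALSE
```

For the five rows in the question both checks give the same counts; they would only differ for entries that contain fragments of key words or regex metacharacters.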
    

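    【Comments】:

    • This is a perfect solution. Thank you @Darren Tsai!
    【Solution 2】:

    One way is to split each string on whitespace and then count the words it has in common with key.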

    library(tidyverse)
    
    keywords <- strsplit(key, '\\s+')[[1]]
    
    df %>%
      mutate(text = str_split(text_entry, '\\s+'), 
             words_correct = map_dbl(text, ~sum(.x %in% keywords))) %>%
      select(-text)  # drop the helper list column so the result matches the output below
    
    # A tibble: 5 x 3
    #     id text_entry                             words_correct
    #  <dbl> <chr>                                          <dbl>
    #1     1 "It was a Saturday night in December."             7
    #2     2 " It was a Saturday night"                         5
    #3     3 "It wuz a Sturday nite in"                         3
    #4     4 "IT WAS A SATURDAY"                                0
    #5     5 "was a Saturday"                                   3
    

    We can also do this in base R:

    df$words_correct <- sapply(strsplit(df$text_entry, '\\s+'), 
                               function(x) sum(x %in% keywords))
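
As a rough illustration of what the split-and-match step does for one entry, here is a minimal base R sketch. It assumes the key from the question; the variable names entry and matched are made up for this example.

```r
key <- "It was a Saturday night in December."
keywords <- strsplit(key, "\\s+")[[1]]

# Row 2 of the question has a leading space, so the split yields an empty
# first element; it is not a keyword and therefore does not affect the count.
entry <- strsplit(" It was a Saturday night", "\\s+")[[1]]
entry[1] == ""  # TRUE

# %in% tests each element for an exact, case-sensitive match in `keywords`.
matched <- entry %in% keywords
sum(matched)    # 5

# Exact matching also covers punctuation and case:
"December" %in% keywords  # FALSE, the keyword is "December." with the period
"IT" %in% keywords        # FALSE
```

Because %in% compares whole elements, this variant does not count word fragments, but like the first solution it ignores word order and position within the key.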
    
