【Question title】:Match products in a list in both their singular and plural forms in R
【Posted】:2021-04-22 23:01:47
【Question】:

I need to classify this list of products:

product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow',
'chicken breast','noodles','salad','chicken salad with egg','chicken breasts','eggs from chickens'))

based on the words contained in each element of this vector:

product_to_match<-c('cow meat','deer meat','cow milk','chicken breast','chicken egg salad','anana')

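I need to match a product (for example: chicken egg), bearing in mind that whoever wrote the product may have used the singular or the plural, and may have put the words in a different order. So they might write 'chickens egg', 'chicken eggs', 'egg chicken', and so on.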

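As I see it, given some product such as 'chicken egg', I need:

  1. An 'AND' condition: the listed product must contain all N words. In this case it must contain both 'egg' and 'chicken'.
  2. An 'OR' condition: each word may appear in singular or plural form, e.g. 'egg' or 'eggs'.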
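
To make the two conditions concrete, here is a minimal base R sketch. The helper name `matches_all_words` is hypothetical, and the plural rule (an optional trailing "s" or "es") is an assumption that will miss irregular plurals. It also assumes the target words contain no regex metacharacters.

```r
# Hypothetical helper: TRUE when every word of `target` occurs in `product`
# as a whole word, optionally followed by "s" or "es".
matches_all_words <- function(product, target) {
  words <- strsplit(target, "\\s+")[[1]]
  # OR condition: each word may be singular or (regular) plural
  patterns <- paste0("\\b", words, "(s|es)?\\b")
  # AND condition: every word must be found somewhere in the product
  all(vapply(patterns, grepl, logical(1), x = product, perl = TRUE))
}

matches_all_words("eggs from chickens", "chicken egg")
# [1] TRUE
matches_all_words("banana from ecuador 1 unit", "anana")
# [1] FALSE
```

The `\\b` word boundaries are what keep 'anana' from matching inside 'banana'. Word order does not matter because each word is tested separately.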

I would like to label each row of product_list, like this:

product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow','chicken breast',
'noodles','salad','chicken salad with egg','chicken breasts','eggs from chickens'),class=c(NA,'cow meat','chicken breast',
NA,NA,'chicken egg salad','chicken breast','chicken egg'))

Note that 'anana' must not match 'banana': the characters occur inside the string, but not as a whole word.

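I was able to get some results by splitting the products into words and then checking which ones match, but I am having trouble with the plurals. I know regular expressions could be useful here, but I don't know how to apply them.

Thanks.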

【Comments】:

    Tags: r


    【Answer 1】:

    You could try outer + strsplit + grepl, as shown below:

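    # q[i, j] is TRUE when every word of product_to_match[i] begins some word of product j
    q <- outer(
      strsplit(product_to_match, "\\s+"),
      strsplit(product_list$product, "\\s+"),
      FUN = Vectorize(function(a, b) all(sapply(a, function(x) any(grepl(paste0("\\b", x), b)))))
    )
    # row index of the matching candidate per product (assumes at most one match), NA if none
    product_list$class <- product_to_match[replace(colSums(q * row(q)), colSums(q) == 0, NA)]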
    

    which gives

    > product_list
                          product             class
    1  banana from ecuador 1 unit              <NA>
    2 argentinian meat (1 kg) cow          cow meat
    3              chicken breast    chicken breast
    4                     noodles              <NA>
    5                       salad              <NA>
    6      chicken salad with egg chicken egg salad
    7             chicken breasts    chicken breast
    8          eggs from chickens              <NA>
    
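
One caveat worth sketching: `colSums(q * row(q))` adds up the row indices, so it only yields a valid index when at most one candidate matches a product. The example below uses a made-up candidate vector (the extra entry 'chicken salad' is an assumption, not part of the question) to show the effect, and picks the first matching candidate instead.

```r
# Hypothetical data: two candidates both match "chicken salad with egg"
product_to_match <- c("cow meat", "chicken breast", "chicken egg salad", "chicken salad")
products <- c("chicken salad with egg", "noodles")

# Same match matrix as in the answer above
q <- outer(
  strsplit(product_to_match, "\\s+"),
  strsplit(products, "\\s+"),
  FUN = Vectorize(function(a, b) all(sapply(a, function(x) any(grepl(paste0("\\b", x), b)))))
)

# Rows 3 and 4 both match the first product, so the summed index is 3 + 4
colSums(q * row(q))[1]
# [1] 7

# Index of the first matching candidate per product, NA when nothing matches
first_hit <- apply(q, 2, function(col) which(col)[1])
product_to_match[first_hit]
# [1] "chicken egg salad" NA
```

Taking the first match makes the order of `product_to_match` significant, so more specific candidates should probably come first.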


      【Answer 2】:

      If you want to avoid regular expressions, you could try stemming. It also lets you catch some irregular plural forms.

      library(dplyr)
      library(tidytext)
      library(SnowballC)
      
      wordStem(c("lady", "ladies"))
      # [1] "ladi" "ladi"
      
      product_list <- tibble(product = c('banana from ecuador 1 unit', 
          'argentinian meat (1 kg) cow', 'chicken breast',
          'noodles','salad','chicken salad with egg',
          'chicken breasts','eggs from chickens'), 
          id = seq_along(product))
      product_to_match <- tibble(product_group = c('cow meat','deer meat',
              'cow milk','chicken breast','chicken egg salad','anana'), 
          pid = seq_along(product_group))
      

      The tidytext package provides a framework for converting documents into tokens/words.

      # convert to tidy word lists
      tidy_products <- product_list %>% 
          unnest_tokens(output = word, input = product)
      tidy_products
      # # A tibble: 23 x 2
      #       id word       
      #    <int> <chr>      
      #  1     1 banana     
      #  2     1 from       
      #  3     1 ecuador    
      #  4     1 1          
      #  5     1 unit       
      #  6     2 argentinian
      #  7     2 meat       
      #  8     2 1          
      #  9     2 kg         
      # 10     2 cow        
      # # … with 13 more rows
      

      SnowballC::wordStem performs the stemming.

      tidy_products <- mutate(tidy_products, 
          word = wordStem(word))
      tail(tidy_products)
      # # A tibble: 6 x 2
      #      id word   
      #   <int> <chr>  
      # 1     6 egg    
      # 2     7 chicken
      # 3     7 breast 
      # 4     8 egg    
      # 5     8 from   
      # 6     8 chicken
      
      # same processing for products
      tidy_match <- product_to_match %>% 
          unnest_tokens(output = word, input = product_group) %>% 
          mutate(word = wordStem(word))
      

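      From here you can test whole strings for equality, for example with the matching operator %in%. With this approach we check whether all the words occur, in any order. Note that one product may be contained in another, e.g. beef and beef burger, so the order of the match data frame matters.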

      # choose first match
      # matchdf must have columns word and pid
      first_product_id <- function(string, matchdf) {
          out <- NA
          for (pid in split(matchdf, f = matchdf$pid)) {
              is_in <- pid$word %in% string
              if (length(is_in) == 0) { is_in <- FALSE }
              if (all(is_in)) {
                  out <- pid$pid[1]
                  break
              }
          }
          out
      }
      first_product_id(string = tidy_products$word[tidy_products$id == 3], 
          matchdf = tidy_match)
      # [1] 4
      
      # look up table where words are in 
      lut <- tidy_products %>% 
          group_by(id) %>% 
          summarise(
              pid = first_product_id(string = word, matchdf = tidy_match))
      
      product_list %>% 
          left_join(lut, by = "id") %>% 
          left_join(product_to_match, by = "pid")
      # # A tibble: 8 x 4
      # product                          id   pid product_group    
      # <chr>                         <int> <int> <chr>            
      # 1 banana from ecuador 1 unit      1    NA NA               
      # 2 argentinian meat (1 kg) cow     2     1 cow meat         
      # 3 chicken breast                  3     4 chicken breast   
      # 4 noodles                         4    NA NA               
      # 5 salad                           5    NA NA               
      # 6 chicken salad with egg          6     5 chicken egg salad
      # 7 chicken breasts                 7     4 chicken breast   
      # 8 eggs from chickens              8    NA NA   
      
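
The point about ordering can be illustrated with a small base R sketch. The match table below is made up (stemmed words for 'beef' and 'beef burger'), and `most_specific_id` is a hypothetical variant that tries the product groups with the most words first. It is one possible way to handle the problem, not part of the answer above.

```r
# Hypothetical match table in the same shape as tidy_match (columns word, pid)
matchdf <- data.frame(
  pid  = c(1, 2, 2),
  word = c("beef", "beef", "burger")
)

# Same first-match logic as first_product_id() above, in compact form
first_product_id <- function(string, matchdf) {
  for (grp in split(matchdf, f = matchdf$pid)) {
    if (all(grp$word %in% string)) return(grp$pid[1])
  }
  NA
}

# Hypothetical variant: try groups with more words first
most_specific_id <- function(string, matchdf) {
  pids <- unique(matchdf$pid)
  n_words <- vapply(pids, function(p) sum(matchdf$pid == p), integer(1))
  for (p in pids[order(n_words, decreasing = TRUE)]) {
    if (all(matchdf$word[matchdf$pid == p] %in% string)) return(p)
  }
  NA
}

first_product_id(c("beef", "burger"), matchdf)  # 1: the shorter "beef" group wins
most_specific_id(c("beef", "burger"), matchdf)  # 2: "beef burger" is tried first
most_specific_id(c("beef", "stew"), matchdf)    # 1
```

Sorting by word count is only a heuristic. If two groups have the same number of words, their original order still decides.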
