Matching products in a list in both singular and plural forms in R

【Question title】: Match products in a list in both their singular and plural forms in R
【Posted】: 2021-04-22 23:01:47
【Question】:

I have to classify this list of products:

product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow',
'chicken breast','noodles','salad','chicken salad with egg','chicken breasts','eggs from chickens'))

based on the words contained in each element of this vector:

product_to_match<-c('cow meat','deer meat','cow milk','chicken breast','chicken egg salad','anana')

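I have to match a product (for example: chicken egg), keeping in mind that whoever wrote the product names may have used singular or plural forms and a different word order. So they may have written 'chickens egg', 'chicken eggs', 'egg chicken', and so on.

As I see it, given a product such as 'chicken egg', I need:

An 'AND' condition: the listed product must contain all N words. In this case it must contain both 'egg' and 'chicken'.
An 'OR' condition: each word may appear in its singular or plural form, e.g. 'egg' or 'eggs'.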
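
One way to express these two conditions is a minimal sketch like the following. It assumes the target is given in singular form and that plurals are regular (ending in "s" or "es"). The helper name `matches_all_words` is hypothetical.

```r
# Hypothetical helper: TRUE when `product` contains every word of `target`
# as a whole word, optionally followed by "s" or "es" (regular plurals only).
matches_all_words <- function(product, target) {
  words <- strsplit(target, "\\s+")[[1]]
  # 'OR' condition: singular or plural form of each word, bounded by \b
  patterns <- paste0("\\b", words, "(s|es)?\\b")
  # 'AND' condition: every pattern must be found somewhere in the product
  all(vapply(patterns, function(p) grepl(p, product, perl = TRUE), logical(1)))
}

matches_all_words("eggs from chickens", "chicken egg")     # TRUE
matches_all_words("egg chicken", "chicken egg")            # TRUE
matches_all_words("banana from ecuador 1 unit", "anana")   # FALSE
```

Irregular plurals (for example "ladies") are not covered by this pattern and would need a different treatment.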

I would like to label each row of product_list as follows:

product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow','chicken breast',
'noodles','salad','chicken salad with egg','chicken breasts','eggs from chickens'),class=c(NA,'cow meat','chicken breast',
NA,NA,'chicken egg salad','chicken breast','chicken egg'))

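Note that 'anana' does not match 'banana': the characters are contained in the string, but not as a whole word.

I managed to get some results by splitting the products into words and then checking which ones match, but I ran into trouble with plurals. I know regular expressions could be useful here, but I don't know how to apply them.

Thanks.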

【Comments】:

Tags: r

【Solution 1】:

Perhaps you could try outer + strsplit + grepl, as shown below:

q <- outer(
  strsplit(product_to_match, "\\s+"),
  strsplit(product_list$product, "\\s+"),
  FUN = Vectorize(function(a, b) all(sapply(a, function(x) any(grepl(paste0("\\b", x), b)))))
)
product_list$class <- product_to_match[replace(colSums(q * row(q)), colSums(q) == 0, NA)]

which gives

> product_list
                      product             class
1  banana from ecuador 1 unit              <NA>
2 argentinian meat (1 kg) cow          cow meat
3              chicken breast    chicken breast
4                     noodles              <NA>
5                       salad              <NA>
6      chicken salad with egg chicken egg salad
7             chicken breasts    chicken breast
8          eggs from chickens              <NA>
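
One caveat with the indexing step: `colSums(q * row(q))` adds up the row indices, so a product that satisfies more than one target ends up with a wrong or out-of-range index. Below is a minimal sketch of a variant that keeps the first matching target instead. The two-product data and the names `targets`, `products`, `summed` and `first_hit` are made up for the illustration.

```r
# Made-up data in which one product satisfies two targets at once
targets  <- c("cow meat", "cow milk")
products <- c("cow meat and milk", "noodles")

q <- outer(
  strsplit(targets, "\\s+"),
  strsplit(products, "\\s+"),
  FUN = Vectorize(function(a, b) all(sapply(a, function(x) any(grepl(paste0("\\b", x), b)))))
)

# Summed indexing: rows 1 and 2 both match product 1, so the index becomes 3
summed <- replace(colSums(q * row(q)), colSums(q) == 0, NA)
targets[summed]
# [1] NA NA

# Variant: take the first matching row per column, NA when nothing matches
first_hit <- apply(q, 2, function(col) which(col)[1])
targets[first_hit]
# [1] "cow meat" NA
```

With this variant the order of the targets decides which class wins when several match.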

【Comments】:

【Solution 2】:

If you want to avoid regular expressions, you could try stemming. It can also catch some irregular plural forms.

library(dplyr)
library(tidytext)
library(SnowballC)

wordStem(c("lady", "ladies"))
# [1] "ladi" "ladi"

product_list <- tibble(product = c('banana from ecuador 1 unit', 
    'argentinian meat (1 kg) cow', 'chicken breast',
    'noodles','salad','chicken salad with egg',
    'chicken breasts','eggs from chickens'), 
    id = seq_along(product))
product_to_match <- tibble(product_group = c('cow meat','deer meat',
        'cow milk','chicken breast','chicken egg salad','anana'), 
    pid = seq_along(product_group))

The tidytext package provides a framework for converting documents into tokens/words.

# convert to tidy word lists
tidy_products <- product_list %>% 
    unnest_tokens(output = word, input = product)
tidy_products
# # A tibble: 23 x 2
#       id word       
#    <int> <chr>      
#  1     1 banana     
#  2     1 from       
#  3     1 ecuador    
#  4     1 1          
#  5     1 unit       
#  6     2 argentinian
#  7     2 meat       
#  8     2 1          
#  9     2 kg         
# 10     2 cow        
# # … with 13 more rows

SnowballC::wordStem performs the stemming.

tidy_products <- mutate(tidy_products, 
    word = wordStem(word))
tail(tidy_products)
# # A tibble: 6 x 2
#      id word   
#   <int> <chr>  
# 1     6 egg    
# 2     7 chicken
# 3     7 breast 
# 4     8 egg    
# 5     8 from   
# 6     8 chicken

# same processing for products
tidy_match <- product_to_match %>% 
    unnest_tokens(output = word, input = product_group) %>% 
    mutate(word = wordStem(word))

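From here you can check for equality of whole strings, e.g. with the matching operator %in%. With this approach we check whether all words occur, in any order. Note that one product may be contained in another, e.g. beef and beef burger, so the order of the matching data frame matters.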
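
As a small base-R illustration of that set check, with word vectors made up for the example (they stand in for already stemmed tokens):

```r
# Stemmed words of one product and of two candidate targets (hypothetical values)
product_words <- c("egg", "from", "chicken")
target_a <- c("chicken", "egg")
target_b <- c("chicken", "egg", "salad")

# Every word of target_a is present, regardless of order
all(target_a %in% product_words)
# [1] TRUE

# "salad" is missing, so target_b does not match
all(target_b %in% product_words)
# [1] FALSE
```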

# choose first match
# matchdf must have columns word and pid
first_product_id <- function(string, matchdf) {
    out <- NA
    for (pid in split(matchdf, f = matchdf$pid)) {
        is_in <- pid$word %in% string
        if (length(is_in) == 0) { is_in <- FALSE }
        if (all(is_in)) {
            out <- pid$pid[1]
            break
        }
    }
    out
}
first_product_id(string = tidy_products$word[tidy_products$id == 3], 
    matchdf = tidy_match)
# [1] 4

# look up table where words are in 
lut <- tidy_products %>% 
    group_by(id) %>% 
    summarise(
        pid = first_product_id(string = word, matchdf = tidy_match))

product_list %>% 
    left_join(lut, by = "id") %>% 
    left_join(product_to_match, by = "pid")
# # A tibble: 8 x 4
# product                          id   pid product_group    
# <chr>                         <int> <int> <chr>            
# 1 banana from ecuador 1 unit      1    NA NA               
# 2 argentinian meat (1 kg) cow     2     1 cow meat         
# 3 chicken breast                  3     4 chicken breast   
# 4 noodles                         4    NA NA               
# 5 salad                           5    NA NA               
# 6 chicken salad with egg          6     5 chicken egg salad
# 7 chicken breasts                 7     4 chicken breast   
# 8 eggs from chickens              8    NA NA

【Comments】: