按匹配项的数量有效地对字符串匹配进行排名答案

【问题标题】：Efficiently ranking string matches by number of matched terms按匹配项的数量有效地对字符串匹配进行排名
【发布时间】：2020-09-18 15:48:19
【问题描述】：

总结：如何最有效地计算多个正则表达式匹配并按发生率对结果进行排名？是否应该使用语义方法来代替正则表达式？

用于说明的示例数据：

sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data", 
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data", 
"Number of market income recipients aged 15 years and over in private households - 25% sample data", 
"Number of employment income recipients aged 15 years and over in private households", 
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data", 
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data", 
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data", 
"Number of market income recipients aged 15 years and over in private households", 
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data", 
"Without employment income", "With after-tax income", "1 household maintainer", 
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)

还有一个包含多个词条的示例字符串查询

sample_query <- c("after tax income")

使用grepl 很容易检查字符串查询是否匹配。

sample_string[grepl(sample_query, sample_string)]

但显然这在这里行不通，因为没有完全匹配，因为实际术语是after-tax income。另一种方法是将搜索查询分成几部分并进行检查。

sample_string[grepl(paste(unlist(strsplit(sample_query, " +")), collapse = "|"), sample_string)]

这可行，但会返回太多结果，因为它匹配任何这些术语的任何实例。

[1] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[2] "Number of employment income recipients aged 15 years and over in private households"              
[3] "Number of market income recipients aged 15 years and over in private households"                  
[4] "Employment income (%)"                                                                            
[5] "Without employment income"                                                                        
[6] "With after-tax income"                                                                            
[7] "Spending 30% or more of income on shelter costs"

问题：如何根据单个匹配的数量有效地返回最接近的匹配？

应用一些答案here，并添加排序和匹配会导致怪物：

sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
                          collapse = "|"),
                    sample_string)][order(-lengths(regmatches(
                      sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
                                                collapse = "|"),
                                          sample_string)],
                      gregexpr(paste(unlist(
                        strsplit(sample_query, " +")
                      ),
                      collapse = "|"),
                      sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
                                                collapse = "|"),
                                          sample_string)])
                    )))]

返回我想要的 - 包含至少一个匹配项的所有字符串的列表，按匹配项数排序。

[1] "With after-tax income"                                                                            
[2] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[3] "Number of employment income recipients aged 15 years and over in private households"              
[4] "Number of market income recipients aged 15 years and over in private households"                  
[5] "Employment income (%)"                                                                            
[6] "Without employment income"                                                                        
[7] "Spending 30% or more of income on shelter costs"

稍微清理一下上面的怪物：

to_match <- paste(unlist(strsplit(sample_query, " +")),collapse = "|")
results <- sample_string[grepl(to_match,sample_string)]
results[order(-lengths(regmatches(results,gregexpr(to_match,results))))]

我可以忍受这个，但有没有办法让它更简洁？而且，我想知道这是否是解决此问题的最佳方法？

我知道stringr::str_count 和stringi::stri_count_regex。这是一个包，我试图避免添加额外的依赖项，但如果这些更有效，我可以改用它。

或者，替代字符串距离是更好的选择吗？检查数千个长字符串时会更好吗？

目的是帮助用户找到相关信息，也许有一些更面向语义的东西是有意义的。

【问题讨论】：

由于您的代码已经在运行，您可以尝试将您的问题迁移到Code Review。
@dshkol 请看我的回复，它是在基础 R 中完成的，它使用 levenshtein（编辑）距离并检索单个最相关的句子。

标签： r regex stringr grepl

【解决方案1】：

我确信这可以改进，但这是使用 Levenshtein Distance 后的一种方法：

# Desired query scalar: actual_query => character vector
actual_query <- "after tax income"

# Separate words in query: query_words => character vector: 
query_words <- unlist(strsplit(tolower(actual_query), "[^a-z]+"))

# Calculate n (scalar) for n-grams: word_count => integer vector
word_count <- length(query_words)

# Split each word preserving any non-character values: 
# sentence_word_split => character vector
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")

# Split original sentences into n-grams (relative to query length): 
# n_grams => list 
n_grams <- lapply(sentence_word_split, function(x){
              sapply(seq_along(x), function(i){
                paste(x[i:min(length(x), ((i+word_count)-1))], sep = " ", collapse = " ")
      }
    )
  }
)

# Rank ngrams based on the frequency of their occurence in sample string: 
# ordered_n_gram => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")

# Combine the query with each of its elements: revised_query => character vector 
revised_query <- c(actual_query, unlist(strsplit(actual_query, "\\s+")))

# Use levenshtein distance to determine similarity of revised_query 
# to the expressions in the ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(revised_query), 
                    function(i){
                       adist(revised_query[i], ordered_ngram_count)
                      }
                    )), gsub("\\s+", "_", revised_query))

# Example of applying function returning string element in sample string 
# with the minimum edit distance: sample_string element => stdout (console)
grep(grep(ordered_ngram_count[sapply(seq_along(ncol(lev_dist_df)),
                                     function(i) {
                                       which.min(lev_dist_df[, i])
                                     })], sample_string,
          value = TRUE),
     sample_string,
     value = TRUE)

上述更简洁的版本：

# Desired query scalar: sample_query => character vector
sample_query <- "after tax income"

# Separate words in query: query_words => character vector: 
query_words <- unlist(strsplit(tolower(sample_query), "[^a-z]+"))

# Calculate n (scalar) for n-grams: word_count => integer vector
word_count <- length(query_words)

# Split each word preserving any non-character values: 
# sentence_word_split => character vector
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")

# Split original sentences into n-grams (relative to query length): 
# n_grams => list 
n_grams <- lapply(sentence_word_split, function(x){
    sapply(seq_along(x), function(i){
      paste(x[i:min(length(x), ((i+word_count)-1))], sep = " ", collapse = " ")
      }
    )
  }
)

# Rank ngrams based on the frequency of their occurence in sample string: 
# ordered_n_gram => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")

# Use levenshtein distance to determine similarity of revised_query 
# to the expressions in the ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(sample_query), 
                                          function(i){
                                            adist(sample_query[i], ordered_ngram_count)
                                          })), gsub("\\s+", "_", sample_query))

# Example of applying function returning string element in sample string 
# with the minimum edit distance: sample_string element => stdout (console)
grep(grep(ordered_ngram_count[which.min(lev_dist_df[,1])], sample_string,
          value = TRUE), sample_string, value = TRUE)

数据：

sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data", 
                   "Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data", 
                   "Number of market income recipients aged 15 years and over in private households - 25% sample data", 
                   "Number of employment income recipients aged 15 years and over in private households", 
                   "Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data", 
                   "Total - Selected places of birth for the recent immigrant population in private households - 25% sample data", 
                   "Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data", 
                   "Number of market income recipients aged 15 years and over in private households", 
                   "Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data", 
                   "Without employment income", "With after-tax income", "1 household maintainer", 
                   "Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)

【讨论】：

谢谢，我喜欢这种方法。最后Error in ncol(lev_dist_mat) : object 'lev_dist_mat' not found 有一个错误，但我认为可以通过将lev_dist_mat 替换为您创建的lev_dist_df 对象来解决。我会投票，但会接受更长的时间，看看是否有任何其他使用不同方法的答案。谢谢！
@dshkol 好接你是对的，这意味着 lev_dist_df。不用担心，祝你的问题好运。如果没有其他解决方案符合您的规范，请接受该解决方案！