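[Posted]: 2020-09-18 15:48:19
[Question]:
Summary: What is the most efficient way to count matches of multiple regular expressions and rank the results by number of occurrences? Should a semantic approach be used instead of regular expressions?
Sample data for illustration: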
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)
And a sample query string containing multiple terms:
sample_query <- c("after tax income")
It is easy to check whether the query string matches using grepl.
sample_string[grepl(sample_query, sample_string)]
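But that clearly does not work here: there is no exact match, because the actual term is after-tax income. Another approach is to split the search query into its parts and check for each of them.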
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")), collapse = "|"), sample_string)]
This works, but it returns too many results, because it matches any occurrence of any of the terms.
[1] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[2] "Number of employment income recipients aged 15 years and over in private households"
[3] "Number of market income recipients aged 15 years and over in private households"
[4] "Employment income (%)"
[5] "Without employment income"
[6] "With after-tax income"
[7] "Spending 30% or more of income on shelter costs"
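To make the over-matching visible, it may help to tabulate which query terms occur in which strings. The following is a minimal base R sketch, not part of the original attempt; `example_strings` and `term_hits` are illustrative names, and the two strings are taken from the sample data above.

```r
# Illustrative only: tabulate which query terms occur in which strings.
terms <- strsplit("after tax income", " +")[[1]]
example_strings <- c("Employment income (%)", "With after-tax income")

# One column per term, one row per string; fixed = TRUE treats each term
# as a literal substring rather than a regular expression.
term_hits <- sapply(terms, grepl, x = example_strings, fixed = TRUE)
rownames(term_hits) <- example_strings

# Number of distinct query terms found in each string.
rowSums(term_hits)
```

Only the second string contains all three terms; the first is returned by the alternation pattern purely because it contains "income".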
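Question: how can I efficiently return the closest matches, ranked by the number of individual term matches?
Applying some of the answers here, and adding matching and ordering, leads to this monster: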
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)][order(-lengths(regmatches(
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)],
gregexpr(paste(unlist(
strsplit(sample_query, " +")
),
collapse = "|"),
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)])
)))]
This returns what I want: a list of all strings containing at least one match, ordered by the number of matches.
[1] "With after-tax income"
[2] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[3] "Number of employment income recipients aged 15 years and over in private households"
[4] "Number of market income recipients aged 15 years and over in private households"
[5] "Employment income (%)"
[6] "Without employment income"
[7] "Spending 30% or more of income on shelter costs"
Cleaning up the monster above a little:
to_match <- paste(unlist(strsplit(sample_query, " +")),collapse = "|")
results <- sample_string[grepl(to_match,sample_string)]
results[order(-lengths(regmatches(results,gregexpr(to_match,results))))]
I can live with this, but is there a way to make it more concise? And is this even the best way to approach the problem?
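One possible tightening, sketched here under assumptions rather than offered as a definitive answer, is to count the matches once and reuse the counts for both filtering and ordering. `rank_matches` is a hypothetical helper name, and the sketch assumes the query terms contain no regex metacharacters.

```r
# Hypothetical helper (not from the original post): count how often any query
# term occurs in each string, then keep and rank the strings with hits.
rank_matches <- function(query, strings) {
  # Assumes the terms contain no regex metacharacters.
  pattern <- paste(strsplit(query, " +")[[1]], collapse = "|")
  # gregexpr() reports -1 for no match; regmatches() turns that into
  # character(0), so lengths() yields 0 for strings without any hit.
  counts <- lengths(regmatches(strings, gregexpr(pattern, strings)))
  hits <- counts > 0
  strings[hits][order(-counts[hits])]
}

example_strings <- c("Employment income (%)",
                     "1 household maintainer",
                     "With after-tax income")
rank_matches("after tax income", example_strings)
```

This runs gregexpr over every string instead of calling grepl first, so the filter and the ranking come from the same pass.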
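I am aware of stringr::str_count and stringi::stri_count_regex. This is for a package and I am trying to avoid adding extra dependencies, but I could switch to them if they are more efficient.
Alternatively, would a string-distance approach be a better option? Would it perform better when checking thousands of long strings?
The goal is to help users find relevant information, so perhaps something more semantically oriented makes sense.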
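For reference, a string-distance variant can also stay within base R. The sketch below uses utils::adist with partial = TRUE, which scores the query against the best-matching substring of each candidate (the same approximate matching that agrep uses). The variable names are illustrative, and whether this scales to thousands of long strings is exactly the open question, so treat it as a starting point only.

```r
# Sketch only: rank candidates by approximate (generalized Levenshtein)
# distance between the query and the best-matching substring of each one.
query <- "after tax income"
candidates <- c("Employment income (%)",
                "1 household maintainer",
                "With after-tax income")

# adist() returns a 1 x length(candidates) matrix; drop() flattens it.
d <- drop(utils::adist(query, candidates, partial = TRUE, ignore.case = TRUE))

ranked <- candidates[order(d)]  # smallest distance first
best <- ranked[1]
best
```

Because the hyphen in "after-tax" costs only a single substitution, the hyphenated string ranks first without splitting the query into terms.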
[Comments]:
-
Since your code already works, you could try migrating your question to Code Review.
-
@dshkol Please see my answer: it is done in base R, uses Levenshtein (edit) distance, and retrieves the single most relevant sentence.