How to match a string with a tolerance of one character? (Answers)

【Question title】: How to match a string with a tolerance of one character?
【Posted】: 2016-06-01 03:35:11
【Question】:

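I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using only two disambiguated locations: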

agrepl('Au', c("Austin, TX", "Houston, TX"), 
max.distance =  .000000001, 
ignore.case = T, fixed = T)
[1] TRUE TRUE

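The help page says max.distance is

Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost

I am not sure about the mathematical meaning of the Levenshtein distance; my understanding is that the smaller the distance, the stricter the tolerance for mismatches against my vector of disambiguated strings.

So how would I adjust it to get two FALSE values? Basically I want a TRUE only when there is a difference of one character, as in: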

agrepl('Austn, TX', "Austin, TX", 
max.distance =  .000000001, ignore.case = T, fixed = T)
[1] TRUE

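【Comments】:

Use adist instead. The problem is that partial matching is taking place, so Au immediately matches *Au*stin. For example, adist(c("Au","Austn, TX"), c("Austin, TX", "Houston, TX"), partial=FALSE)
If you pass max.distance an integer, it is used as the number of allowed changes rather than as a proportion. You can also pass it a named set of limits for specific types of changes, e.g. agrepl('Au', c('Austin, TX', 'Houston, TX'), max.distance = c(costs = 1, insertions = 0, deletions = 1, substitutions = 0), ignore.case = T, fixed = T). See ?agrep for more.
@thelatemail Thanks. Should I write a function that returns the string with the smallest difference, or is there a specific way to retrieve the value, rather than the distance, based on a custom threshold? @alistaire That is what I thought, but if you check you will see that "Au" matches "Austin, TX", which is what I do not want.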
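
The adist suggestion and the follow-up question about returning the closest string can be combined in a small sketch. It assumes whole-string comparison with a threshold of one edit; closest_within and its max_edits argument are hypothetical names introduced for illustration, not part of base R.

```r
# Known-good location names and two messy inputs taken from the question.
clean <- c("Austin, TX", "Houston, TX")
messy <- c("Au", "Austn, TX")

# Hypothetical helper: for each messy string, return the closest clean name,
# or NA when even the closest name is more than `max_edits` edits away.
closest_within <- function(messy, clean, max_edits = 1) {
  # adist() compares whole strings by default and returns a matrix with
  # one row per messy string and one column per clean name.
  d <- adist(messy, clean, ignore.case = TRUE)
  best <- apply(d, 1, which.min)                  # column of the smallest distance
  best_dist <- d[cbind(seq_along(messy), best)]   # that smallest distance
  ifelse(best_dist <= max_edits, clean[best], NA_character_)
}

closest_within(messy, clean)
# [1] NA           "Austin, TX"
```

Because the distance is computed on the whole string, "Au" is eight edits away from "Austin, TX" and is rejected, while "Austn, TX" is one insertion away and is accepted.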

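Tags: r agrep

【Solution 1】:

The problem you are running into may be similar to the one I hit when I started experimenting here. agrepl matches the pattern against substrings of each element, so a short pattern is very permissive unless it is constrained to the full string. When fixed=FALSE the first argument is a regular expression pattern, which makes that constraint possible. The help page even has a "Note" about the issue:

Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.

With a regular expression pattern you can flank the pattern string with "^" and "$"; this is needed because, unlike adist, agrepl has no partial argument: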

> agrepl('^Au$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE
> agrepl('^Austn, TX$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl('^Austn, T$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

So you would need to paste0 those anchors onto each pattern:

> agrepl( paste0('^', 'Austn, Tx', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl( paste0('^', 'Au', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

Using all may be better than using only insertions, and you may want to lower the fraction.
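
As a minimal sketch of the same anchored approach, the bound can also be given as a whole number of edits instead of a fraction, which sidesteps the question of how a fraction is rounded. matches_whole and max_edits are hypothetical names, and the sketch assumes the candidate strings contain no regular expression metacharacters (a name such as "St. Louis, MO" would need escaping first).

```r
# Hypothetical helper: TRUE when `candidate` approximately matches the WHOLE
# of `target`, allowing at most `max_edits` insertions, deletions or
# substitutions. Assumes `candidate` has no regex metacharacters.
matches_whole <- function(candidate, target, max_edits = 1) {
  agrepl(paste0("^", candidate, "$"), target,
         max.distance = max_edits,   # an integer is a count of edits, not a fraction
         ignore.case = TRUE, fixed = FALSE)
}

# One missing character: should be accepted.
matches_whole("Austn, TX", "Austin, TX")

# A short prefix: should be rejected for both names.
matches_whole("Au", c("Austin, TX", "Houston, TX"))
```

The anchors force the comparison onto the full element, so the one-edit allowance applies to the entire string rather than to its best-matching substring.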
