Approximate Matching and Replacement in Text R: Answers

【Question title】: Approximate Matching and Replacement in Text R
【Posted】: 2015-10-28 19:19:40
【Question】:

I have a sentence, and I want to replace only part of the string with a number. When there is an exact match, the gsub function works perfectly.

gsub('great thing', 5555 ,c('hey this is a great thing'))
gsub('good rabbit', 5555 ,c('hey this is a good rabbit in the field'))

But now I have the following problem. If part of the string contains an error, how can I apply a fuzzy matching function to the string?

gsub('great thing', 5555 ,c('hey this is a graet thing'))
gsub('good rabbit', 5555 ,c('hey this is a goood rabit in the field'))

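The algorithm should figure out that "great thing" and "graet thing", or "good rabbit" and "goood rabit", are very similar, and should replace the misspelled part with the number 5555. Ideally we could use the Jaro-Winkler distance to find an approximate match within the string and then replace the approximately matching substring. I need a fairly general algorithm to do this.

Any ideas?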

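【Comments】:

Maybe gsub('gr[ae][ae]t thing', 5555 ,c('hey this is a graet thing'))
agrep for the win!
Hmm, I was thinking of applying a fuzzy matching algorithm using the Jaro-Winkler distance. Is that possible?
You could check out library(stringdist); it has several options.
@akrun But what if it is some other misspelling of "great thing"? You can't write a regex condition for every possibility.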

Tags: r text matching fuzzy-search

【Solution 1】:

Some agrep examples:

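# Default max.distance is 0.1 (a fraction of the pattern length).
agrep("lasy", "1 lazy 2")
# Disallow substitutions.
agrep("lasy", "1 lazy 2", max.distance = list(sub = 0))
# Allow up to 2 edits; returns the indices of the matching elements.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2)
# value = TRUE returns the matching strings instead of their indices.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, value = TRUE)
# ignore.case = TRUE lets "1 LAZY" match as well.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, ignore.case = TRUE)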

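agrep is in base R. If you load stringdist, you can compute string distances with Jaro-Winkler using stringdist, or, if you're lazy, you can just use ain or amatch. For my purposes I tend to use Damerau-Levenshtein (method="dl") more, but your mileage may vary.

Be sure to read carefully how the algorithm parameters work before using them (i.e. set your p, q, and maxDist values to levels that make sense for what you are doing).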
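
As a rough illustration of the stringdist functions mentioned above, here is a minimal sketch. It assumes the stringdist package is installed; the prefix weight p = 0.1 and the cutoff maxDist = 0.15 are arbitrary values picked for this example, not recommendations, and the variable names are made up.

```r
library(stringdist)  # assumed installed: install.packages("stringdist")

targets <- c("great thing", "good rabbit")
typos   <- c("graet thing", "goood rabit")

# Jaro-Winkler distance (0 = identical, 1 = nothing in common).
# p is the prefix weight; p = 0 gives the plain Jaro distance.
jw <- stringdist(typos, targets, method = "jw", p = 0.1)

# amatch() is a fuzzy match(): for each typo, the index of the closest
# target within maxDist, or NA if none is close enough.
idx <- amatch(typos, targets, method = "jw", p = 0.1, maxDist = 0.15)

# ain() is the fuzzy counterpart of %in%.
found <- ain(typos, targets, method = "jw", p = 0.1, maxDist = 0.15)
```

Note that these functions compare whole strings, so on their own they only tell you whether a candidate is close to a target; they do not locate a misspelled phrase inside a longer sentence.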

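【Comments】:

If a similarity is detected, how do I replace the substring?
That's harder. I would probably take n-grams of the string being searched and compare them against the target string, computing the Levenshtein distance. Keep track of which n-gram is which, and rebuild the vector using gsub. Sorry, I can't think of a package that already wraps this up. Maybe tm or qdap?
Or see the answer to a similar question I asked: stackoverflow.com/questions/31843171/…
Thanks! That helped me a lot.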
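
The n-gram idea from the comments could be sketched roughly as follows, using only base R. This is one possible reading of that suggestion, not the commenter's actual code: fuzzy_replace is a hypothetical helper, it uses adist (Levenshtein) rather than Jaro-Winkler, it splits on single spaces, and the max_dist = 3 cutoff is an arbitrary assumption.

```r
# Hypothetical helper: replace the run of words in `text` that is closest
# to `pattern` (by Levenshtein distance) with `replacement`.
fuzzy_replace <- function(text, pattern, replacement, max_dist = 3) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  n <- length(strsplit(pattern, " ", fixed = TRUE)[[1]])
  if (length(words) < n) return(text)

  # Build every run of n consecutive words (word n-grams).
  starts <- seq_len(length(words) - n + 1)
  ngrams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )

  # Levenshtein distance from the pattern to each n-gram.
  d <- as.vector(adist(pattern, ngrams))
  best <- which.min(d)
  if (d[best] > max_dist) return(text)  # nothing close enough: leave as is

  # Rebuild the sentence around the replaced n-gram.
  out <- c(
    head(words, best - 1),
    replacement,
    tail(words, length(words) - (best + n - 1))
  )
  paste(out, collapse = " ")
}

r1 <- fuzzy_replace("hey this is a graet thing", "great thing", "5555")
r2 <- fuzzy_replace("hey this is a goood rabit in the field", "good rabbit", "5555")
```

With the two sentences from the question, r1 is "hey this is a 5555" and r2 is "hey this is a 5555 in the field". Only the single best window is replaced, and a typo that splits or merges words changes the word count, so it would not line up with a fixed n.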