RegEx for fulltext search with typos: answers

【Question title】: RegEx for fulltext search with typos
【Posted】: 2013-02-02 02:46:01
【Question description】:

I have a MySQL table with the following columns:

City      Country  Continent
New York  States   North America
New York  Germany  Europe - considering there's one ;)
Paris     France   Europe

If I want to find "New Yokr", which contains a typo, it is easy with a MySQL stored function:

$querylev = "select City, Country, Continent FROM table 
            WHERE LEVENSHTEIN(`City`,'New Yokr') < 3";
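For readers who want to see what the `< 3` threshold means, here is a minimal JavaScript sketch of the classic edit-distance computation. It is not the MySQL stored function from the question (which is not shown in the post). It is only an illustration, and the function name `levenshtein` is chosen here to mirror it.

```javascript
// Classic dynamic-programming Levenshtein distance
// (insertions, deletions and substitutions each cost 1).
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep when the chars are equal
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein('New Yokr', 'New York')); // 2
```

The swapped "k" and "r" count as two substitutions, so the distance is 2 and the row passes the `< 3` filter.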

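But when there are two New York cities, with fulltext search you can type "New York States" and you get the result you want.

So the question is: can I search for "New Yokr Statse" and get the same result?

Is there any function that merges levenshtein and fulltext into an all-in-one solution, or should I create a new column in MySQL that concatenates the 3 columns?

I know there are other solutions such as Lucene or Sphinx (also soundex and metaphone, but those are not valid for this), but I think they might be rather difficult for me to implement.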

【Question discussion】:

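First of all, have you tried it yourself? I don't think you can get both at once, because the distance between New Yokr Statse and New York States is 4.
What do you mean by trying it myself? I am trying different approaches but I am far from a solution :( For example, splitting each word into tokens and calling the levenshtein distance on them, but for that I have to split every word, which does not seem like a good solution.
I mean that you asked "can I search for 'New Yokr Statse' and get the same result?", and a simple test will tell you no. But you could also be asking "how do I modify this to accept the other case"; it is hard to tell from the post. With this structure, I don't have an answer of my own using MySQL alone. Short of heavy data collection and logging of user behaviour, I don't know how to reliably store and reference mistypes. It is like a "did you mean" feature, and that is what this looks like.
SELECT CONCAT(city, ' ', country, ' ', continent) full FROM table UNION SELECT CONCAT(city, ' ', country) full FROM table UNION SELECT City full FROM table WHERE LEVENSHTEIN(Full, search Term)
Hmm, let's try it. I'll let you know, thanks! Still looking for a way :)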
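The token-splitting idea mentioned in the comments can be sketched outside MySQL as follows. This is only an illustration under assumptions: the `rows` array is a hypothetical in-memory copy of the table, and `fuzzySearch` and its `maxPerToken` threshold are names invented here, not part of any library.

```javascript
// Dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical in-memory copy of the table from the question.
const rows = [
  { city: 'New York', country: 'States', continent: 'North America' },
  { city: 'New York', country: 'Germany', continent: 'Europe' },
  { city: 'Paris', country: 'France', continent: 'Europe' },
];

// A row matches when every search token is within maxPerToken edits
// of at least one token of the concatenated columns.
function fuzzySearch(term, maxPerToken = 2) {
  const searchTokens = term.toLowerCase().split(/\s+/).filter(Boolean);
  return rows.filter((row) => {
    const rowTokens = `${row.city} ${row.country} ${row.continent}`
      .toLowerCase()
      .split(/\s+/);
    return searchTokens.every((s) =>
      rowTokens.some((t) => levenshtein(s, t) <= maxPerToken)
    );
  });
}

console.log(fuzzySearch('New Yokr Statse'));
```

Comparing token by token keeps each typo within a small threshold even though the whole phrase is 4 edits away. A fixed threshold of 2 is loose for very short tokens such as "New", so a real implementation would probably scale it with token length.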

Tags: php regex full-text-search regex-group levenshtein-distance

【Solution 1】:

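Great question, and a good example of how character lists and regex boundaries can be used to design queries and retrieve the data we want.

Depending on the accuracy we want and the data in our database, we can design custom queries based on various expressions, for example for New York State with various typos: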

([new]+\s+[york]+\s+[stae]+)

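Here we have three character lists, which we can update with other possible letters. Each list followed by + matches one or more of its letters, in any order.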

[new]
[york]
[stae]

We have also added two \s+ here as boundaries to increase accuracy.
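Because a character list ignores letter order and repetition, the expression tolerates transposed letters but also accepts strings that are not typos of the target. The following sketch checks a few hand-picked inputs. The `^` and `$` anchors and the sample strings are additions made here for illustration and are not part of the original expression.

```javascript
// Same expression as above, anchored (an assumption made here) so that
// the whole input has to match rather than just a substring.
const pattern = /^([new]+\s+[york]+\s+[stae]+)$/i;

const samples = [
  'New York State',  // correct spelling
  'New Yokr Statse', // typos from the question
  'wen kory tea',    // not a typo of the target, but uses only listed letters
  'New Yrok Ztate',  // contains "Z", which is in none of the lists
];

for (const s of samples) {
  console.log(`${s} -> ${pattern.test(s)}`);
}
```

Only the last sample is rejected, which shows both the flexibility and the looseness of this approach: it filters by allowed letters, not by edit distance.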

DEMO

This snippet just shows how the capturing groups work:

const regex = /([new]+\s+[york]+\s+[stae]+)/gmi;
const str = `Anything we wish to have before followed by a New York Statse then anything we wish to have after. Anything we wish to have before followed by a New  Yokr  State then anything we wish to have after. Anything we wish to have before followed by a New Yokr Stats then anything we wish to have after. Anything we wish to have before followed by a New York Statse then anything we wish to have after. `;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

PHP

$re = '/([new]+\s+[york]+\s+[stae]+)/mi';
$str = 'Anything we wish to have before followed by a New York Statse then anything we wish to have after. Anything we wish to have before followed by a New  Yokr  State then anything we wish to have after. Anything we wish to have before followed by a New Yokr Stats then anything we wish to have after. Anything we wish to have before followed by a New York Statse then anything we wish to have after. ';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

【Discussion】: