了解字符匹配和令牌匹配之间的区别答案

【问题标题】：Understanding the difference between character matching and token matching了解字符匹配和令牌匹配之间的区别
【发布时间】：2012-07-31 23:29:29
【问题描述】：

使用子字符串/字符匹配实现的搜索功能与使用基于标记的匹配实现的搜索功能有什么区别？我正在为产品规划目的寻找一种直观且不太技术性的解释。以下是我的解释，或多或少是正确的，但并不直观或完整。

答案与您选择的最小匹配单位有关：您是匹配单个字符，还是匹配单个单词？子串匹配示例：

"lady's hat".contains("l") == true
"lady's hat".contains("lady's hat") == true

而令牌匹配可能类似于：

Array("lady", "s", "hat").contains("l") == false
Array("lady", "s", "hat").contains("lady") == true
Array("lady", "s", "hat").contains("lady's hat") == false

显然，这是一种过于简单化的做法，并引出了很多问题。我想我可能试图在错误的抽象级别上解释问题。

上下文：我正在使用 Java 开发 Web 应用程序的搜索和过滤功能。我们当前的方法使用LIKE 运算符和MySQL，并且遭受全表扫描的明显性能问题。我们正在考虑使用 Lucene、Solr 或非规范化我们的关系数据。

【问题讨论】：

标签： java mysql solr lucene full-text-search

【解决方案1】：

字符匹配
字符匹配很昂贵，并且总是 O(NP)，其中 N=要搜索的字符数，P=要搜索的词条数。

这是线性搜索的伪代码：

function linear_search(items, terms, match_mode) {
  matched_items = new array();
  for(item_idx=0, item_count=count(items);item_idx < item_count;++item_idx) {
      doc = items[item_idx];
      match_count = 0;
      for(term_idx=0, term_count=count(terms);term_idx < term_count;++term_idx) {
          term = terms[term_idx];
          for(i=0, doc_len = strlen(doc), term_len = strlen(term) ; i < doc_len; i += term_len) {
              if(substr(doc, i, term_len) == term) { 
                  if(mode == 'ANY') {
                      matched_items[]=item_idx;
                      continue 3;
                  }  
                  ++match_count;
                  break;
              }
          }
       }
       if(mode == 'ALL' && (match_count == count(items)) matched_items[]=item_idx;
  }
  return matched_items;
}

必须使用线性搜索操作搜索每个文档（行），因此 N 实际上是数据集上的 sum(strlen(N))（每一行，也就是文档是 N）。 O(NP) 对于大型文档或许多文档或两者兼而有之，是一个非常长的操作。

倒排索引（令牌搜索）
另一方面，基于令牌的搜索将数据预解析为令牌并创建“倒排索引”。每个文档（要搜索的文本）首先被分解为术语，然后这些术语被索引并指向文档。
例如，给定原始数据：

1, the quick brown fox
2, all cows eat grass
3, all the world

创建倒排索引：

all => 2, 3
brown => 1
cows => 2
eat => 2
fox => 1
grass => 2
quick => 1
the => 1, 3
world => 3

通常在令牌上创建 b 树。因此，一个令牌的查找是 O(log(N)) 以获得与该令牌匹配的文档列表。获取实际数据的查找通常是另一个 O(log(N)) 操作。
假设 b-tree 结构，这会导致倒排索引操作成本：

O(log(TERM_COUNT)) + O(log(DOC_MATCH_COUNT))

词位分析
通常索引将存储文档中的单词位置，以及匹配的文档。这允许位置分析，例如 'bar' 附近的 'foo' 而无需查阅文档本身：

all => 2:1, 3:1
brown => 1:3
cows => 2:2
eat => 2:3
fox => 1:4
grass => 2:4
quick => 1:2
the => 1:1, 3:2
world => 3:3

词干
此外，“词干”，如 Porter Stemmer (http://en.wikipedia.org/wiki/Stemming) 通常应用于索引之前和搜索之前的术语。

词干分析器会将“品牌”和“品牌”等词转换为“品牌”，这样搜索品牌就会返回与品牌或品牌或品牌匹配的文档。

【讨论】：