Lucene: Improving unranked search performance?

【Question title】: Lucene: Improving unranked search performance?
【Posted】: 2016-07-30 19:23:31
【Question】:

I'm using Lucene 5.5.0 for indexing. The following criteria describe my environment:

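Indexed documents consist of 8 fields. The fields are the same for every document in the corpus (all documents share the same "schema").
All fields are either String or Long fields, so no text analysis is required. All of them are stored by Lucene. Strings have a maximum length of 255 characters.
The index is "read-mostly": 90% of all requests are (concurrent) reads. I do the locking at the application level, so Lucene does not have to worry about concurrent reads and writes.
When searching the corpus, I do not need any ranking of the results. The order of the retrieved documents can be completely arbitrary.
Queries are usually a combination of boolean, regular expression and numeric range queries.
When searching the corpus, retrieving ALL documents that match the query is the top priority.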

My current search method, which wraps Lucene's API, looks like this:

public Set<Document> performLuceneSearch(Query query) {
        Set<Document> documents = Sets.newHashSet();
        // the reader instance is reused as often as possible, and exchanged
        // when a write occurs using DirectoryReader.openIfChanged(...).
        if (this.reader.numDocs() > 0) {
            // note that there cannot be a limiting number on the result set.
            // I absolutely need to retrieve ALL matching documents, so I have to
            // make use of 'reader.numDocs()' here.
            TopDocs topDocs = this.searcher.search(query, this.reader.numDocs());
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                int documentId = scoreDoc.doc;
                Document document = this.reader.document(documentId);
                documents.add(document);
            }
        }
        return Collections.unmodifiableSet(documents);
}

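Given the environment outlined above, is there any way to do this faster or better? Since I don't need any ranking or sorting (but do need the result to be complete), I feel there should be some corners I can cut to speed things up.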

【Comments】:

Tags: java performance search lucene

【Answer 1】:

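There are a few things you can do to speed up the search. First, if you don't use scoring, you should disable norms, which makes the index smaller. Since you only use StringField and LongField (as opposed to, say, a TextField with a keyword tokenizer), norms are already disabled for these fields, so you've got that one covered.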

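Second, you should build and wrap your queries so that the calculation of actual scores is minimized. That is, if you use BooleanQuery, use Occur.FILTER instead of Occur.MUST. Both have the same inclusion logic, but FILTER clauses are not scored. For other queries, consider wrapping them in a ConstantScoreQuery. However, this might not be necessary at all (explanation follows).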
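As a rough sketch of that second point, assuming the Lucene 5.5 API from the question: the helper below combines sub-queries with Occur.FILTER and shows the ConstantScoreQuery wrapper. The class UnscoredQueries, its method names, and the field names "name" and "status" are made up for illustration and are not part of Lucene.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper, not part of Lucene.
final class UnscoredQueries {

  private UnscoredQueries() {}

  // Requires every clause to match, like Occur.MUST, but without scoring them.
  static Query allOf(final Query... clauses) {
    final BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (final Query clause : clauses) {
      builder.add(clause, BooleanClause.Occur.FILTER);
    }
    return builder.build();
  }

  // Wraps a single query so that every hit gets the same constant score.
  static Query unscored(final Query query) {
    return new ConstantScoreQuery(query);
  }

  // Example with made-up field names: regex on "name" AND exact match on "status".
  static Query example() {
    return allOf(
        new RegexpQuery(new Term("name", "foo.*")),
        new TermQuery(new Term("status", "active")));
  }
}
```

A numeric range query from the question's mix could be passed to allOf in the same way as the two clauses in example().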

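Third, use a custom Collector. The default search methods are meant for small, ranked or sorted result sets, and your use case does not fit that pattern. Here is a sample implementation: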

import org.apache.lucene.document.Document;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;


final class AllDocumentsCollector extends SimpleCollector {

  private final List<Document> documents;
  private LeafReader currentReader;

  public AllDocumentsCollector(final int numDocs) {
    this.documents = new ArrayList<>(numDocs);
  }

  public List<Document> getDocuments() {
    return Collections.unmodifiableList(documents);
  }

  @Override
  protected void doSetNextReader(final LeafReaderContext context) {
    currentReader = context.reader();
  }

  @Override
  public void collect(final int doc) throws IOException {
    documents.add(currentReader.document(doc));
  }

  @Override
  public boolean needsScores() {
    return false;
  }
}

You would use it like this:

public List<Document> performLuceneSearch(final Query query) throws IOException {
  // the reader instance is reused as often as possible, and exchanged
  // when a write occurs using DirectoryReader.openIfChanged(...).
  final AllDocumentsCollector collector = new AllDocumentsCollector(this.reader.numDocs());
  this.searcher.search(query, collector);
  return collector.getDocuments();
}

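The collector uses a list instead of a set. Document implements neither equals nor hashCode, so you gain nothing from a set and only pay for the additional equality checks. The resulting order is the so-called index order: the first document is the one that comes first in the index. That is roughly insertion order if you have no custom merge policy in place, but ultimately it is an arbitrary order that is not guaranteed to be stable or reliable. In addition, the collector signals that no scores are needed, which gives you about the same benefit as option 2 above, so you can save yourself some trouble and leave your queries as they are.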
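To illustrate the equals/hashCode point without pulling in Lucene, here is a small sketch using a stand-in class. PlainDoc is hypothetical; it only mimics the relevant property of Document, namely that it inherits identity-based equals and hashCode from Object.

```java
import java.util.HashSet;
import java.util.Set;

final class IdentityEqualityDemo {

  private IdentityEqualityDemo() {}

  // Hypothetical stand-in for Lucene's Document: no equals/hashCode override.
  static final class PlainDoc {
    final String id;

    PlainDoc(final String id) {
      this.id = id;
    }
  }

  // Adds two instances with identical content and reports the set size.
  static int distinctCount() {
    final Set<PlainDoc> set = new HashSet<>();
    set.add(new PlainDoc("42"));
    set.add(new PlainDoc("42"));
    // Both instances stay in the set because equality is by identity.
    return set.size();
  }
}
```

Since the set cannot deduplicate such objects anyway, a list gives the same content with less overhead.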

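Depending on what you need the Documents for, you can get an even larger speedup by using DocValues instead of stored fields. This only holds if you need just one or two of your fields, not all of them. The rule of thumb: for few documents but many fields, use stored fields; for many documents but few fields, use DocValues. In any case you should experiment. 8 fields is not that many, and you might benefit even when loading all of them. Here is how you would use DocValues during indexing: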

import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

document.add(new StringField(fieldName, stringContent, Field.Store.NO));
document.add(new SortedDocValuesField(fieldName, new BytesRef(stringContent)));
// OR
document.add(new LongField(fieldName, longValue, Field.Store.NO));
document.add(new NumericDocValuesField(fieldName, longValue));

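The field name can be the same, and if you can rely entirely on DocValues you may choose not to store the other fields. The collector has to change as well; here is an example for a single field: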

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;


final class AllDocumentsCollector extends SimpleCollector {

  private final List<String> documents;
  private final String fieldName;
  private SortedDocValues docValues;

  public AllDocumentsCollector(final String fieldName, final int numDocs) {
    this.fieldName = fieldName;
    this.documents = new ArrayList<>(numDocs);
  }

  public List<String> getDocuments() {
    return Collections.unmodifiableList(documents);
  }

  @Override
  protected void doSetNextReader(final LeafReaderContext context) throws IOException {
    docValues = context.reader().getSortedDocValues(fieldName);
  }

  @Override
  public void collect(final int doc) throws IOException {
    documents.add(docValues.get(doc).utf8ToString());
  }

  @Override
  public boolean needsScores() {
    return false;
  }
}

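For the long fields you would use getNumericDocValues instead. You have to repeat this for every field you need to load (in the same collector, of course), and most importantly: measure at what point loading full documents from stored fields becomes better than using DocValues.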
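A minimal sketch of what "repeat this for every field" could look like for one String field and one long field, written against the Lucene 5.5 DocValues API used above. The class TwoFieldCollector and its Row type are illustrative names, not something from the original answer; it uses the DocValues.getSorted and DocValues.getNumeric helpers so that a segment without values for a field does not yield null.

```java
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch for Lucene 5.5; class and field layout are illustrative.
final class TwoFieldCollector extends SimpleCollector {

  // One collected hit: a String field and a long field read from DocValues.
  static final class Row {
    final String text;
    final long number;

    Row(final String text, final long number) {
      this.text = text;
      this.number = number;
    }
  }

  private final String textField;
  private final String numberField;
  private final List<Row> rows;
  private SortedDocValues textValues;
  private NumericDocValues numberValues;

  TwoFieldCollector(final String textField, final String numberField, final int numDocs) {
    this.textField = textField;
    this.numberField = numberField;
    this.rows = new ArrayList<>(numDocs);
  }

  List<Row> getRows() {
    return Collections.unmodifiableList(rows);
  }

  @Override
  protected void doSetNextReader(final LeafReaderContext context) throws IOException {
    final LeafReader reader = context.reader();
    // These helpers return an empty instance instead of null when the
    // segment has no values for the field.
    textValues = DocValues.getSorted(reader, textField);
    numberValues = DocValues.getNumeric(reader, numberField);
  }

  @Override
  public void collect(final int doc) throws IOException {
    // doc is relative to the current segment, matching the per-segment DocValues.
    rows.add(new Row(textValues.get(doc).utf8ToString(), numberValues.get(doc)));
  }

  @Override
  public boolean needsScores() {
    return false;
  }
}
```

It would be passed to searcher.search(query, collector) exactly like the collectors above, with the two field names matching whatever was indexed as SortedDocValuesField and NumericDocValuesField.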

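One final note:

"I'm doing locking at the application level, so Lucene does not have to worry about concurrent reads and writes."

IndexSearcher and IndexWriter are already thread-safe by themselves. If you lock solely for Lucene's sake, you can remove those locks and simply share both instances among all your threads. Also consider using org.apache.lucene.search.SearcherManager to reuse the IndexReader/IndexSearcher.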
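For reference, a hedged sketch of the acquire/release pattern that SearcherManager expects. SharedSearch is a hypothetical wrapper name; how the manager is constructed depends on your setup, and the constructor shown in the comment is the Lucene 5.5 variant that takes the IndexWriter.

```java
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;

import java.io.IOException;
import java.util.Objects;

// Hypothetical wrapper; only SearcherManager and IndexSearcher are Lucene classes.
final class SharedSearch {

  private final SearcherManager manager;

  // In Lucene 5.5 the manager can be built from the writer, e.g.
  // new SearcherManager(indexWriter, true, null), for near-real-time readers.
  SharedSearch(final SearcherManager manager) {
    this.manager = Objects.requireNonNull(manager, "manager");
  }

  // Call after writes; reopens the reader only if the index changed.
  void refresh() throws IOException {
    manager.maybeRefresh();
  }

  // Can be called from many threads at once.
  void search(final Query query, final Collector collector) throws IOException {
    final IndexSearcher searcher = manager.acquire();
    try {
      searcher.search(query, collector);
    } finally {
      manager.release(searcher); // always release what was acquired
    }
  }
}
```

This would take over the manual DirectoryReader.openIfChanged(...) handling mentioned in the question's code comments.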

【Comments】:

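Wow, that is a comprehensive answer! Thank you very much, I really appreciate it! I will definitely try out these suggestions. I need the locking for the application itself, so I didn't implement it specifically for Lucene, but it's good to know that these classes are thread-safe on their own. One last question: using DocValues fields instead of stored fields requires a change to the persistent file format, right? I have already deployed my code at several sites, and some people would be unhappy about a breaking change.
Yes, using DocValues requires a change to the index, which effectively means a full reindex. However, DocValues can simply be added to the index (nothing has to be removed), so I could imagine a migration script that does this with zero downtime. All in all, that is one more reason to test this thoroughly and weigh the performance gain against the maintenance cost.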