【Question title】: Accessing words around a positional match in Lucene
【Posted】: 2014-09-12 18:20:35
【Question】:

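Given a term match in a document, what is the best way to access the words around that match? I have read this article, http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the Lucene API has changed completely since it was written (2009). Can someone point me to how to do this in a newer version of Lucene, such as Lucene 4.6.1?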

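Edit

I understand it now: the postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (which wraps a byte[]) per term within a single field, not a Term. Another is that when asking for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (typically this will be the deleted docs, but in general you can provide any Bits). My updated code: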

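import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {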
  public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
    "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, config);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    //Get a searcher

    DirectoryReader dirReader = DirectoryReader.open(ramDir);
    IndexSearcher searcher = new IndexSearcher(dirReader);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
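    IndexReader reader = searcher.getIndexReader();
    // Only the first segment is examined; this small index presumably has just one,
    // so the segment-local spans.doc() can be used directly as the document id
    Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null, new LinkedHashMap<Term, TermContext>());
    int window = 2; // get the words within two positions of the match
    while (spans.next()) {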
      int start = spans.start() - window;
      int end = spans.end() + window;
      Map<Integer, String> entries = new TreeMap<Integer, String>();

      System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
      Fields fields = reader.getTermVectors(spans.doc());
      Terms terms = fields.terms("content");

      TermsEnum termsEnum = terms.iterator(null);
      BytesRef text;
      while((text = termsEnum.next()) != null) {        
        // could store the BytesRef here, but String is easier for this example;
        // utf8ToString() decodes as UTF-8 instead of the platform default charset
        String s = text.utf8ToString();
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          int i = 0;
          int position = -1;
          while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
            // spans.end() is one past the last matched position, so the upper bound is exclusive
            if (position >= start && position < end) {
              entries.put(position, s);
            }
            i++;
          }
        }
      }
      System.out.println("Entries:" + entries);
    }
  }
}
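
The window logic in the inner loop can be tried without an index. This is a minimal sketch in plain Java, not Lucene code: the class `ContextWindowSketch`, the method `wordsAround`, and the hand-written term-to-positions map are all hypothetical stand-ins for what the term vector would supply. It assumes the match end is exclusive, as with `Spans.end()`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: works on a plain term -> positions map
class ContextWindowSketch {

  // Collects the terms whose position lies in [matchStart - window, matchEnd + window)
  static Map<Integer, String> wordsAround(Map<String, List<Integer>> termPositions,
                                          int matchStart, int matchEnd, int window) {
    int start = matchStart - window;
    int end = matchEnd + window; // matchEnd is exclusive, so end is exclusive too
    Map<Integer, String> entries = new TreeMap<>(); // sorted by position
    for (Map.Entry<String, List<Integer>> e : termPositions.entrySet()) {
      for (int position : e.getValue()) {
        if (position >= start && position < end) {
          entries.put(position, e.getKey());
        }
      }
    }
    return entries;
  }

  public static void main(String[] args) {
    // Hand-written positions for "mary had a little lamb whose fleece was white as snow"
    // (assumption: no stop words removed, one position per word)
    Map<String, List<Integer>> termPositions = new LinkedHashMap<>();
    String[] words = {"mary", "had", "a", "little", "lamb", "whose",
                      "fleece", "was", "white", "as", "snow"};
    for (int i = 0; i < words.length; i++) {
      termPositions.put(words[i], Arrays.asList(i));
    }
    // "fleece" sits at position 6, so the match is [6, 7)
    System.out.println(wordsAround(termPositions, 6, 7, 2));
    // prints {4=lamb, 5=whose, 6=fleece, 7=was, 8=white}
  }
}
```

A TreeMap keeps the entries ordered by position, which is why the surrounding words come out in reading order even though terms are enumerated alphabetically in a real term vector.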

【Comments】:

    Tags: java lucene position posting


    【Solution 1】:

    Use Highlighter. Highlighter.getBestFragment can be used to get the portion of the text that contains the best match. For example:

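    // Assumes searcher, query, maxdocs, myAnalyzer and fieldName are defined elsewhere
    TopDocs docs = searcher.search(query, maxdocs);
    Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);
    
    // Scorer here is org.apache.lucene.search.highlight.Scorer
    Scorer scorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(scorer);
    // The field must be stored, otherwise firstDoc.get(fieldName) returns null
    String fragment = highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));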
    

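    【Comments】:

    • Thanks, but I don't think I need the Highlighter class to do this.
    • Of course you don't. If you like, you can do a linear search for your term through the returned document yourself. But why not use the tool designed for this purpose?
    • Yes, you're right. I tried your solution, and it works even when the search text is stemmed: I still get the matching word. Thanks!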
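
The linear search mentioned in the comments could look roughly like the following. It is a minimal sketch in plain Java with no Lucene involved; the class `LinearContextSketch`, the method `contexts`, and the naive lowercase-and-split tokenization are assumptions for illustration, and unlike the Highlighter it does no stemming or analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: scans raw text for a term and returns the words around each hit
class LinearContextSketch {

  static List<String> contexts(String text, String term, int window) {
    // Naive tokenization (assumption): lowercase, split on runs of non-letter/non-digit characters
    String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    String needle = term.toLowerCase(Locale.ROOT);
    List<String> result = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].equals(needle)) {
        int from = Math.max(0, i - window);               // clamp at the start of the text
        int to = Math.min(tokens.length, i + window + 1); // exclusive, clamp at the end
        result.add(String.join(" ", Arrays.copyOfRange(tokens, from, to)));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(contexts("The robber wore a black fleece jacket and a baseball cap.", "fleece", 2));
    // prints [a black fleece jacket and]
  }
}
```

Because the scan compares exact lowercase tokens, a query for "fleece" would miss "fleeces"; that gap is the kind of thing the analyzer-aware Highlighter handles.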