[Posted]: 2014-09-12 18:20:35
[Question]:
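Given a term match in a document, what is the best way to access the words around that match? I have read this article, http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the Lucene API has changed completely since it was written (2009). Could someone point me to how to do this in a newer version of Lucene, such as Lucene 4.6.1?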
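Edit:
I now understand the following: the postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (which wraps a byte[]) per term within a single field, not a Term. Another is that when asking for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (typically these are the deleted docs, but in general you can provide any Bits). Based on that, this is the code I have: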
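import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
    public static String[] DOCS = {
        "The quick red fox jumped over the lazy brown dogs.",
        "Mary had a little lamb whose fleece was white as snow.",
        "Moby Dick is a story of a whale and a man obsessed.",
        "The robber wore a black fleece jacket and a baseball cap.",
        "The English Springer Spaniel is the best of all dogs.",
        "The fleece was green and red",
        "History looks fondly upon the story of the golden fleece, but most people don't agree"
    };

    public static void main(String[] args) throws IOException {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        // Index some made-up content
        IndexWriter writer = new IndexWriter(ramDir, config);
        for (int i = 0; i < DOCS.length; i++) {
            Document doc = new Document();
            Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(id);
            // Store both position and offset information in the term vector
            Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(text);
            writer.addDocument(doc);
        }
        writer.close();
        // Get a searcher
        DirectoryReader dirReader = DirectoryReader.open(ramDir);
        IndexSearcher searcher = new IndexSearcher(dirReader);
        // Do a search using SpanQuery
        SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
        TopDocs results = searcher.search(fleeceQ, 10);
        for (int i = 0; i < results.scoreDocs.length; i++) {
            ScoreDoc scoreDoc = results.scoreDocs[i];
            System.out.println("Score Doc: " + scoreDoc);
        }
        IndexReader reader = searcher.getIndexReader();
        int window = 2; // get the words within two positions of the match
        // Walk every segment, not only leaves().get(0), so multi-segment indexes also work
        for (AtomicReaderContext ctx : reader.leaves()) {
            Spans spans = fleeceQ.getSpans(ctx, null, new LinkedHashMap<Term, TermContext>());
            while (spans.next()) {
                // spans.doc() is relative to the segment; add docBase to get the top-level doc id
                int docId = ctx.docBase + spans.doc();
                int start = spans.start() - window;
                // spans.end() is one past the last matched position, hence the -1
                int end = spans.end() - 1 + window;
                Map<Integer, String> entries = new TreeMap<Integer, String>();
                System.out.println("Doc: " + docId + " Start: " + start + " End: " + end);
                Fields fields = reader.getTermVectors(docId);
                Terms terms = fields.terms("content");
                TermsEnum termsEnum = terms.iterator(null);
                BytesRef text;
                while ((text = termsEnum.next()) != null) {
                    // could store the BytesRef here, but String is easier for this example;
                    // utf8ToString() decodes as UTF-8 instead of the platform default charset
                    String s = text.utf8ToString();
                    DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
                    if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        int freq = positionsEnum.freq();
                        for (int i = 0; i < freq; i++) {
                            int position = positionsEnum.nextPosition();
                            if (position >= start && position <= end) {
                                entries.put(position, s);
                            }
                        }
                    }
                }
                System.out.println("Entries:" + entries);
            }
        }
    }
}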
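To check the windowing logic separately from Lucene, here is a minimal plain-Java sketch. It assumes the term vector has already been read into a map from term to positions; `WindowSketch`, `wordsAround`, and `sampleVector` are hypothetical names, and the positions are hand-built for the "Mary had a little lamb" document under the assumption that stop words were dropped but their position gaps were kept. It also assumes the match end is exclusive (one past the last matched position), as with a single-term span.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WindowSketch {
    // Returns the terms whose positions lie within `window` positions of the match,
    // ordered by position. matchEnd is exclusive (one past the last matched position).
    static List<String> wordsAround(Map<String, int[]> termPositions,
                                    int matchStart, int matchEnd, int window) {
        int lo = matchStart - window;
        int hi = matchEnd - 1 + window;
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                if (p >= lo && p <= hi) {
                    byPosition.put(p, e.getKey());
                }
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    // Hand-built stand-in for one document's term vector (assumed positions, not Lucene output).
    static Map<String, int[]> sampleVector() {
        Map<String, int[]> tv = new LinkedHashMap<>();
        tv.put("mary", new int[] {0});
        tv.put("had", new int[] {1});
        tv.put("little", new int[] {3});
        tv.put("lamb", new int[] {4});
        tv.put("whose", new int[] {5});
        tv.put("fleece", new int[] {6});
        tv.put("white", new int[] {8});
        tv.put("snow", new int[] {10});
        return tv;
    }

    public static void main(String[] args) {
        // "fleece" matches at position 6, so the span is [6, 7)
        System.out.println(wordsAround(sampleVector(), 6, 7, 2)); // prints [lamb, whose, fleece, white]
    }
}
```

Because the stop word at position 7 is missing from the vector, the right side of the window yields only "white"; a window counted in positions is not the same as a window counted in surviving terms.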
[Discussion]:
Tags: java lucene position posting