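【Posted】: 2016-04-07 11:21:44
【Question】:
Suppose I have two documents: D1 contains the word "lucene" twice, and D2 contains it three times. I want D2 to score higher than D1 for the query "lucene". The catch is that D1 consists of only two words (i.e. "lucene lucene"), while D2 has 100 words, 3 of which are "lucene". Lucene's default scoring model ranks D1 above D2 because it favors matches in shorter fields. I want to disable that behavior so that D2 ranks above D1. This is a requirement of my project.
【Question discussion】:
Tags: lucene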
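You will need to implement a Similarity that suits your needs. You could implement Similarity directly, but you will probably find it simpler to copy ClassicSimilarity (called DefaultSimilarity in versions before 5.4) and neutralize the factors you don't want to affect the score (i.e. return a constant from them). For example, here is a very simple implementation that just returns the frequency of the query terms in the document: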
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class SimpleSimilarity extends TFIDFSimilarity {

    // Comments describe briefly what these methods do in the *standard* implementation.
    // Not what they do in this implementation (which, for most of them, is nothing at all)

    public SimpleSimilarity() {}

    // boosts results which match more query terms
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1f;
    }

    // constant per query, normalizes scores somewhat based on query
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1f;
    }

    // Norms should be disabled when using this similarity
    // They are useless to it, and would just be wasted space.
    @Override
    public final long encodeNormValue(float f) {
        return 1L;
    }

    @Override
    public final float decodeNormValue(long norm) {
        return 1f;
    }

    // Weighs shorter fields more heavily
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;
    }

    // Higher frequency terms (more matches) scored higher
    @Override
    public float tf(float freq) {
        // return (float)Math.sqrt(freq); The standard tf impl
        return freq;
    }

    // Scores closer matches higher when using a sloppy phrase query
    @Override
    public float sloppyFreq(int distance) {
        return 1.0f;
    }

    // ClassicSimilarity doesn't really do much with payloads. This is unmodified
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1f;
    }

    // Weigh matches on rarer terms more heavily.
    @Override
    public float idf(long docFreq, long numDocs) {
        return 1f;
    }

    @Override
    public String toString() {
        return "SimpleSimilarity";
    }
}
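To see why this changes the ranking, it may help to work through the arithmetic for the two documents in the question. The following is only a rough sketch in plain Java, not Lucene code: ScoringSketch, classicScore and simpleScore are hypothetical names, and the classic formula is reduced to tf times length norm, on the assumption that idf, queryNorm and coord are identical for both documents in a single-term query and that the precision lost when norms are encoded can be ignored.

```java
// Illustrative arithmetic only. ScoringSketch and its methods are
// hypothetical names for this example, not part of the Lucene API.
final class ScoringSketch {

    // Simplified classic-style score: sqrt(freq) damped by 1/sqrt(fieldLength).
    // Assumes idf, queryNorm and coord are equal for both documents and
    // ignores the lossy encoding of norms.
    static double classicScore(int freq, int fieldLength) {
        return Math.sqrt(freq) * (1.0 / Math.sqrt(fieldLength));
    }

    // What a similarity like SimpleSimilarity reduces to for one term:
    // the raw term frequency, with every other factor fixed at 1.
    static double simpleScore(int freq) {
        return freq;
    }

    public static void main(String[] args) {
        // D1: 2 words, both "lucene". D2: 100 words, 3 of them "lucene".
        System.out.println("classic D1 = " + classicScore(2, 2));
        System.out.println("classic D2 = " + classicScore(3, 100));
        System.out.println("simple  D1 = " + simpleScore(2));
        System.out.println("simple  D2 = " + simpleScore(3));
    }
}
```

Under these assumptions the classic-style score is about 1.0 for D1 and about 0.17 for D2, so D1 wins; with raw frequency the scores are 2 and 3, so D2 wins. When using the real class, the similarity generally needs to be set both at index time (IndexWriterConfig.setSimilarity) and at search time (IndexSearcher.setSimilarity).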
【Discussion】: