【Question title】: I need a way to score Lucene documents using term frequency only. Is there any flag that needs to be changed for this?
【Posted】: 2016-04-07 11:21:44
【Question description】:

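Suppose I have two documents: D1 contains the word "lucene" twice, and D2 contains it three times. I want Lucene to score D2 higher than D1. The catch is that D1 has only two words in total (i.e. "lucene lucene"), while D2 has 100 words, 3 of which are "lucene". The default Lucene scoring model scores D1 higher than D2. I want to disable this behavior and rank D2 above D1. This is a requirement of my project.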

【Question comments】:

    Tags: lucene


    【Solution 1】:

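    You need to implement a Similarity that meets your needs. You could implement Similarity directly, but you will probably find it simpler to copy ClassicSimilarity (DefaultSimilarity in versions before 5.4) and strip out the factors you don't want to affect the score (i.e. make them return a constant). For example, here is a very simple implementation that just returns the frequency of the query term in the document: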

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.TFIDFSimilarity;
    import org.apache.lucene.util.BytesRef;
    
    public class SimpleSimilarity extends TFIDFSimilarity {
    //Comments describe briefly what these methods do in the *standard* implementation.
    //Not what they do in this implementation (which, for most of them, is nothing at all)
    
      public SimpleSimilarity() {}
    
      //boosts results which match more query terms
      @Override
      public float coord(int overlap, int maxOverlap) {
        return 1f;
      }
    
      //constant per query, normalizes scores somewhat based on query
      @Override
      public float queryNorm(float sumOfSquaredWeights) {
        return 1f;
      }
    
      //Norms should be disabled when using this similarity
      //They are useless to it, and would just be wasted space.
      @Override
      public final long encodeNormValue(float f) {
        return 1L;
      }
    
      @Override
      public final float decodeNormValue(long norm) {
        return 1f;
      }
    
      //Weighs shorter fields more heavily
      @Override
      public float lengthNorm(FieldInvertState state) {
        return 1f;
      }
    
      //Higher frequency terms (more matches) scored higher
      @Override
      public float tf(float freq) {
        //return (float)Math.sqrt(freq);  The standard tf impl
        return freq;
      }
    
      //Scores closer matches higher when using a sloppy phrase query
      @Override
      public float sloppyFreq(int distance) {
        return 1.0f;
      }
    
      //ClassicSimilarity doesn't really do much with payloads.  This is unmodified
      @Override
      public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1f;
      }
    
      //Weigh matches on rarer terms more heavily.
      @Override
      public float idf(long docFreq, long numDocs) {
        return 1f;
      }
    
      @Override
      public String toString() {
        return "SimpleSimilarity";
      }
    }
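
To see why turning the other factors into constants flips the ranking, the arithmetic for the two documents in the question can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: `ScoreSketch`, `classicScore` and `rawTfScore` are hypothetical names, the query is a single term, and idf, coord and queryNorm are left out because they are the same for both documents. The classic factors used are tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms), with no boost.

```java
public class ScoreSketch {

    // Hypothetical helper: the per-document part of the classic TF-IDF score
    // for a single-term query. idf, coord and queryNorm are omitted because
    // they are identical for every document matching the same query.
    public static double classicScore(int termFreq, int fieldLength) {
        double tf = Math.sqrt(termFreq);                   // classic tf factor
        double lengthNorm = 1.0 / Math.sqrt(fieldLength);  // classic length norm, boost assumed 1
        return tf * lengthNorm;
    }

    // Hypothetical helper: what remains when every factor except tf is fixed
    // at 1 and tf returns the raw frequency, as in SimpleSimilarity.
    public static double rawTfScore(int termFreq) {
        return termFreq;
    }

    public static void main(String[] args) {
        // D1: 2 terms, both "lucene". D2: 100 terms, 3 of them "lucene".
        System.out.println("classic D1 = " + classicScore(2, 2));
        System.out.println("classic D2 = " + classicScore(3, 100));
        System.out.println("raw tf  D1 = " + rawTfScore(2));
        System.out.println("raw tf  D2 = " + rawTfScore(3));
    }
}
```

Under these assumptions the classic factors give D1 roughly 1.0 and D2 roughly 0.17, so D1 wins because of length normalization, while the raw-frequency version gives 2 and 3, so D2 wins. Real Lucene scores will differ somewhat, since norms are stored in a lossy encoding. To have any effect, a custom similarity such as `SimpleSimilarity` has to be passed to both `IndexWriterConfig.setSimilarity` and `IndexSearcher.setSimilarity`.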
    

    【Comments】:
