Get cosine similarity between two documents in Lucene: answers

【Question title】: get cosine similarity between two documents in lucene
【Posted】: 2010-12-23 02:03:17
【Question description】:

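I have built an index in Lucene. Without specifying a query, I want to get a score (cosine similarity or some other distance) between two documents in the index.

For example, I fetch the documents with ids 2 and 4 from a previously opened IndexReader ir: Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thanks

【Question discussion】: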

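【Solution 1】:

When indexing, you have the option to store term frequency vectors.

At runtime, look up the term frequency vectors of both documents with IndexReader.getTermFreqVector(), and look up the document frequency of each term with IndexReader.docFreq(). That gives you all the components needed to calculate the cosine similarity between the two documents.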
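Once the per-document term weights have been read out of the index, the cosine itself needs no Lucene at all. The following is a minimal sketch in plain Java, assuming each document has already been reduced to a term-to-weight map; the class name SparseCosine, the example terms, and the convention of returning 0 for an empty vector are illustrative choices, not part of Lucene.

```java
import java.util.Map;

final class SparseCosine {

    /** Cosine similarity between two sparse term -> weight vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        // Only terms present in both documents contribute to the dot product.
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumption: an empty (all-zero) vector is treated as similarity 0 instead of NaN.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean (L2) length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical weights; in practice they would come from the stored term vectors.
        Map<String, Double> d1 = Map.of("mitral", 1.0, "valve", 1.0, "surgery", 2.0);
        Map<String, Double> d2 = Map.of("valve", 1.0, "surgery", 1.0, "open", 1.0);
        System.out.println(cosine(d1, d2));
    }
}
```

The weights can be raw term frequencies or frequencies scaled by document frequency; the function does not care which.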

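A simpler approach might be to submit document A as a query (add all of its words to the query as OR terms, boosting each by its term frequency) and look for document B in the result set.

【Discussion】:

Yes. I first used the term frequency vectors to get what I wanted, but I wanted to check how fast it would be to get the similarity from Lucene itself. As for the second part of your answer, I checked the javadoc and found no obvious way to get a similarity score. Sure, I can look for document B in the result set, but the only thing I get is its position in the TopDocs, not the exact similarity score between the two document vectors, which is what I want.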

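【Solution 2】:

I know the question has already been answered, but for people who may come here in the future, a good example of a solution can be found here: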

http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html

【Discussion】:

【Solution 3】:

As Julia pointed out, Sujit Pal's example is very useful, but the Lucene 4 API has changed substantially. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored, with term vectors (needed by getTermVector). */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2)
            throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            // On a single-document term vector, totalTermFreq() is the term's count within that document.
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

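    /* Lays the frequencies out over the shared vocabulary, then scales by the L1 norm. */
    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            // Terms missing from this document get a frequency of 0.
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        // mapDivide already returns a RealVector, so no cast is needed.
        return vector.mapDivide(vector.getL1Norm());
    }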
}
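
A note on the final mapDivide step in toRealVector: dividing each vector by its L1 norm only rescales it by a positive constant, and cosine similarity is unchanged by such rescaling. A short derivation, where \alpha and \beta stand for the reciprocals of the two L1 norms (assumed positive, i.e. neither document is empty):

```latex
\cos(\alpha v_1, \beta v_2)
  = \frac{(\alpha v_1)\cdot(\beta v_2)}
         {\lVert \alpha v_1 \rVert_2 \, \lVert \beta v_2 \rVert_2}
  = \frac{\alpha\beta \,(v_1 \cdot v_2)}
         {\alpha\beta \, \lVert v_1 \rVert_2 \, \lVert v_2 \rVert_2}
  = \cos(v_1, v_2), \qquad \alpha, \beta > 0 .
```

So the normalization should not affect the value returned by getCosineSimilarity(), apart from floating-point rounding.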

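【Discussion】:

Is VecTextField taken from this question?
I am testing this with the Sujit Pal example: Document #0: Mitral valve surgery - minimally invasive (31825) Document #1: Mitral valve surgery - open (31835) Document #2: Laryngectomy (31706), but it gives different results! Can you explain why? Thanks
@tiendv How did you get Sujit Pal's documents? He does not provide a link to their contents on his web page, does he? He just lists their titles? If you are only using the document titles, you will get very different results, because those titles are very different.
Yes, I see; I have been checking this. Sujit Pal's results are not correct
No, his results may well be correct. We do not know; he simply did not provide enough information to repeat his experiment.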

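【Solution 4】:

This is a very good solution by Mark Butler, but the calculation of the tf/idf weights is WRONG!

Term frequency (tf): how many times the term appears in this document (not in all documents, as in the code with termsEnum.totalTermFreq()).

Document frequency (df): the total number of documents in which the term appears.

Inverse document frequency: idf = log(N/df), where N is the total number of documents.

Tf/idf weight = tf * idf, for a given term and a given document.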
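
To make the definitions concrete, here is a minimal sketch in plain Java that applies them literally. The class name TfIdfSketch and the example counts are made up for illustration, and the natural logarithm is an assumption, since the text does not fix a base.

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfSketch {

    /** idf = log(N / df), using the natural logarithm (the base is an assumption). */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /**
     * Weight of every term of one document: w = tf * idf.
     * Assumes each term in tf has a document frequency of at least 1 in df.
     */
    static Map<String, Double> weights(Map<String, Integer> tf,
                                       Map<String, Integer> df,
                                       int numDocs) {
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.get(e.getKey());
            result.put(e.getKey(), e.getValue() * idf(numDocs, docFreq));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one document in a two-document index.
        Map<String, Integer> tf = Map.of("good", 2, "lucene", 1);
        Map<String, Integer> df = Map.of("good", 2, "lucene", 1);
        System.out.println(weights(tf, df, 2));
    }
}
```

With this literal definition, a term that occurs in every document gets idf = log(1) = 0 and therefore weight 0. The code later in this answer uses 1 + log(N) - log(df) instead, which keeps such terms non-zero.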

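I was hoping for an efficient calculation using Lucene! I have not been able to find an efficient way to compute the correct tf/idf weights.

Edit: I wrote this code to compute the weights as tf/idf weights rather than pure term frequencies. It works quite well, but I wonder whether there is a more efficient way.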

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.RealVector;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class CosineSimeTest {

    public static void main(String[] args) {
        try {
            CosineSimeTest cosSim = new 
                    CosineSimeTest( "This is good", 
                            "This is good" );
            System.out.println( cosSim.getCosineSimilarity() );
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static final String CONTENT = "Content";
    public static final int N = 2;//Total number of documents

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineSimeTest(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Double> f1 = getWieghts(reader, 0);
        Map<String, Double> f2 = getWieghts(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        System.out.println( "V1: " +v1 );
        v2 = toRealVector(f2);
        System.out.println( "V2: " +v2 );
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored, with term vectors (needed by getTermVector). */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        double dotProduct = v1.dotProduct(v2);
        System.out.println( "Dot: " + dotProduct);
        System.out.println( "V1_norm: " + v1.getNorm() + ", V2_norm: " + v2.getNorm() );
        double normalization = (v1.getNorm() * v2.getNorm());
        System.out.println( "Norm: " + normalization);
        return dotProduct / normalization;
    }


    Map<String, Double> getWieghts(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        Map<String, Integer> docFrequencies = new HashMap<>();
        Map<String, Integer> termFrequencies = new HashMap<>();
        Map<String, Double> tf_Idf_Weights = new HashMap<>();
        TermsEnum termsEnum = null;
        DocsEnum docsEnum = null;


        termsEnum = vector.iterator(termsEnum);
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            // Document frequency comes from the whole index; the term vector only covers this document.
            docFrequencies.put(term, reader.docFreq( new Term( CONTENT, term ) ));

            docsEnum = termsEnum.docs(null, null);
            while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                termFrequencies.put(term, docsEnum.freq());
            }

            terms.add(term);
        }

        for ( String term : docFrequencies.keySet() ) {
            int tf = termFrequencies.get(term);
            int df = docFrequencies.get(term);
            // Smoothed variant, 1 + log(N/df): a term found in every document keeps a non-zero weight.
            double idf = ( 1 + Math.log(N) - Math.log(df) );
            double w = tf * idf;
            tf_Idf_Weights.put(term, w);
            //System.out.printf("Term: %s - tf: %d, df: %d, idf: %f, w: %f\n", term, tf, df, idf, w);
        }

        System.out.println( "Printing docFrequencies:" );
        printMap(docFrequencies);

        System.out.println( "Printing termFrequencies:" );
        printMap(termFrequencies);

        System.out.println( "Printing tf/idf weights:" );
        printMapDouble(tf_Idf_Weights);
        return tf_Idf_Weights;
    }

    RealVector toRealVector(Map<String, Double> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        double value = 0;
        for (String term : terms) {

            if ( map.containsKey(term) ) {
                value = map.get(term);
            }
            else {
                value = 0;
            }
            vector.setEntry(i++, value);
        }
        return vector;
    }

    public static void printMap(Map<String, Integer> map) {
        for ( String key : map.keySet() ) {
            System.out.println( "Term: " + key + ", value: " + map.get(key) );
        }
    }

    public static void printMapDouble(Map<String, Double> map) {
        for ( String key : map.keySet() ) {
            System.out.println( "Term: " + key + ", value: " + map.get(key) );
        }
    }

}

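【Discussion】:

Thanks for the feedback, but as I understand it you do not need to compute TF-IDF in order to compute cosine similarity. You can compute a similarity metric using TF-IDF if you want, but that is not the purpose of the code above. Specifically, I use the algorithm above to test how well some automatic extraction code performs against some human-generated answers on a per-document basis. TF-IDF does not help in that situation, which is why I did not use it.
Also, I would be happy to work with you on optimizing your code; I can see some basic things you could do. It would be better if you posted it under a new question, though, since this question does not mention TF-IDF. You can always reference this question.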

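【Solution 5】:

You can find a better solution at http://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53. Here are the steps:

Java code builds the term vectors from the content with the help of Lucene (see: http://lucene.apache.org/core/).
The cosine calculation between the two documents is done with the commons-math.jar library.

【Discussion】:

Try to write more. Don't just post a link.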

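【Solution 6】:

If you do not need to store the documents in Lucene and only want to calculate the similarity between two documents, here is faster code (Scala, from my blog http://chepurnoy.org/blog/2014/03/faster-cosine-similarity-between-two-dicuments-with-scala-and-lucene/)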

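import org.apache.lucene.analysis.core.StopAnalyzer
import org.apache.lucene.analysis.en.EnglishMinimalStemFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

def extractTerms(content: String): Map[String, Int] = {
     val analyzer = new StopAnalyzer(Version.LUCENE_46)
     val ts = new EnglishMinimalStemFilter(analyzer.tokenStream("c", content))
     val charTermAttribute = ts.addAttribute(classOf[CharTermAttribute])

     val m = scala.collection.mutable.Map[String, Int]()

     // TokenStream contract: reset, consume with incrementToken, then end and close.
     ts.reset()
     while (ts.incrementToken()) {
         val term = charTermAttribute.toString
         val newCount = m.get(term).map(_ + 1).getOrElse(1)
         m += term -> newCount
     }
     ts.end()
     ts.close()

     m.toMap
 }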

def similarity(t1: Map[String, Int], t2: Map[String, Int]): Double = {
     //word, t1 freq, t2 freq
     val m = scala.collection.mutable.HashMap[String, (Int, Int)]()

     val sum1 = t1.foldLeft(0d) {case (sum, (word, freq)) =>
         m += word ->(freq, 0)
         sum + freq
     }

     val sum2 = t2.foldLeft(0d) {case (sum, (word, freq)) =>
         m.get(word) match {
             case Some((freq1, _)) => m += word ->(freq1, freq)
             case None => m += word ->(0, freq)
         }
         sum + freq
     }

     val (p1, p2, p3) = m.foldLeft((0d, 0d, 0d)) {case ((s1, s2, s3), e) =>
         val fs = e._2
         val f1 = fs._1 / sum1
         val f2 = fs._2 / sum2
         (s1 + f1 * f2, s2 + f1 * f1, s3 + f2 * f2)
     }

     val cos = p1 / (Math.sqrt(p2) * Math.sqrt(p3))
     cos
 }

So, to compute the similarity between text1 and text2, just call similarity(extractTerms(text1), extractTerms(text2))
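
For readers working in Java, the same index-free idea can be sketched without any Lucene dependency. This is not a port of the analyzer chain above: the tokenizer here simply lower-cases the text and splits on non-letters, with no stop-word removal and no stemming, so scores will generally differ from the Scala version. The class name IndexFreeSimilarity is made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class IndexFreeSimilarity {

    /** Counts lower-cased runs of letters; a crude stand-in for a Lucene analyzer. */
    static Map<String, Integer> extractTerms(String content) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // split() can yield an empty first token when the text starts with a separator.
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two raw term-count maps. */
    static double similarity(Map<String, Integer> t1, Map<String, Integer> t2) {
        double dot = 0.0;
        double sumSq1 = 0.0;
        double sumSq2 = 0.0;
        for (Map.Entry<String, Integer> e : t1.entrySet()) {
            double f1 = e.getValue();
            sumSq1 += f1 * f1;
            Integer f2 = t2.get(e.getKey());
            if (f2 != null) {
                dot += f1 * f2;
            }
        }
        for (int f2 : t2.values()) {
            sumSq2 += (double) f2 * f2;
        }
        // Assumption: texts without any terms are treated as similarity 0 instead of NaN.
        if (sumSq1 == 0.0 || sumSq2 == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(sumSq1) * Math.sqrt(sumSq2));
    }

    public static void main(String[] args) {
        System.out.println(similarity(extractTerms("This is good"),
                                      extractTerms("This is really good")));
    }
}
```

The Scala version divides each count by the document's total before taking the cosine; that scaling cancels out in the ratio, so this sketch works on raw counts directly.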

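【Discussion】:

【Solution 7】:

Calculating cosine similarity in Lucene 4.x is different from 3.x. The following post explains in detail all the code needed to calculate cosine similarity in Lucene 4.10.2. ComputerGodzilla: Calculated Cosine Similarity in Lucene!

【Discussion】: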