Calculating cosine similarity

【Question title】: Calculating cosine similarity
【Posted】: 2015-05-24 21:05:58
【Question description】:

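I am trying to apply a Java class that measures the cosine similarity between two documents of different lengths. The code of the class responsible for this calculation is as follows: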

import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;

public class CosineSimilarityy {
    public Double calculateCosineSimilarity(HashMap<String, Double> firstFeatures, HashMap<String, Double> secondFeatures) {
        Double similarity = 0.0;
        Double sum = 0.0; // the numerator of the cosine similarity
        Double fnorm = 0.0; // the first part of the denominator of the cosine similarity
        Double snorm = 0.0; // the second part of the denominator of the cosine similarity
        Set<String> fkeys = firstFeatures.keySet();
        Iterator<String> fit = fkeys.iterator();
        while (fit.hasNext()) {
            String featurename = fit.next();
            boolean containKey = secondFeatures.containsKey(featurename);
            if (containKey) {
                sum = sum + firstFeatures.get(featurename) * secondFeatures.get(featurename);
            }
        }
        fnorm = calculateNorm(firstFeatures);
        snorm = calculateNorm(secondFeatures);
        similarity = sum / (fnorm * snorm);
        return similarity;
    }

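    /**
     * Calculates the Euclidean norm of one feature vector.
     *
     * @param feature the feature vector (feature name mapped to its weight)
     * @return the square root of the sum of the squared weights
     */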
    public Double calculateNorm(HashMap<String, Double> feature) {
        Double norm = 0.0;
        Set<String> keys = feature.keySet();
        Iterator<String> it = keys.iterator();
        while (it.hasNext()) {
            String featurename = it.next();
            norm = norm + Math.pow(feature.get(featurename), 2);
        }
        return Math.sqrt(norm);
    }
}

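I then construct an instance of this class, create two HashMaps, and assign each document to one of them. When I run the calculation, the result is 1.0 if the documents are identical, which is correct, but if there is any slight difference between them, the result is always zero. What am I missing?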

public static void main(String[] args) {
    CosineSimilarityy test = new CosineSimilarityy();
    HashMap<String, Double> hash = new HashMap<>();
    HashMap<String, Double> hash2 = new HashMap<>();
    hash.put("i am a book", 1.0);
    hash2.put("you are a book", 2.0);
    double result;
    result = test.calculateCosineSimilarity(hash, hash2);
    System.out.println(" this is the result: " + result);
}

The original code is taken from here.

【Question discussion】:

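You are feeding two different features into the function, which will always result in a similarity of zero.
@ThomasJungblut But then why is the result 1 when they are the same? Also, the function expects two HashMaps. So if I'm doing it wrong, how do I fix it?
"but then why when are the same, it results to 1?" Well, you want to compute similarity, and if they are the same, it is 1.
@ThomasJungblut You just said they are two different features, hence the zero. However, from what I understand of cosine similarity, it should give a real number between zero and one. Am I wrong?
Then you have to supply "a" and "book" as shared features. "i", "am", "a", "book" is a different representation from "I am a book". How should the method know that you mean to split by words?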

Tags: java hashmap cosine-similarity

【Solution 1】:

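First, I think "i am a book" is being treated as a single feature. To compare the documents, you first have to split the strings being compared, using whitespace as the delimiter. Next, populate the hash maps with the corresponding words extracted from the book titles. Then you can test whether your algorithm works correctly.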

How do I split a string with any whitespace chars as delimiters?

Cosine similarity (Wikipedia)
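
The splitting step described in this answer can be sketched roughly as follows. This is a minimal sketch, not the code from the question: the class name TermFrequencyCosine and its helper methods are hypothetical, each word's weight is assumed to be its raw count in the document, words are lower-cased as an extra assumption, and an empty document is assumed to have similarity 0 instead of NaN.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, written only to illustrate the answer above.
class TermFrequencyCosine {

    // Split a document on whitespace and count how often each word occurs.
    static Map<String, Double> termFrequencies(String document) {
        Map<String, Double> frequencies = new HashMap<>();
        // Assumption: lower-casing makes "Book" and "book" the same feature.
        for (String word : document.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                frequencies.merge(word, 1.0, Double::sum);
            }
        }
        return frequencies;
    }

    // Dot product over shared words divided by the product of the two norms.
    static double cosineSimilarity(Map<String, Double> first, Map<String, Double> second) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : first.entrySet()) {
            Double other = second.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double norms = norm(first) * norm(second);
        // Assumption: an empty document yields 0 rather than NaN.
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    // Euclidean norm of a feature vector.
    static double norm(Map<String, Double> features) {
        double sumOfSquares = 0.0;
        for (double value : features.values()) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> first = termFrequencies("i am a book");
        Map<String, Double> second = termFrequencies("you are a book");
        // Shared words are "a" and "book"; both norms are 2, so this prints 0.5.
        System.out.println("similarity: " + cosineSimilarity(first, second));
    }
}
```

With whole sentences as keys, the two maps in the question share no key, so the numerator is 0. With words as keys, "a" and "book" are shared and the similarity for the same two sentences becomes 0.5.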

【Discussion】:

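Do you mean that I first have to break each string down into characters, put them into HashMaps, and then calculate the similarity between those two HashMaps?
Yes, that is what I mean; I'll update the answer. What are you using as the Double value of the features? Here is a hint: it should be the term frequency over the whole document: en.wikipedia.org/wiki/Cosine_similarity
Honestly, I am confused by the Double part of the HashMap that the class takes. In fact, I found the cosine similarity I wanted here, but it is in Python rather than Java. So I searched for a Java version and found the code posted here. But it seems to have a problem, and I doubt it works for strings of different lengths. Right?
I guess you are right: the whole strings should be split into characters and then put into the HashMaps, and then it will work even when the lengths differ. As a last question, is it wise to treat synonyms as the same feature? I mean, for example, should the similarity between i am a cook and i am a chef be 1?
Good idea. If you use synonyms you will probably find more relevant matches and also more irrelevant ones, like the funny translations from Google Translate.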
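
One way to treat synonyms as the same feature, as discussed in the comments above, is to map each word to a canonical form before counting. The following is only a sketch under stated assumptions: the class SynonymFeatures and its one-entry synonym table are hypothetical and hand-written for illustration, and a real application would need a proper thesaurus. It requires Java 9 or later for Map.of.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, written only to illustrate the synonym idea.
class SynonymFeatures {

    // Assumed, hand-written synonym table: word -> canonical feature name.
    static final Map<String, String> CANONICAL = Map.of("chef", "cook");

    // Count words after replacing each one with its canonical form, if it has one.
    static Map<String, Double> termFrequencies(String document) {
        Map<String, Double> frequencies = new HashMap<>();
        for (String word : document.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String feature = CANONICAL.getOrDefault(word, word);
            frequencies.merge(feature, 1.0, Double::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Double> cook = termFrequencies("i am a cook");
        Map<String, Double> chef = termFrequencies("i am a chef");
        // Both sentences produce the same feature map, so this prints true.
        System.out.println(cook.equals(chef));
    }
}
```

Because the two feature maps are identical, their cosine similarity is 1. As the last comment notes, the trade-off is that merging synonyms can also raise the similarity of documents that are not actually related.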