【Posted】: 2015-05-24 21:05:58
【Question】:
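I am trying to use a Java class to measure the cosine similarity between two documents of different lengths. The code of the class responsible for this calculation is as follows: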
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;

public class CosineSimilarityy {
    public Double calculateCosineSimilarity(HashMap<String, Double> firstFeatures, HashMap<String, Double> secondFeatures) {
        Double similarity = 0.0;
        Double sum = 0.0; // the numerator of the cosine similarity
        Double fnorm = 0.0; // the first part of the denominator of the cosine similarity
        Double snorm = 0.0; // the second part of the denominator of the cosine similarity
        Set<String> fkeys = firstFeatures.keySet();
        Iterator<String> fit = fkeys.iterator();
        // only keys present in both maps contribute to the numerator
        while (fit.hasNext()) {
            String featurename = fit.next();
            boolean containKey = secondFeatures.containsKey(featurename);
            if (containKey) {
                sum = sum + firstFeatures.get(featurename) * secondFeatures.get(featurename);
            }
        }
        fnorm = calculateNorm(firstFeatures);
        snorm = calculateNorm(secondFeatures);
        similarity = sum / (fnorm * snorm);
        return similarity;
    }

    /**
     * calculate the norm of one feature vector
     *
     * @param feature the feature vector of one cluster
     * @return the Euclidean norm of the feature vector
     */
    public Double calculateNorm(HashMap<String, Double> feature) {
        Double norm = 0.0;
        Set<String> keys = feature.keySet();
        Iterator<String> it = keys.iterator();
        while (it.hasNext()) {
            String featurename = it.next();
            norm = norm + Math.pow(feature.get(featurename), 2);
        }
        return Math.sqrt(norm);
    }
}
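I then construct an instance of this class, create two HashMaps, and assign each document to one of them. When I run the calculation, the result is 1.0 if the documents are identical, which is correct, but if there is any difference between them, however slight, the result is always zero. What am I missing?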
public static void main(String[] args) {
    // TODO code application logic here
    CosineSimilarityy test = new CosineSimilarityy();
    HashMap<String, Double> hash = new HashMap<>();
    HashMap<String, Double> hash2 = new HashMap<>();
    hash.put("i am a book", 1.0);
    hash2.put("you are a book", 2.0);
    double result;
    result = test.calculateCosineSimilarity(hash, hash2);
    System.out.println(" this is the result: " + result);
}
The original code was taken from here.
【Comments】:
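-
You passed two different features into the function, which will always give a similarity of zero.
-
@ThomasJungblut But then why is the result 1 when they are the same? Also, the function expects two HashMaps. So if I am doing it wrong, how do I fix it?
-
"but then why when are the same, it results to 1?" Well, you want to compute a similarity, and if the inputs are identical it is 1.
-
@ThomasJungblut You just said they are two different features, hence the zero. However, as I understand cosine similarity, it should return a real number between zero and one. Am I wrong?
-
Then you have to supply "a" and "book" as shared features. "i", "am", "a", "book" is a different representation from "I am a book". How is the method supposed to know that you meant to split by words?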
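To illustrate the point made in the last comment, here is a minimal sketch, assuming that each document should be split on whitespace into lower-cased words and that each word's count serves as its feature weight. The class name CosineSimilaritySketch and the methods termFrequencies, norm and cosineSimilarity are hypothetical and not part of the original code; returning 0 for an empty vector is likewise an added assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration; not the code from the question.
final class CosineSimilaritySketch {

    // Assumption: split on whitespace and use lower-cased word counts as features.
    static Map<String, Double> termFrequencies(String document) {
        Map<String, Double> features = new HashMap<>();
        for (String token : document.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                features.merge(token, 1.0, Double::sum);
            }
        }
        return features;
    }

    // Euclidean norm of a sparse feature vector.
    static double norm(Map<String, Double> features) {
        double sumOfSquares = 0.0;
        for (double value : features.values()) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Dot product over the shared keys, divided by the product of the norms.
    static double cosineSimilarity(Map<String, Double> first, Map<String, Double> second) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : first.entrySet()) {
            Double other = second.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double denominator = norm(first) * norm(second);
        // Assumption: an empty vector yields similarity 0 instead of NaN.
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        // Whole sentences as single keys: no shared key, so the numerator is 0.
        Map<String, Double> wholeA = new HashMap<>();
        Map<String, Double> wholeB = new HashMap<>();
        wholeA.put("i am a book", 1.0);
        wholeB.put("you are a book", 2.0);
        System.out.println(cosineSimilarity(wholeA, wholeB)); // prints 0.0

        // Word-level keys: "a" and "book" are shared, so the result is 2 / (2 * 2).
        Map<String, Double> wordsA = termFrequencies("i am a book");
        Map<String, Double> wordsB = termFrequencies("you are a book");
        System.out.println(cosineSimilarity(wordsA, wordsB)); // prints 0.5
    }
}
```

With one whole sentence per key, the two maps can only match when the sentences are exactly equal, which is why the original test yields either 1.0 or 0.0. Once the keys are individual words, partial overlap produces values in between.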
Tags: java hashmap cosine-similarity