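【Posted】: 2020-12-29 06:10:09
【Question】:
I am working on a project in which I have to find the documents relevant to each query, one query at a time. First I computed the TF and IDF of every word in every document. Then I multiplied TF by IDF and stored each term of a given document, together with its TF-IDF score, in a List. The class named Tfidf below computes TF and IDF: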
import java.util.List;

public class Tfidf {
    // Term frequency: occurrences of term in the document divided by the document length.
    public double TF(String[] document, String term) {
        double value = 0;
        for (String s : document) {
            if (s.equalsIgnoreCase(term)) {
                value++;
            }
        }
        return document.length == 0 ? 0 : value / document.length;
    }
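    // Inverse document frequency: 1 + ln(N / df), where df is the number of documents containing term.
    public double idf(List<String> alldocument, String term) {
        double b = alldocument.size();
        double count = 0;
        for (String document : alldocument) {
            String[] f = document.replaceAll("[^a-zA-Z0-9 ]", " ").trim().replaceAll(" +", " ").toLowerCase().split(" ");
            for (String ss : f) {
                if (ss.equalsIgnoreCase(term)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        // A term that occurs in no document (possible for query words) would otherwise divide by zero.
        if (count == 0) {
            return 0;
        }
        return 1 + Math.log(b / count);
    }
}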
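For reference, the weighting that the two methods compute can be written out as below. The symbols are notation introduced here, not taken from the post: f_{t,d} is the number of times term t occurs in document d, |d| is the number of tokens in d, N is the number of documents, and n_t is the number of documents that contain t. The logarithm is the natural one, since the code calls Math.log.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{N}{n_t} \quad (n_t > 0), \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```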
Here is the code where I multiply TF by IDF:
List<String> alldocument = new ArrayList<>();
List<double[]> tfidfVector = new ArrayList<>();

public void TfIdf() {
    for (int i = 0; i < alldocument.size(); i++) {
        // Tokenize each document once instead of once per term.
        String[] file = alldocument.get(i).replaceAll("[^a-zA-Z0-9 ]", " ").trim().replaceAll(" +", " ").toLowerCase().split(" ");
        double[] tfidfvector = new double[allterm.size()]; // allterm holds every unique word across all documents
        // The index must live outside the term loop; resetting it per term wrote every score to slot 0.
        int count = 0;
        for (String terms : allterm) {
            double tf = new Tfidf().TF(file, terms);
            double idf = new Tfidf().idf(alldocument, terms);
            tfidfvector[count] = tf * idf;
            count++;
        }
        tfidfVector.add(tfidfvector);
    }
}
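Can anyone tell me how to compute the TF-IDF vector for a query, for example "life and learning"? And how do I compute the cosine similarity between the query and every document, so that I can see how similar the query is to each of them?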
【Discussion】:
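One possible approach, sketched under a few assumptions: treat the query as one more short document, tokenize it the same way, and build its vector over the same vocabulary (the unique words of all documents) using the IDF values of the document collection. Query words that appear in no document, such as "and" in the sample data below, then simply contribute nothing. The class name QueryRanker, its method names, and the three sample documents are hypothetical and only serve the illustration; the TF and IDF definitions follow the ones in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; names are illustrative, not from the original post.
class QueryRanker {

    // Same normalization as in the question: keep letters and digits, lowercase, split on spaces.
    static String[] tokenize(String text) {
        return text.replaceAll("[^a-zA-Z0-9 ]", " ").trim().replaceAll(" +", " ").toLowerCase().split(" ");
    }

    // Term frequency: occurrences of term divided by the number of tokens.
    static double tf(String[] tokens, String term) {
        int count = 0;
        for (String t : tokens) {
            if (t.equals(term)) count++;
        }
        return tokens.length == 0 ? 0.0 : (double) count / tokens.length;
    }

    // IDF as in the question, 1 + ln(N / df); returns 0 when no document contains the term.
    static double idf(List<String[]> docs, String term) {
        int df = 0;
        for (String[] d : docs) {
            if (Arrays.asList(d).contains(term)) df++;
        }
        return df == 0 ? 0.0 : 1 + Math.log((double) docs.size() / df);
    }

    // TF-IDF vector of a token list over a fixed vocabulary order.
    static double[] vector(String[] tokens, List<String> vocabulary, List<String[]> docs) {
        double[] v = new double[vocabulary.size()];
        for (int i = 0; i < v.length; i++) {
            String term = vocabulary.get(i);
            v[i] = tf(tokens, term) * idf(docs, term);
        }
        return v;
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // An all-zero vector has no direction; report similarity 0 in that case.
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine similarity between the query and each document, in document order.
    static double[] scores(List<String> documents, String query) {
        List<String[]> docs = new ArrayList<>();
        Set<String> vocab = new LinkedHashSet<>();
        for (String d : documents) {
            String[] tokens = tokenize(d);
            docs.add(tokens);
            vocab.addAll(Arrays.asList(tokens));
        }
        List<String> vocabulary = new ArrayList<>(vocab);
        double[] queryVector = vector(tokenize(query), vocabulary, docs);
        double[] result = new double[docs.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = cosine(queryVector, vector(docs.get(i), vocabulary, docs));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample documents, for demonstration only.
        List<String> documents = Arrays.asList(
                "Life is about learning",
                "The weather is cold today",
                "Learning never stops in life");
        double[] s = scores(documents, "life and learning");
        for (int i = 0; i < s.length; i++) {
            System.out.println("doc " + i + ": " + s[i]);
        }
    }
}
```

Sorting the document indices by descending score gives a ranking. In the sample data the second document shares no word with the query, so its score is 0. The sketch recomputes IDF for every vector entry to stay short; for a larger collection it would probably be worth computing the IDF of each vocabulary term once and reusing it.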
Tags: java list arraylist vector tf-idf