How to estimate the importance of a query for a particular document? Answers

【Question title】: How to estimate the importance of a query for a particular document?
【Posted】: 2019-06-02 23:28:24
【Question description】:

I have a query and two documents, each given as a list of words:

q = ['hi', 'how', 'are', 'you']

doc1 = ['hi', 'there', 'guys']

doc2 = ['how', 'is', 'it', 'going']

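Is there a way to compute a "relevance" or importance score between q and each of doc1 and doc2? My intuition tells me I could do this with IDF. So here is an implementation of IDF: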

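from math import log

def IDF(term, allDocs):
    # allDocs maps a document id to that document's text (one string)
    docsWithTheTerm = 0
    for doc in allDocs:
        if term.lower() in allDocs[doc].lower().split():
            docsWithTheTerm = docsWithTheTerm + 1
    # decide only after every document has been checked
    if docsWithTheTerm > 0:
        return 1.0 + log(float(len(allDocs)) / docsWithTheTerm)
    else:
        return 1.0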

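However, this on its own does not give me anything like a "relevance score". Is IDF the right way to get a relevance score? If IDF is not the right way to measure the importance of a query for a given document, how can I get something like a "relevance score"?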

【Discussion】:

Tags: python machine-learning nlp artificial-intelligence information-retrieval

【Solution 1】:

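The premise of tf-idf is to emphasize the rarer words in a text: words that are too common appear everywhere, so focusing on them does not help you tell which words are meaningful and which are not.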
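As a rough reference, one common textbook form of the weighting is sketched below. The symbols are assumptions introduced here for illustration: $N$ is the number of documents, $\mathrm{df}(t)$ is the number of documents that contain term $t$, and $\mathrm{tf}(t, d)$ is how often $t$ occurs in document $d$. Libraries usually add smoothing or normalization on top of this, so their exact values can differ.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Under this form, a term that occurs in every document has $\mathrm{df}(t) = N$ and therefore an idf of $\log 1 = 0$, while a term found in only one document gets the largest weight, $\log N$.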

For your example, here is one way to implement tf-idf in Python:

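import re
from nltk.tokenize import word_tokenize  # requires NLTK's Punkt tokenizer data (nltk.download)

doc1 = ['hi', 'there', 'guys']
doc2 = ['how', 'is', 'it', 'going']

# str() turns each list into its printed form, e.g. "['hi', 'there', 'guys']"
doc1=str(doc1)
doc2=str(doc2)
stringdata=doc1+doc2

# replace every run of non-letter characters (brackets, quotes, commas) with a space
text2=re.sub('[^A-Za-z]+', ' ', stringdata)

# split the cleaned string back into individual words
text3=word_tokenize(text2)
print(text3)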

The words have now been tokenized, as shown below:

['hi', 'there', 'guys', 'how', 'is', 'it', 'going']

Then, generate a matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
# each element of text3 (a single word) is treated as its own one-word document
matrix = vectorizer.fit_transform(text3).todense()

Here is the matrix output:

matrix([[0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1.],
        [0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.],
        [1., 0., 0., 0., 0., 0., 0.]])

However, to make sense of this matrix, we now store it as a pandas DataFrame, with each word's summed tf-idf weight sorted in ascending order:

import pandas as pd

# transform the matrix to a pandas df, one column per word
# (get_feature_names_out() requires scikit-learn >= 1.0)
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
# sum each word's column over all documents (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=True)

This is what we end up with:

going    1.0
guys     1.0
hi       1.0
how      1.0
is       1.0
it       1.0
there    1.0
dtype: float64

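In this example the words carry very little context: all three sentences are common greetings. tf-idf therefore does not necessarily reveal anything meaningful here, but for a text of, say, 1000 or more words, tf-idf can be very useful for determining the relative importance of words. For example, you might decide that words appearing between 20 and 100 times in a text are rare, yet frequent enough to be worth attention.

In this particular case, a relevance score could potentially be obtained by counting how many times the words in the query appear in the document in question, particularly the words that tf-idf marks as important.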
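The idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `idf` and `relevance` are made up here, the IDF uses the same `1 + log(N / df)` form as the function in the question, and term frequency is taken as the raw count divided by the document length.

```python
from collections import Counter
from math import log


def idf(term, docs):
    """IDF in the question's form: 1 + log(N / df), or 1.0 if no document has the term."""
    df = sum(1 for doc in docs if term in doc)
    return 1.0 + log(len(docs) / df) if df > 0 else 1.0


def relevance(query, doc, docs):
    """Sum tf * idf over the distinct query terms that occur in doc."""
    counts = Counter(doc)
    return sum(
        counts[t] / len(doc) * idf(t, docs)  # tf = share of the document made up by t
        for t in set(query)
        if t in counts
    )


q = ['hi', 'how', 'are', 'you']
doc1 = ['hi', 'there', 'guys']
doc2 = ['how', 'is', 'it', 'going']
docs = [doc1, doc2]

scores = {name: relevance(q, d, docs) for name, d in [('doc1', doc1), ('doc2', doc2)]}
print(scores)
```

With the two documents from the question, `doc1` scores higher than `doc2`: each document matches exactly one query word with the same IDF, but `doc1` is shorter, so the matching word makes up a larger share of it. Other choices (raw counts, length normalization, cosine similarity between tf-idf vectors) would give different but comparable rankings.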

【Discussion】:

【Solution 2】:

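Basically, you have to represent the words as numbers in some way so that you can do arithmetic on them to find "similarity". TF-IDF is one such way, and Michael Grogan's answer should get you started there.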

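Another approach is to use a pretrained Word2Vec or GloVe model. These word-embedding models map each word to a vector of numbers that represents the word's semantic meaning.

Libraries such as Gensim make it very easy to use pretrained embedding models to measure similarity. See here: https://github.com/RaRe-Technologies/gensim-data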
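To make the embedding idea concrete, here is a minimal sketch that averages word vectors per text and compares the averages with cosine similarity. Everything in it is an assumption for illustration: the three-dimensional `toy_vectors` are invented numbers, not real embeddings, and the helpers `mean_vector` and `cosine` are hypothetical names. With a real pretrained model you would look the vectors up in that model instead.

```python
from math import sqrt

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Real Word2Vec/GloVe vectors typically have 50 to 300 dimensions.
toy_vectors = {
    'hi':    [0.9, 0.1, 0.0],
    'guys':  [0.3, 0.7, 0.1],
    'how':   [0.1, 0.2, 0.9],
    'going': [0.2, 0.1, 0.8],
}


def mean_vector(words, vectors):
    """Average the vectors of the words that have an embedding; None if none do."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


q = ['hi', 'how', 'are', 'you']
doc1 = ['hi', 'there', 'guys']
doc2 = ['how', 'is', 'it', 'going']

q_vec = mean_vector(q, toy_vectors)
sim1 = cosine(q_vec, mean_vector(doc1, toy_vectors))
sim2 = cosine(q_vec, mean_vector(doc2, toy_vectors))
print(sim1, sim2)
```

Words without a vector are simply skipped, which is one simple way to handle out-of-vocabulary terms. Because the numbers are made up, the resulting similarities say nothing about the actual texts; only the mechanics carry over to real embeddings.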

===

Edit: for more advanced word embeddings, take a look at ELMo or BERT.

【Discussion】: