Cosine similarity for already known pairs of duplicates: answers

【Question title】: Cosine similarity for already known pairs of duplicates
【Posted】: 2017-09-17 12:06:42
【Question description】:

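I have a list of pairs of duplicate documents saved in a csv file. Each ID in column 1 is a duplicate of the corresponding ID in column 2. The file looks like this: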

Document_ID1    Document_ID2
12345           87565
34546           45633
56453           78645
35667           67856
13636           67845

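Each document ID is associated with text that is stored elsewhere. I extracted this text and saved each column of IDs, together with the associated text, into two lsm databases.
So I have db1, which has all the IDs from Document_ID1 as keys and their texts as the values of the respective keys, just like a dictionary. Similarly, db2 holds all the IDs from Document_ID2.
So when I say db1[12345], I get the text associated with ID 12345.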

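Now I want to get the cosine similarity score between each pair to determine how closely they duplicate each other. So far, I have run a tfidf model to do this. I created a tfidf matrix with all the documents in db1 as the corpus, and I measured the cosine similarity of each tfidf vector from db2 against that tfidf matrix. For security reasons, I cannot provide the complete code. The code looks like this: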

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Generator function that yields the text of one key (document) at a time
def corpus_gen(db):
    for key in db.keys():
        text = db[key]
        yield text

# Use spaCy to preprocess each text coming out of the generator function.
# Named spacy_proc so that it does not shadow the spacy module.
nlp = spacy.load('en')
def spacy_proc(generator_object):
    for doc in generator_object:
        words = <code to make words lower case, remove stop words, spaces and punctuations>
        yield u' '.join(words)

# TF-IDF Vectorizer
tfidf = TfidfVectorizer(min_df=2)

# Fit the tf-idf model on all documents from db1; the matrix has one row per db1 key.
tfidf_matrix = tfidf.fit_transform(spacy_proc(corpus_gen(db1)))

# Function to calculate cosine similarity values between the tfidf matrix and the tfidf vector of a new key
def similarity(tfidf_vector, tfidf_matrix, keys):
    sim_vec = <code to get cosine similarity>
    return sim_vec.sort_values(ascending=False)

# Apply the fitted tf-idf transformer to each db2 key in a loop and get its cosine similarity scores.
for key in db2.keys():
    # Create a new temporary db for each key from db2 to enter into the generator function
    new = <code to create a temporary new lsm database>
    text = db2[key]
    new[key] = text
    new_key = <code to get next key from the temporary new lsm database>
    tfidf_vector = tfidf.transform(spacy_proc(corpus_gen(new)))
    similarity_values = similarity(tfidf_vector, tfidf_matrix, list(db1.keys()))
    for idx, i in similarity_values.items():
        print(new_key, idx, i)
    del new[key]

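But this gives me, for each key in db2, cosine similarity scores against all the keys in db1. For example, if there are 5 keys in db1 and 5 keys in db2, I get 25 rows as the result of this code.
What I want is the cosine similarity score only for the db1 key that corresponds to each db2 key. That means that with 5 keys each in db1 and db2, I should get only 5 rows as a result: the cosine similarity score of each duplicate pair.

How should I adjust my code to get that?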
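
The pairwise output described above could be sketched roughly as follows. This is only an illustration under stated assumptions: plain dicts stand in for the two lsm databases, the texts and the `pairs` list are hypothetical toy data, the spaCy preprocessing step is skipped, and `min_df` is left at its default because the toy corpus is tiny. It relies on `TfidfVectorizer` L2-normalising each row by default, so a row-wise dot product gives the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the two lsm databases (plain dicts: ID -> text).
db1 = {
    12345: "the quick brown fox jumps over the lazy dog",
    34546: "tf-idf weights rare terms more heavily than common terms",
    56453: "cosine similarity compares the angle between two vectors",
}
db2 = {
    87565: "the quick brown fox jumped over a lazy dog",
    45633: "rare terms get heavier tf-idf weights than common terms",
    78645: "stock prices fell sharply on monday",
}

# Known duplicate pairs, laid out like the csv file: (Document_ID1, Document_ID2).
pairs = [(12345, 87565), (34546, 45633), (56453, 78645)]

# Fit on db1 only, mirroring the question.
tfidf = TfidfVectorizer()
tfidf.fit(list(db1.values()))

# Transform both sides in pair order, so row i of each matrix belongs to pair i.
left = tfidf.transform([db1[id1] for id1, _ in pairs])
right = tfidf.transform([db2[id2] for _, id2 in pairs])

# Rows are L2-normalised by default (norm='l2'), so the row-wise dot product
# is the cosine similarity. A text with no known vocabulary scores 0.
pair_scores = np.asarray(left.multiply(right).sum(axis=1)).ravel()

for (id1, id2), score in zip(pairs, pair_scores):
    print(id1, id2, round(float(score), 3))
```

This produces one score per pair instead of one per combination, and it never builds the full db2-by-db1 similarity matrix.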

【Question discussion】:

Tags: python nlp tf-idf cosine-similarity spacy

【Solution 1】:

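Since there is no definitive answer yet, I am taking the dataframe that contains all the rows (like the 25-row result in the example above) and doing an inner join/merge with the dataframe that holds the list of duplicate pairs (i.e. the 5 rows of output I need). That way, the resulting dataframe holds the similarity scores for the duplicate document pairs. This is a temporary workaround. If anyone can suggest a cleaner solution that works, I will accept it as the answer.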
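
A minimal sketch of that inner join, assuming pandas and assuming the all-against-all output has been collected into a dataframe. The dataframe names, the `similarity` column name, and the score values are made up for illustration; only the `Document_ID1` and `Document_ID2` column names come from the csv file.

```python
import pandas as pd

# Hypothetical all-against-all output of the loop: one row per (db2 key, db1 key).
# The similarity values are illustrative, not real results.
all_scores = pd.DataFrame(
    {
        "Document_ID2": [87565, 87565, 45633, 45633],
        "Document_ID1": [12345, 34546, 12345, 34546],
        "similarity": [0.91, 0.05, 0.02, 0.88],
    }
)

# Known duplicate pairs, as they would be read from the csv file.
duplicate_pairs = pd.DataFrame(
    {"Document_ID1": [12345, 34546], "Document_ID2": [87565, 45633]}
)

# The inner join on both ID columns keeps only the rows for known duplicate pairs.
pair_scores = duplicate_pairs.merge(
    all_scores, on=["Document_ID1", "Document_ID2"], how="inner"
)
print(pair_scores)
```

The join still requires computing every db2-against-db1 score first, which is why it is only a workaround for larger collections.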

【Discussion】: