How to obtain a line-level measure of the similarity of two aligned texts with Spacy?

【Question title】: How to obtain a line-level measure of the similarity of two aligned texts with Spacy?
【Posted】: 2020-04-27 10:43:23
【Question description】:

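I have two aligned English documents with the same number of lines each (about 30k). I want a similarity measure for every line pair, i.e. line_1 in text_a against line_1 in text_b, line_2 in text_a against line_2 in text_b, and so on. (Each line may contain more than one sentence.) This is what I have so far: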

import spacy 
nlp = spacy.load('en_core_web_lg')

file_a = open('text-1.txt', 'r')
file_b = open ('text-2.txt', 'r')
a_doc = nlp(file_a)
b_doc = nlp(file_b)

for a,b in zip(a_doc, b_doc):    
    print("similarity:", a.similarity(b))

But I get the following error:

if len(text) > self.max_length:
TypeError: object of type '_io.TextIOWrapper' has no len()

Can you help me? Many thanks.

【Question comments】:

Tags: python list file-handling spacy

【Solution 1】:

nlp expects a string, not a file handle object.

Try this:

a_doc = nlp("".join(file_a.readlines()))
b_doc = nlp("".join(file_b.readlines()))

【Comments】:

【Solution 2】:

nlp() expects a string, not a file object. I edited your code slightly to:

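import spacy

nlp = spacy.load('en_core_web_sm')

# Read each file into a string; nlp() cannot take a file object.
with open('text-1.txt', 'r') as file_a, open('text-2.txt', 'r') as file_b:
    a_doc = nlp(file_a.read())
    b_doc = nlp(file_b.read())

# Iterating over a Doc yields Token objects, so this compares token pairs.
for a, b in zip(a_doc, b_doc):
    print("similarity:", a.similarity(b))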

It runs fine.
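Note that iterating over a Doc yields tokens, so the loop above pairs up tokens rather than lines. Also, the small `en_core_web_sm` model does not ship with word vectors, so the `en_core_web_lg` model from the question is likely the better choice for similarity scores. Below is a minimal sketch of a line-by-line comparison. It assumes one aligned unit per line and UTF-8 files; the helper names `line_similarities` and `compare_files` are made up for illustration and are not part of spaCy.

```python
from typing import Iterable, Iterator


def line_similarities(nlp, lines_a: Iterable[str], lines_b: Iterable[str]) -> Iterator[float]:
    """Yield one similarity score per aligned pair of lines."""
    # nlp.pipe yields one Doc per input string, so each line becomes
    # its own Doc even when it holds several sentences.
    docs_a = nlp.pipe(line.strip() for line in lines_a)
    docs_b = nlp.pipe(line.strip() for line in lines_b)
    for doc_a, doc_b in zip(docs_a, docs_b):
        yield doc_a.similarity(doc_b)


def compare_files(path_a: str, path_b: str, model: str = "en_core_web_lg") -> None:
    """Print the similarity of line i in path_a and line i in path_b."""
    import spacy  # imported lazily; only this helper needs spaCy installed

    nlp = spacy.load(model)
    # encoding="utf-8" is an assumption about the input files.
    with open(path_a, encoding="utf-8") as file_a, open(path_b, encoding="utf-8") as file_b:
        scores = line_similarities(nlp, file_a, file_b)
        for number, score in enumerate(scores, start=1):
            print(f"line {number}: similarity {score:.3f}")
```

With the files from the question, this would be called as `compare_files('text-1.txt', 'text-2.txt')`. Because the files are streamed line by line, neither 30k-line text has to be turned into a single large Doc.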

【Comments】: