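【Posted】: 2021-12-07 16:45:44
【Question】:
I have two directories of files. One contains human-transcribed files and the other contains IBM Watson transcriptions. Both directories hold the same number of files, and both were transcribed from the same phone recordings.
I am using spaCy's .similarity to compute the cosine similarity between matching files, and to print or store the result together with the names of the compared files. Besides a for loop, I have also tried iterating with a function, but I cannot find a way to iterate over both directories, compare the two files at matching indices, and print the result.
Here is my current code: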
# iterate through files in both directories
for human_file, api_file in os.listdir(human_directory), os.listdir(api_directory):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(human_file).read())
    api_model = nlp_small(open(api_file).read())
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))
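I have gotten it to iterate over a single directory and confirmed, by printing the file names, that it produces the expected output, but it does not work with two directories. I also tried something like this: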
# define directories
human_directory = os.listdir("./00_data/Human Transcripts")
api_directory = os.listdir("./00_data/Watson Scripts")
# function for cosine similarity of files in two directories using small model
def nlp_small(human_directory, api_directory):
    for i in (0, (len(human_directory) - 1)):
        print(human_directory[i], api_directory[i])

nlp_small(human_directory, api_directory)
This returns:
human_10.txt watson_10.csv
human_9.txt watson_9.csv
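But that is only two of the files, not all 17.
Any pointers on iterating over matching indices in two directories would be much appreciated.
EDIT: Thanks to @kevinjiang, here is the working code block: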
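For reference, both attempts above end up looping over a tuple rather than over the files themselves. The following is a minimal sketch of that behaviour, using short stand-in lists with hypothetical file names in place of the real os.listdir() results:

```python
# Stand-in file lists (hypothetical names) instead of real os.listdir() results
human_files = ["human_2.txt", "human_3.txt", "human_4.txt"]
api_files = ["watson_2.csv", "watson_3.csv", "watson_4.csv"]

# Second attempt: (0, n - 1) is a tuple holding just two indices, not a range,
# so only the first and the last file are visited
visited = [i for i in (0, len(human_files) - 1)]
print(visited)

# range(n) visits every index instead
all_indices = list(range(len(human_files)))
print(all_indices)

# First attempt: "a, b in list1, list2" loops over the tuple (list1, list2)
# and then tries to unpack each whole list into two names
try:
    for human_file, api_file in human_files, api_files:
        pass
except ValueError as err:
    print("unpacking failed:", err)

# zip() pairs the items position by position
pairs = list(zip(human_files, api_files))
print(pairs)
```

With 17 files per directory the first attempt would fail the same way, because each 17-item list cannot be unpacked into two names.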
import os
import spacy

# load the small spaCy English model
nlp_small = spacy.load('en_core_web_sm')

# set the directories containing transcripts
human_directory = os.path.join(os.getcwd(), "00_data", "Human Transcripts")
api_directory = os.path.join(os.getcwd(), "00_data", "Watson Scripts")
# iterate through files in both directories; sorted() keeps the pairing deterministic
for human_file, api_file in zip(sorted(os.listdir(human_directory)), sorted(os.listdir(api_directory))):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(os.path.join(human_directory, human_file)).read())
    api_model = nlp_small(open(os.path.join(api_directory, api_file)).read())
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))
Here is most of the output (a UTF-16 character in one file stops the loop and still needs fixing):
Similarity using small model: human_10.txt watson_10.csv 0.9274665883462793
Similarity using small model: human_11.txt watson_11.csv 0.9348740684005554
Similarity using small model: human_12.txt watson_12.csv 0.9362025469343344
Similarity using small model: human_13.txt watson_13.csv 0.9557355330988958
Similarity using small model: human_14.txt watson_14.csv 0.9088701120190216
Similarity using small model: human_15.txt watson_15.csv 0.9479464053189846
Similarity using small model: human_16.txt watson_16.csv 0.9599724037676819
Similarity using small model: human_17.txt watson_17.csv 0.9367605599306302
Similarity using small model: human_18.txt watson_18.csv 0.8760760037870665
Similarity using small model: human_2.txt watson_2.csv 0.9184563762823503
Similarity using small model: human_3.txt watson_3.csv 0.9287452822270265
Similarity using small model: human_4.txt watson_4.csv 0.9415664367046419
Similarity using small model: human_5.txt watson_5.csv 0.9158895909429551
Similarity using small model: human_6.txt watson_6.csv 0.935313240861153
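The output above pairs the files correctly because both listings happen to come back in the same order. Python's documentation says os.listdir() returns names in arbitrary order, so pairing by position is only safe if both lists are ordered the same way. One possible alternative, sketched below, pairs files by the number in their names. It assumes every file follows the human_N.txt / watson_N.csv pattern seen in the output; file_number and pair_by_number are hypothetical helper names, not part of any library.

```python
import re


def file_number(name):
    """Return the first run of digits in a file name as an int (hypothetical helper)."""
    match = re.search(r"\d+", name)
    if match is None:
        raise ValueError(f"no number found in {name!r}")
    return int(match.group())


def pair_by_number(human_files, api_files):
    """Pair file names that share the same number, in numeric order."""
    api_by_number = {file_number(name): name for name in api_files}
    pairs = []
    for human_name in sorted(human_files, key=file_number):
        number = file_number(human_name)
        # skip human files that have no counterpart with the same number
        if number in api_by_number:
            pairs.append((human_name, api_by_number[number]))
    return pairs


# The two lists are deliberately in different orders
human_files = ["human_10.txt", "human_2.txt", "human_9.txt"]
api_files = ["watson_9.csv", "watson_10.csv", "watson_2.csv"]
pairs = pair_by_number(human_files, api_files)
print(pairs)
```

A side effect of sorting on the number is that human_2 is processed before human_10, rather than after it as in the output above.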
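Once I have fixed the character encoding error, I will wrap this in a function so that I can call either the large or the small model on two directories for the remaining APIs I have to test.
【Discussion】:
Tags: python nlp spacy cosine-similarity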
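A rough sketch of what such a function might look like follows. It is not the code that was eventually used. read_text and compare_directories are hypothetical names, and the UTF-16 fallback assumes the problem file starts with a byte-order mark, so that decoding it as UTF-8 fails cleanly. The nlp argument is any callable that returns objects with a .similarity method, such as a loaded spaCy pipeline.

```python
import os


def read_text(path):
    """Read a file as UTF-8, falling back to UTF-16 (hypothetical helper)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except UnicodeDecodeError:
        with open(path, encoding="utf-16") as handle:
            return handle.read()


def compare_directories(human_directory, api_directory, nlp, label="model"):
    """Return (human_file, api_file, score) for files paired by sorted position."""
    human_files = sorted(os.listdir(human_directory))
    api_files = sorted(os.listdir(api_directory))
    results = []
    for human_file, api_file in zip(human_files, api_files):
        human_doc = nlp(read_text(os.path.join(human_directory, human_file)))
        api_doc = nlp(read_text(os.path.join(api_directory, api_file)))
        score = human_doc.similarity(api_doc)
        print(f"Similarity using {label}:", human_file, api_file, score)
        results.append((human_file, api_file, score))
    return results


# Possible usage, assuming spaCy and both models are installed:
# import spacy
# compare_directories(human_directory, api_directory, spacy.load("en_core_web_sm"), "small model")
# compare_directories(human_directory, api_directory, spacy.load("en_core_web_lg"), "large model")
```

Passing the loaded model in as an argument lets the same loop be reused for each model and each API directory. The spaCy documentation notes that the small pipelines ship without word vectors, so .similarity scores from en_core_web_sm may be less informative than those from the larger models.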