【Posted】: 2021-10-14 04:27:18
【Question】:
I am trying to use the multilingual pretrained Wiki word vectors from FastText (https://fasttext.cc/docs/en/pretrained-vectors.html).
I scraped the vectors from the website as follows:
import requests

# link to vector file for German
url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.de.align.vec'
r = requests.get(url, stream = True)
if r.encoding is None:
    r.encoding = 'utf-8'
with open('/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/data/extract_DE.txt', 'w', encoding="utf-8") as fp:
    for line_num, vector in enumerate(r.iter_lines(decode_unicode = True)):
        fp.write(vector)
        fp.write('\n')
        # first 20,000 words
        if line_num == 20_001:
            break
and removed the first line:
deu_input = open('/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/data/extract_DE.txt', 'r', encoding="utf-8").readlines()
with open('/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/data/extract_DE_nofirstline.txt', 'w', encoding="utf-8") as deu_output:
    for index, line in enumerate(deu_input):
        if index != 0:
            deu_output.write(line)
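One thing that may be worth checking in the two steps above: as far as I can tell, `iter_lines(decode_unicode = True)` splits decoded text with `str.splitlines()`, which also breaks on characters such as U+2028 or U+0085, whereas byte strings are only split on `\n` and `\r`. If a word in the vector file happens to be one of those characters, its line would be cut in two. The following is a minimal sketch, not a confirmed diagnosis; `take_vectors` is a hypothetical helper and the sample data is made up so the block runs without a download.

```python
from itertools import islice


def take_vectors(byte_lines, n_words):
    """Yield the first n_words vector lines as str, skipping the header line.

    byte_lines is any iterable of raw byte lines (hypothetical helper; with
    requests this could be r.iter_lines() called without decode_unicode).
    """
    lines = iter(byte_lines)
    next(lines, None)  # drop the "<count> <dim>" header
    for raw in islice(lines, n_words):
        yield raw.decode("utf-8")


# Made-up miniature file: header, then three "words", one of which is U+2028.
text = "3 2\nder 0.1 0.2\n\u2028 0.3 0.4\nund 0.5 0.6\n"

as_str = text.splitlines()                    # str: also splits on U+2028
as_bytes = text.encode("utf-8").splitlines()  # bytes: splits on \n and \r only

kept = list(take_vectors(as_bytes, 2))

print(len(as_str), len(as_bytes))
print(kept)
```

Working on byte lines and decoding each line afterwards keeps such a word on its own line, and skipping the header in the same pass avoids the second file.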
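What I am doing works for some languages, or up to a certain number of vectors, but for other languages, or beyond a certain number of entries, I get the following error: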
Traceback (most recent call last):
  File "explorer_ES.py", line 22, in <module>
    ns = neighbours(vectors,w,20) # neighbours is what I imported from utils, w is the word I entered, and I get 20 examples of nearest neighbours
  File "/mnt/c/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/utils.py", line 31, in neighbours
    cos = cosine_similarity(dm, w, k)
  File "/mnt/c/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/utils.py", line 21, in cosine_similarity
    num = np.dot(dm[w1],dm[w2])
  File "<__array_function__ internals>", line 5, in dot
ValueError: shapes (300,) and (299,) not aligned: 300 (dim 0) != 299 (dim 0)
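For example, I get this error when I try to use the German file I scraped earlier (with the first line removed). I get the same error with some other languages, but not with others.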
from utils import readDM, cosine_similarity, neighbours
import sys

fasttext_vecs="./data/extract_DE_nofirstline.txt"
print("Reading vectors...")
vectors = readDM(fasttext_vecs)
f = ""
while f != 'q':
    f = input("\nWhat would you like to do? (n = nearest neighbours, s=similarity, q=quit) ")
    while f == 'n':
        w = input("Enter a word or 'x' to exit nearest neighbours: ")
        if w == 'x':
            f = 'x'
        else:
            ns = neighbours(vectors,w,20) # neighbours is what I imported from utils, w is the word I entered, and I get 20 examples of nearest neighbours
            print(ns)
    while f == 's':
        w = input("Input two words separated by a space or 'x' to exit similarity: ")
        if w == 'x':
            f = 'x'
        else:
            w1,w2 = w.split() # splits a string into a list
            if w1 in vectors and w2 in vectors:
                sim = cosine_similarity(vectors,w1,w2)
                print("SIM",w1,w2,sim)
            else:
                print("Word(s) not found in space.")
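The `readDM` source is not shown here, so the following is only a guess at where a 299-dimensional vector could come from: if a line is parsed with a bare `split()`, a word that is itself a whitespace character (for example a non-breaking space, U+00A0) is discarded, and the first number ends up being treated as the word. A minimal sketch for locating such lines and parsing them from the right instead; `find_short_lines` and `parse_line` are hypothetical helpers, and the three-dimensional sample lines are made up.

```python
def find_short_lines(lines, dim):
    """Return (line_number, token_count) for every line where a plain
    split() does not produce exactly dim + 1 tokens (word + dim numbers)."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        token_count = len(line.split())
        if token_count != dim + 1:
            bad.append((line_number, token_count))
    return bad


def parse_line(line, dim):
    """Split from the right so the last dim fields are the numbers and
    whatever is left, even a whitespace character, is kept as the word."""
    # Strip only newline characters and trailing ASCII spaces, not U+00A0.
    parts = line.rstrip("\r\n ").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
    return parts[0], [float(x) for x in parts[1:]]


# Made-up sample with dim = 3; the second "word" is a non-breaking space.
sample = ["der 0.1 0.2 0.3\n", "\xa0 0.4 0.5 0.6\n", "und 0.7 0.8 0.9\n"]

bad = find_short_lines(sample, dim=3)
word, values = parse_line(sample[1], dim=3)

print(bad)
print(repr(word), values)
```

Running `find_short_lines` over the extracted file with `dim=300` would show whether any line really is one token short, and at which line number, before changing the reader.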
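【Comments】:
-
Why are you cutting the original vectors? Please show a detailed example that raises this exception, specifying the chosen word vectors and the user input.
-
@StefanoFiorucci-anakin87 Using the entire vector file is overkill, because I want to do this for several languages and my computer is not that powerful. Before stumbling on this problem I was able to use 40,000 vectors for another language successfully, but with German, for example, I get the error with only 20,000 vectors. I have edited the original post with the full error I get.
Tags: python vector multilingual fasttext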