You probably want to compute this in a vectorized way, for example with numpy.
Say your data has the following format:
#
# X contains 1000 vectors of size 50, one per column
#
import numpy as np
X = np.random.random( (50,1000) )
#
# v contains the vector you want to calculate the distance to
#
v = np.random.random( (50,1) )
Then the loop approach is
%%timeit
#
# for loop approach
#
from scipy import spatial
similarity_score = []
for i in X.T:
    # v.ravel() flattens v to 1-D, since some SciPy versions only accept 1-D input here
    similarity_score.append(spatial.distance.cosine(i, v.ravel()))
On my machine this gives
82.3 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
and produces the following output
similarity_score[:10]
[0.22282765699585905,
0.2160367009172488,
0.30853097430098786,
0.24034072729579192,
0.16217833767527134,
0.2829791739176786,
0.18946375557860284,
0.19624968983011593,
0.2484078232716126,
0.3258394812037617]
When we implement this in vectorized form in numpy
%%timeit
#
# Vectorized approach using np.einsum
#
I = np.einsum("ij,ij->j", X,v)
D = 1 - I / ( np.linalg.norm(X,ord=2,axis=0) * np.linalg.norm(v,ord=2,axis=0) )
we get
191 µs ± 63.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and of course we check the output as well
D[:10]
array([0.22282766, 0.2160367 , 0.30853097, 0.24034073, 0.16217834,
0.28297917, 0.18946376, 0.19624969, 0.24840782, 0.32583948])
Note that the two outputs are not of the same type: the loop builds a Python list, while the numpy version returns a numpy array.
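If you want to reuse this, the vectorized computation could be wrapped in a small helper and checked against a plain loop. The following is only a sketch: the function name `cosine_distances` is made up for illustration, it assumes the vectors are stored as the columns of X (as above), and it flattens v to 1-D and uses the einsum signature "ij,i->j" rather than relying on broadcasting.

```python
import numpy as np


def cosine_distances(X, v):
    """Cosine distance between every column of X and the vector v.

    Hypothetical helper: assumes X has shape (n_features, n_vectors)
    and v has n_features entries (shape (n,) or (n, 1)).
    """
    X = np.asarray(X, dtype=float)
    v_flat = np.asarray(v, dtype=float).ravel()
    # Dot product of each column of X with v
    dots = np.einsum("ij,i->j", X, v_flat)
    # Product of the Euclidean norms of each column and of v
    norms = np.linalg.norm(X, ord=2, axis=0) * np.linalg.norm(v_flat, ord=2)
    return 1.0 - dots / norms


rng = np.random.default_rng(0)
X = rng.random((50, 1000))
v = rng.random((50, 1))

D = cosine_distances(X, v)

# Reference values from an explicit loop over the columns
v_flat = v.ravel()
reference = [
    1.0 - (col @ v_flat) / (np.linalg.norm(col) * np.linalg.norm(v_flat))
    for col in X.T
]

# Compare with a tolerance rather than exact equality
same = np.allclose(D, reference)
```

Comparing with `np.allclose` rather than `==` avoids false mismatches from floating-point rounding, and `D.tolist()` gives a plain Python list if you need the same type as the loop version.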