Pairwise similarity/similarity matrix calculation optimization: answers

【Question title】: Pairwise similarity/similarity matrix calculation optimization
【Posted】: 2020-07-16 09:24:01
【Question description】:

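Problem definition

Question

How can the pairwise cosine similarity calculation be optimized for a large number of vectors (an estimate is acceptable)?

Formal definition

For two sets of vectors (A, B), a pairwise cosine similarity sim(a_i, b_j) has to be produced for every a and b. (A cosine similarity matrix is also fine, since it is easy to convert from a matrix to pairs.)

Why I am asking for help

This looks like a common problem, since such distances have to be computed in computational biology, recommender systems, and so on. Still, I have not found a reasonable solution.

The problem I cannot solve

By definition, the complexity of this problem is O(len_A * len_B * O(similarity_function)), so with 10^6 vectors in each of A and B the runtime becomes enormous.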
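
To make the cost concrete, here is a minimal NumPy sketch of the full similarity matrix computed as a single matrix product. It does not change the O(len_A * len_B * d) complexity; it only moves the loops into optimized linear algebra routines. The function name `cosine_similarity_matrix` and the random test data are assumptions made for illustration, not part of the original post.

```python
import numpy as np

def cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Return S with S[i, j] = cos(A[i], B[j]) for row-vector matrices A and B."""
    # Normalize each row to unit length (assumes there are no all-zero rows).
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # For unit vectors the dot product equals the cosine of the angle.
    return A_unit @ B_unit.T

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, size=(10, 4))  # hypothetical set A: 10 vectors, 4 features
B = rng.uniform(0, 1, size=(10, 4))  # hypothetical set B
S = cosine_similarity_matrix(A, B)   # shape (len(A), len(B))
```

Each entry `S[i, j]` corresponds to one (a_i, b_j) pair, so converting to the pairwise form is a matter of flattening the matrix.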

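My hypothesis about a possible direction

It looks like we are doing a lot of redundant work here, because the similarities are not independent: if a_i already has its similarities computed against a million vectors, b_j is very similar to a_i, and b_j has similarities computed against 900k of those vectors, then we can estimate the similarity of b_j to the remaining 100k vectors. I think something like an index could be used here.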
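
One known family of techniques for estimating cosine similarity instead of computing it exactly is random-hyperplane hashing (SimHash). It is not mentioned in the original post; the sketch below is only an illustration of what "estimation" could look like, and the function names, the number of bits, and the test vectors are all assumptions. Two vectors fall on different sides of a random hyperplane with probability angle/π, so the fraction of differing signature bits gives an estimate of the angle.

```python
import numpy as np

def simhash_signatures(X: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Boolean signature per row: the side of each random hyperplane the row lies on."""
    return (X @ planes.T) >= 0

def estimate_cosine(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Estimate cos(a, b) from the fraction of differing signature bits."""
    differing = np.mean(sig_a != sig_b)   # approximates angle / pi
    return float(np.cos(np.pi * differing))

rng = np.random.default_rng(0)
dim, n_bits = 16, 4096                    # illustrative sizes
planes = rng.standard_normal((n_bits, dim))

a = rng.standard_normal(dim)
b = a + 0.5 * rng.standard_normal(dim)    # a vector fairly close to a

sig = simhash_signatures(np.vstack([a, b]), planes)
approx = estimate_cosine(sig[0], sig[1])
exact = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The signatures are computed once per vector, and comparing two signatures is cheaper than a floating-point dot product when the original dimension is large; the accuracy depends on the number of bits.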

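Other details

A and B are disjoint.
The vector dimensionality has already been reduced to the smallest reasonable value.
No simple for-loop optimization is needed. There is a short guide for optimizing that; the simplest loop is given here only to show the algorithm clearly.
I am interested in whether there is an algorithm that produces estimates as well, so it is fine if the similarities are close enough to the true values without being exactly equal.
No parallelization is needed.
I know the resulting similarity matrix will be large.
I am also interested in an algorithm that returns only the most similar vectors from set B for each vector in set A.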
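
For the last point (only the most similar vector from B for each vector in A), the full matrix never has to be stored. The following is a minimal sketch, assuming exact brute-force search in row blocks; the function name `most_similar_in_b` and the `chunk_size` parameter are hypothetical and chosen for illustration.

```python
import numpy as np

def most_similar_in_b(A: np.ndarray, B: np.ndarray, chunk_size: int = 1024):
    """For each row of A, return (index of the most similar row of B, its cosine)."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    best_idx = np.empty(len(A), dtype=np.int64)
    best_sim = np.empty(len(A), dtype=A_unit.dtype)
    for start in range(0, len(A), chunk_size):
        # Only a (chunk_size x len(B)) block of the matrix exists at any time.
        block = A_unit[start:start + chunk_size] @ B_unit.T
        idx = block.argmax(axis=1)
        best_idx[start:start + len(block)] = idx
        best_sim[start:start + len(block)] = block[np.arange(len(block)), idx]
    return best_idx, best_sim

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, size=(25, 4))
B = rng.uniform(0, 1, size=(30, 4))
best_idx, best_sim = most_similar_in_b(A, B, chunk_size=8)
```

The amount of arithmetic is the same as for the full matrix, but peak memory is bounded by `chunk_size * len(B)` entries.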

Thanks for your answers.

Code example

Requirements

python==3.6
pandas==0.25.0
scikit-learn==0.21.3
numpy==1.17.1

Generate dummy data

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

df_1 = pd.DataFrame({'object_id_1': range(10),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

df_2 = pd.DataFrame({'object_id_2': range(10,20),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

Similarity generation function

def get_similarities(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features:list) -> pd.DataFrame:
    '''
    This function generates features based similarity scores, between two groups of objects
    
    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features, and id_s of objects
    df_2: pandas.DataFrame
        DataFrame with features, and id_s of objects which has no id_s same to df_1
    meaningful_features: list
        Features columns to calculate similarity on
        
    Returns
    ----------
        similarities_of_objects: pandas.DataFrame
            DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity', 
            where we have features similarity, for each object_1-object_2 pair. 
            Similarity - symmetric.  
    '''

    objects_1 = [] #  list of all objects from df_1
    objects_2 = [] #  list of all objects from df_2
    similarities = [] #  list of scores for object_1-object_2 pairs

    for object_1 in df_1['object_id_1'].unique():
        features_vector_1 = df_1[df_1['object_id_1'] == object_1][meaningful_features] # object_1 features vector
        
        for object_2 in df_2['object_id_2'].unique():
            features_vector_2 = df_2[df_2['object_id_2'] == object_2][meaningful_features] # object_2 features vector
            
            objects_1.append(object_1)
            objects_2.append(object_2)
            similarities.append(cosine_similarity(X = np.array(features_vector_1)
                                    ,Y = np.array(features_vector_2)).item()) # similarities of vectors 
    
    sim_o1_to_o2 = pd.DataFrame()

    sim_o1_to_o2['objects_1']= objects_1
    sim_o1_to_o2['objects_2']= objects_2
    sim_o1_to_o2['similarity']= similarities

    return sim_o1_to_o2

Generate similarities

get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])

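【Question discussion】:

Do you need the similarity matrix?
In general, yes.
I like your question; I just wonder whether you will run into another problem besides runtime. 10^6 vectors per set means a similarity matrix with 10^12 entries. With each entry stored as a 64-bit float, you would need about 8 TB to store the matrix! Maybe you can optimize that (symmetry, diagonal = 1), but it is still large.
@Tinu Thanks for your comment. Good hint. I am aware of this problem and already have suitable storage. Unfortunately, I have not found a suitable alternative to generating the similarity matrix, which is a rare situation :) - that is why I keep struggling with this problem.
Actually I was wrong: for most algorithms the whole matrix has to fit in RAM (not in ROM, as I expected). So @Tinu's comment was more helpful than I realized; I just failed to understand it :)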
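
The storage estimate from the comments can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
n_a = 10**6              # vectors in set A
n_b = 10**6              # vectors in set B
bytes_per_entry = 8      # one 64-bit float per similarity

total_bytes = n_a * n_b * bytes_per_entry
total_tb = total_bytes / 10**12    # decimal terabytes
total_tib = total_bytes / 2**40    # binary tebibytes

print(f"{total_tb:.1f} TB ({total_tib:.2f} TiB)")  # 8.0 TB (7.28 TiB)
```

Storing the entries as 32-bit floats would halve this figure, which is still far more than typical RAM.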

Tags: python pandas algorithm numpy similarity

【Solution 1】:

Use Faiss

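import faiss
import numpy as np

n = 1000          # number of vectors (example value, not given in the original answer)
dimension = 100

# value1 is indexed, value2 is queried; both are random example data
value1 = np.random.random((n, dimension)).astype('float32')
value2 = np.random.random((n, dimension)).astype('float32')

index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 index
index.add(value1)

xq = value2
k = len(value1)                       # return every indexed vector for each query
D, I = index.search(xq, k)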

Note that D holds the distances and I holds the indices of the values.

Also, value1 and value2 are just NumPy arrays.

PS: install faiss first.

pip install faiss

【Discussion】:

This is a fairly common problem. I am a bit biased towards Faiss; see github.com/erikbern/ann-benchmarks for more information.

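【Solution 2】:

How to get cosine similarity from Euclidean distance

For the most similar vectors only

Here's an alternative way to compute Euclidean distances, especially suited to cases where only the top similar vectors are needed rather than the whole similarity matrix.

Solution using the approach proposed by @Abhik Sarkar

This is a solution to the exact problem I posted, using the approach proposed by @Abhik Sarkar. To get cosine similarity, make sure your vectors have been normalized beforehand. The solution also lets you generate as many similarities as you need, without the full matrix.

Disclaimer: the solution focuses on readability, not performance.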
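
The conversion from distance to similarity relies on an identity that holds only for unit-length vectors: ||a - b||^2 = 2 - 2·cos(a, b), so cos(a, b) = 1 - ||a - b||^2 / 2. The quantity on the right is the squared Euclidean distance, which is what Faiss's `IndexFlatL2` reports. A minimal NumPy check of the identity, with random example vectors chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0, 1, size=4)
b = rng.uniform(0, 1, size=4)

# Normalize to unit length first; the identity below only holds for unit vectors.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = float(a_unit @ b_unit)
squared_l2 = float(np.sum((a_unit - b_unit) ** 2))

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(a, b) for unit vectors
cosine_from_l2 = 1 - squared_l2 / 2
print(cosine, cosine_from_l2)  # the two values agree up to rounding
```

Without the normalization step, `1 - squared_l2 / 2` is not a cosine similarity and can fall outside [-1, 1].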

Requirements

python==3.6
pandas==0.25.0
numpy==1.17.1
faiss==1.5.3

Generate dummy data

import pandas as pd
import numpy as np
import faiss 

df_1 = pd.DataFrame({'object_id_1': range(10),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

df_2 = pd.DataFrame({'object_id_2': range(10,20),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

Similarity generation function

def get_similarities(df_1: pd.DataFrame,
                     df_2: pd.DataFrame,
                     meaningful_features: list,
                     n_neighbors: int = None) -> pd.DataFrame:
    '''
    This function generates feature-based similarity scores between two groups of objects

    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features, and id_s of objects
    df_2: pandas.DataFrame
        DataFrame with features, and id_s of objects which has no id_s same to df_1
    meaningful_features: list
        Features columns to calculate similarity on
    n_neighbors: int
        Number of most similar objects_2 for every object_1. By default - full similarity matrix generated.
        (default = df_2.shape[0])

    Returns
    ----------
        similarities_of_objects: pandas.DataFrame
            DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity',
            where we have features similarity, for each object_1-object_2 pair.
            Similarity - symmetric.
    '''
    # Resolve the default at call time instead of at function definition time
    if n_neighbors is None:
        n_neighbors = df_2.shape[0]

    d = len(meaningful_features)  # dimensionality

    # faiss expects C-contiguous float32 arrays
    xq = np.ascontiguousarray(df_1[meaningful_features].values, dtype=np.float32)  # queries
    xb = np.ascontiguousarray(df_2[meaningful_features].values, dtype=np.float32)  # indexed vectors

    # L2-normalize the rows so that cosine = 1 - squared_L2 / 2 holds
    xq /= np.linalg.norm(xq, axis=1, keepdims=True)
    xb /= np.linalg.norm(xb, axis=1, keepdims=True)

    index = faiss.IndexFlatL2(d)  # build the index
    index.add(xb)                 # add df_2 vectors to the index

    # D: squared L2 distances, I: row positions in df_2, one row per df_1 object
    D, I = index.search(xq, n_neighbors)

    resulting_df = pd.DataFrame({'object_id_1': np.repeat(df_1['object_id_1'].values, n_neighbors),
                                 'object_id_2': df_2['object_id_2'].values[I.ravel()],
                                 'similarity': 1 - D.ravel() / 2})

    return resulting_df

Generate similarities

get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])

【Discussion】: