Memory-efficient way of producing a weighted edge list: answers

【Question title】: Memory efficient way of producing a weighted edge list
【Posted】: 2018-03-31 23:10:58
【Question】:

I have a dataframe of features indexed by ID.

ID1, Red, Green, Blue
ID2, Yellow, Green, Orange
ID3, Gray, Green, Yellow
ID4, Yellow, Green, Blue

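I am trying to generate an edge list with cosine similarity as the weight, without first generating an adjacency matrix.

I have plenty of compute time, but memory is limited and the dataset is large.

I need the following, excluding pairs whose weight is 0: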

ID1 ID2 Weight (cosine similarity)
01 02 0.33
01 03 0.25
01 04 0.75

(The weights are for illustration only.)

This is how I solved the problem by going through an adjacency matrix.

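import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Pivot to one row per ID and one count column per color
df = df.pivot_table(index='ID', columns='color', aggfunc=len, fill_value=0)
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
matrix = df.to_numpy().astype(np.float32)
# Dense n x n similarity matrix: this is the step that exhausts memory
matrix = cosine_similarity(matrix)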

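Using combinations I was able to generate the list of pairs, but I am not sure how to apply cosine_similarity while excluding zeros, so that memory does not fill up.

from itertools import combinations

edge_list = pd.DataFrame(list(combinations(df.index.tolist(), 2)), columns=['Source', 'Target'])

Any input is appreciated. Thanks,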

【Comments】:

Tags: pandas numpy networkx cosine-similarity

【Solution 1】:

Here is a fairly simple for-loop approach:

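from itertools import combinations

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Join each row's colors into one string and vectorize it into a sparse count matrix
vect = CountVectorizer()
X = vect.fit_transform(df.add(' ').sum(1))

data = []
# Compare one pair of rows at a time, so the full n x n matrix is never built
for i1, i2 in combinations(df.index.tolist(), 2):
    data.append([i1, i2,
                 cosine_similarity(X[df.index.get_loc(i1)],
                                   X[df.index.get_loc(i2)]).ravel()[0]])
data = pd.DataFrame(data, columns=['Source','Target','Weight'])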
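The loop above calls cosine_similarity once per pair, which keeps memory low but is slow for many rows. One possible alternative, sketched here under assumptions rather than taken from the answer, is to L2-normalize the rows of the sparse matrix and multiply it against its transpose a block of rows at a time. The dot product of two unit-length rows is their cosine similarity, and a sparse product stores only non-zero entries, so zero-weight pairs never appear. The names iter_edges and chunk_size are hypothetical, and the sample frame is rebuilt from the question's four rows.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize


def iter_edges(X, ids, chunk_size=1000):
    """Yield (source, target, weight) for each pair with non-zero cosine similarity.

    Hypothetical helper: X is a sparse count matrix, ids the row labels.
    """
    # Unit-length rows: the dot product of two rows is then their cosine similarity
    Xn = normalize(sparse.csr_matrix(X, dtype=np.float64), norm="l2", axis=1)
    XnT = Xn.T.tocsr()
    for start in range(0, Xn.shape[0], chunk_size):
        # Only chunk_size rows of the similarity matrix exist at any time,
        # and the sparse product keeps non-zero entries only
        block = (Xn[start:start + chunk_size] @ XnT).tocoo()
        for r, c, w in zip(block.row, block.col, block.data):
            i = start + int(r)
            # Keep the upper triangle so each pair is emitted once, without self-pairs
            if c > i and w > 0:
                yield ids[i], ids[int(c)], float(w)


# Sample data rebuilt from the question
df = pd.DataFrame(
    [["Red", "Green", "Blue"],
     ["Yellow", "Green", "Orange"],
     ["Gray", "Green", "Yellow"],
     ["Yellow", "Green", "Blue"]],
    index=["ID1", "ID2", "ID3", "ID4"],
)
X = CountVectorizer().fit_transform(df.apply(" ".join, axis=1))

edges = pd.DataFrame(iter_edges(X, df.index.tolist(), chunk_size=2),
                     columns=["Source", "Target", "Weight"])
print(edges.sort_values(["Source", "Target"]))
```

Peak memory here depends on how many non-zero similarities fall inside one block, so chunk_size is the knob to lower if a block is still too large. When nearly every row shares a feature, as with Green in the sample, each block is close to dense.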

Result:

Vectorized source DF:

In [280]: X
Out[280]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
        with 12 stored elements in Compressed Sparse Row format>

In [281]: X.A
Out[281]:
array([[1, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 1],
       [0, 1, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 1]], dtype=int64)

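Displaying it as a sparse DF (pd.SparseDataFrame was removed in pandas 1.0, where pd.DataFrame.sparse.from_spmatrix serves the same purpose; vect.get_feature_names() has likewise been replaced by get_feature_names_out() in newer scikit-learn):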

In [282]: pd.SparseDataFrame(X, columns=vect.get_feature_names(), default_fill_value=0)
Out[282]:
   blue  gray  green  orange  red  yellow
0     1     0      1       0    1       0
1     0     0      1       1    0       1
2     0     1      1       0    0       1
3     1     0      1       0    0       1

Resulting DF:

In [283]: data
Out[283]:
  Source Target    Weight
0    ID1    ID2  0.333333
1    ID1    ID3  0.333333
2    ID1    ID4  0.666667
3    ID2    ID3  0.666667
4    ID2    ID4  0.666667
5    ID3    ID4  0.666667
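The weights in this table can be checked by hand from the rows of the count matrix. As a worked example, using the standard definition of cosine similarity and the rows shown for ID1 and ID4:

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}

a_{\mathrm{ID1}} = (1, 0, 1, 0, 1, 0), \qquad a_{\mathrm{ID4}} = (1, 0, 1, 0, 0, 1)

a_{\mathrm{ID1}} \cdot a_{\mathrm{ID4}} = 2 \quad (\text{blue and green}), \qquad
\lVert a_{\mathrm{ID1}} \rVert = \lVert a_{\mathrm{ID4}} \rVert = \sqrt{3}

\cos(a_{\mathrm{ID1}}, a_{\mathrm{ID4}}) = \frac{2}{\sqrt{3}\,\sqrt{3}} = \frac{2}{3} \approx 0.666667
```

Every row has exactly three colors, so each weight is simply the number of shared colors divided by 3. ID1 and ID2 share only green, which gives 1/3.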

【Discussion】:

Thanks. Using the X matrix and the DF, I set up a list of combinations and iterated over it to populate data: for i1, i2 in comb_list: data.append([i1, i2, cosine_similarity(X[df.index.get_loc(i1)], X[df.index.get_loc(i2)]).ravel()[0]]) I get the following error: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The arguments to cosine_similarity need an extra pair of brackets: for i1, i2 in comb_list: data.append([i1, i2, cosine_similarity([X[df.index.get_loc(i1)]], [X[df.index.get_loc(i2)]]).ravel()[0]]). Even so, it does not exclude combinations whose cosine similarity is 0, so it is not a complete solution. An if statement is needed before the append, but I could not get it to work with the list output: if [i1, i2, cosine_similarity([X[df.index.get_loc(i1)]], [X[df.index.get_loc(i2)]]).ravel()[0]] > 0: @MaxU
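The if statement in the last comment compares a whole list with 0, which raises a TypeError in Python 3. A minimal sketch of one way around it is to compute the weight first, test that scalar, and only then append. Selecting rows as X[[p]] keeps each operand two-dimensional, which should also avoid the reshape error from the first comment when X is a dense array. The row ID5 is a hypothetical addition with no colors in common with the others, included only so that there is something to exclude.

```python
from itertools import combinations

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data from the question, plus a hypothetical ID5 sharing no color with the rest
df = pd.DataFrame(
    [["Red", "Green", "Blue"],
     ["Yellow", "Green", "Orange"],
     ["Gray", "Green", "Yellow"],
     ["Yellow", "Green", "Blue"],
     ["Pink", "Black", "White"]],
    index=["ID1", "ID2", "ID3", "ID4", "ID5"],
)
X = CountVectorizer().fit_transform(df.apply(" ".join, axis=1))

data = []
# enumerate() supplies row positions directly, so no get_loc lookup is needed per pair
for (p1, i1), (p2, i2) in combinations(enumerate(df.index), 2):
    # X[[p]] selects a 1 x n_features slice, so the result is a 1 x 1 array
    weight = cosine_similarity(X[[p1]], X[[p2]])[0, 0]
    # Test the scalar weight, not the list that would be appended
    if weight > 0:
        data.append([i1, i2, weight])

data = pd.DataFrame(data, columns=["Source", "Target", "Weight"])
print(data)
```

Of the ten possible pairs, the four involving ID5 have weight 0 and are dropped, so only the six pairs among ID1 to ID4 remain.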