Pandas / numpy get top 10 from matrix

【Question title】: Pandas / numpy get top 10 from matrix
【Posted】: 2021-10-10 02:03:22
【Question description】:

I am computing a similarity matrix, and the result is very large, so I want to reduce its size.

Here is my existing code:

import pandas as pd
from scipy.spatial.distance import pdist, squareform  # required by the pdist/squareform call below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sql_con import get_con

df = pd.read_csv('test.csv')

df = df[['id','value1', 'value2']]

df = df.set_index('id')

sim = squareform(pdist(df, metric='cosine'))

sim_df = pd.DataFrame(sim, columns=df.index, index=df.index)

table = sim_df.unstack()

table.index.rename(['id_1', 'id_2'], inplace=True)
table = table.to_frame('distance')
table.reset_index(inplace=True)

The resulting table looks like this:

id_1 id_2 distance

 a    b      0.1
 a    c      0.2

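To reduce the size, for each element/id in the matrix I only want to keep the top n (say 10) closest/most similar elements, which in practice means taking the 10 smallest distances.

I have tried reducing the size once the data is in the "table" dataframe variable, but because of the way pandas works, every step I take to shrink the data ends up increasing memory usage. For this reason, I would like to know what the options are for reducing the data while it is still a numpy array (the "sim" variable).

What is an efficient way (in terms of both memory and time) to reduce the size of this matrix so that only the top n closest ids are kept for each id?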

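【Comments】:

It looks like you are looking for np.argpartition
How do you intend to use the "compact" matrix? As I understand it, you currently have a mapping (Element, OtherElement) -> Similarity encoded as a symmetric matrix, and you want a mapping (Element, Rank) -> (OtherElement, Similarity) that is valid for Rank between 1 and 10. Is that right?
Ideally, I would like to go from the symmetric matrix to an id pairwise distance table like the one added above
Hi @MustardTiger, I have updated my answer to give you an id pairwise distance table, as in your updated question.

Tags: python arrays pandas dataframe numpy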

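【Solution 1】:

If I understand the question correctly, you have a very large matrix and you want to reduce its dimensions in some way.

One approach that I think should be fairly efficient is the following.

This method assumes you have an m*m similarity matrix sim whose indexes correspond to df.index, i.e. sim[i] is the row of values for the i-th index label.

Although I am not sure how you obtain sim, based on my understanding of this page and of your question, I think this should do it: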

sim = squareform(pdist(df, metric='cosine'))
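For reference, here is a minimal sketch of what that call produces, using a made-up three-row frame (`toy_df`, its ids and its values are illustrative and not taken from the question). Note that `pdist` with `metric='cosine'` returns cosine distance, i.e. 1 minus cosine similarity, so smaller values mean more similar, and `squareform` gives a symmetric matrix with a zero diagonal.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data standing in for test.csv
toy_df = pd.DataFrame(
    {"value1": [1.0, 2.0, 0.0], "value2": [0.0, 0.0, 3.0]},
    index=pd.Index(["a", "b", "c"], name="id"),
)

# pdist returns the condensed upper triangle of pairwise distances;
# squareform expands it into a full m x m symmetric matrix
toy_sim = squareform(pdist(toy_df, metric="cosine"))

# rows/columns follow the order of toy_df.index:
# 'a' and 'b' point in the same direction (distance ~0),
# 'c' is orthogonal to both (distance ~1)
print(pd.DataFrame(toy_sim, index=toy_df.index, columns=toy_df.index))
```

Because the matrix holds distances, "most similar" in the steps below means "smallest value in the row".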

The rest is as follows:


import numpy as np

n = 10

# an array that stores each label at its corresponding index position in the sim matrix
index_labels = df.index.to_numpy()

# for each row, get the indexes of the n smallest distances, i.e. the n most
# similar items (this is the 'reduced' matrix)
highest_similarity_idxs = np.argsort(sim, axis=1)[:, :n]

# get the labels of the highest similarity items
highest_similarity_labels = index_labels[highest_similarity_idxs]

# most similar labels for the label at position 0
label = index_labels[0]  # any label you are interested in
row_idx = index_labels.tolist().index(label)
most_similar_labels = highest_similarity_labels[row_idx]

# get the actual distance values for each row
highest_similarity_values = np.take_along_axis(sim, highest_similarity_idxs, axis=1)
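As suggested in the comments, np.argpartition can stand in for the full np.argsort when only the n smallest entries per row are needed, since it avoids sorting each whole row. The following is a sketch under that assumption; `top_n_smallest` is a hypothetical helper name, not a NumPy function, and it requires n to be no larger than the number of columns.

```python
import numpy as np

def top_n_smallest(dist, n):
    """Return (idxs, values) of the n smallest entries per row, in ascending order."""
    # argpartition places the n smallest entries of each row in the first
    # n columns (in arbitrary order) without fully sorting the row
    part = np.argpartition(dist, n - 1, axis=1)[:, :n]
    part_vals = np.take_along_axis(dist, part, axis=1)
    # sort only those n candidates per row
    order = np.argsort(part_vals, axis=1)
    idxs = np.take_along_axis(part, order, axis=1)
    vals = np.take_along_axis(part_vals, order, axis=1)
    return idxs, vals

# illustrative random data, not from the question
rng = np.random.default_rng(0)
demo = rng.random((5, 8))
demo_idxs, demo_vals = top_n_smallest(demo, 3)
```

The returned arrays have the same shape and meaning as highest_similarity_idxs and highest_similarity_values, so the remaining steps should work unchanged with them.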

Converting to an id pairwise dataframe

To get things into a dataframe in the style you want, you can add the following:

data = {
  'id_1': np.repeat(index_labels, n),
  'id_2': highest_similarity_labels.flatten(),
  'distance': highest_similarity_values.flatten()
}

df = pd.DataFrame(data)

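Note: for the resulting table to match up correctly, the labels in index_labels must correspond to the row indexes of the sim matrix.

For example, given the following index_labels and sim matrix, we have: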

index_labels = np.array(['a', 'b'])
sim = np.array([[0, 0.2], 
                [0.3, 0]])

# sim[0, 1] is the distance from the label 'a' to 'b' which is 0.2
# sim[0, 0] is the distance from the label 'a' to 'a' which is 0, etc.
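To make the correspondence concrete, here is a small end-to-end sketch on a made-up 3x3 distance matrix (the labels and values are illustrative). It also shows one thing worth checking: every row's smallest distance is its own diagonal zero, so each id is returned as its own nearest neighbour. If self-pairs are not wanted, the diagonal can be set to infinity first; that step is an addition here, not part of the steps above.

```python
import numpy as np
import pandas as pd

index_labels = np.array(["a", "b", "c"])
sim = np.array([[0.0, 0.2, 0.5],
                [0.2, 0.0, 0.1],
                [0.5, 0.1, 0.0]])
n = 2

# Optional: exclude self-matches by making the diagonal unreachable.
# The copy keeps `sim` intact; to save memory on a large matrix,
# np.fill_diagonal(sim, np.inf) could be applied in place instead.
dist = sim.copy()
np.fill_diagonal(dist, np.inf)

# indexes and values of the n smallest distances per row
idxs = np.argsort(dist, axis=1)[:, :n]
values = np.take_along_axis(dist, idxs, axis=1)

pairs = pd.DataFrame({
    "id_1": np.repeat(index_labels, n),   # each label repeated n times
    "id_2": index_labels[idxs].ravel(),   # its n nearest labels, row by row
    "distance": values.ravel(),
})
print(pairs)
```

Each id_1 appears n times, paired with its nearest ids in ascending order of distance, e.g. 'a' with 'b' (0.2) and then 'c' (0.5).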

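Hope this solves your problem!

【Discussion】:

Tested it and it works great. Compared with the pandas implementation I tried before, there are no memory problems and it runs very fast
Great, glad to hear it. NumPy is excellent at making things faster.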