Selecting rows from a sparse matrix based on the index of a pandas DataFrame: answers

【Question title】: Selecting rows from a sparse matrix based on the index of a pandas DataFrame
【Posted】: 2015-10-19 16:56:16
【Question description】:

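Suppose I have a pandas DataFrame of shape (2,500,000, M) and a scipy CSR sparse matrix of shape (2,500,000, N).

Each row of the DataFrame and of the sparse matrix describes one entity. Both are sorted the same way, so row 1 of the DataFrame describes the same entity as row 1 of the sparse matrix. The DataFrame has a fast filtering mechanism (catalogue.where(catalogue.some_column != '')), but given the filtered DataFrame, how do I find the corresponding rows in the sparse matrix?

Suppose the DataFrame is called catalogue and the sparse matrix is called collection.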

import functools
import pickle
# Assumed import; the question does not show where ThreadPool comes from.
from multiprocessing.pool import ThreadPool

import scipy.sparse

def collection_filter_row(catalogue_filtered, catalogue_index_full, collection):
    # Fetch one sparse row per filtered label in a thread pool, then stack them.
    return scipy.sparse.vstack(ThreadPool(100).map(
        functools.partial(collection_get_row,
             catalogue_index=tuple(catalogue_index_full),
             collection=collection),
        tuple(catalogue_filtered.index.values)))

def collection_get_row(document_id, catalogue_index, collection):
    # tuple.index() is a linear scan over all labels for every single lookup.
    return collection.getrow(catalogue_index.index(document_id))

collection_partial = functools.partial(
    collection_filter_row,
    catalogue_index_full=catalogue.index.values,
    collection=pickle.load(open('collection-tfidf', 'rb')))
criteria = catalogue['criteria'].where(catalogue.criteria != '')
collection_state = collection_partial(criteria)

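But even with some form of concurrency (gevent, thread pool), selecting the corresponding rows is still slow. What am I doing wrong (or rather, is there a faster way)?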

【Question discussion】:

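Is vstack the sparse version?
Do you know which steps take the time? Without the pool, is collection.getrow the slow step? Or catalogue_index.index? The multiprocessing part may be working, but it obscures the sparse indexing step, which seems to be the part you are asking about.
Have you explored selecting multiple rows from collection at once? How does it scale?
@hpaulj (on vstack) Yes (is that slow?). I also tried selecting multiple rows from the collection at once, but the slowdown then appears when assembling the rows returned from the collection.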
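
Following up on the profiling question in the comments, here is a minimal sketch that times the two parts of collection_get_row separately. The toy size, the shuffled labels and the random matrix are assumptions for illustration, not the asker's data, and the timings will vary by machine.

```python
import timeit

import numpy as np
import scipy.sparse

n_rows = 20_000  # assumed toy size; the question has 2,500,000 rows
rng = np.random.default_rng(0)
labels = tuple(rng.permutation(n_rows).tolist())  # stand-in for catalogue.index
collection = scipy.sparse.random(
    n_rows, 50, density=0.01, format="csr", random_state=0)

target = labels[-1]  # last label: worst case for a linear scan

# Step 1: label -> position, as done by catalogue_index.index(document_id)
t_lookup = timeit.timeit(lambda: labels.index(target), number=100)

# Step 2: extract one sparse row at an already known position
position = labels.index(target)
t_getrow = timeit.timeit(lambda: collection.getrow(position), number=100)

print(f"tuple.index: {t_lookup:.4f}s, getrow: {t_getrow:.4f}s (100 calls each)")
```

Timing the steps in isolation like this, without the thread pool, makes it easier to see which one dominates before trying to parallelize anything.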

Tags: python pandas scipy

【Solution 1】:

I eventually found a faster way to solve this. First, create a dictionary mapping catalogue index => collection index.

index_dict = dict(zip(
    catalogue.index.values.tolist(),
    range(collection.shape[0])))

Then my collection_filter_row becomes

def collection_filter_row(catalogue_filtered, index_dict, collection):
    return collection[[index_dict[document_id]
                       for document_id
                       in catalogue_filtered.index.values.tolist()]]
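
As a sanity check on the idea above, the following sketch builds the dictionary on a small made-up catalogue and collection (both are assumptions, chosen so the result is easy to read) and confirms that a single indexing call returns the same rows as stacking getrow results one by one.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Toy stand-ins (assumed data): 5 entities with non-sequential index labels.
catalogue = pd.DataFrame(
    {"some_column": ["a", "", "b", "", "c"]},
    index=[10, 42, 7, 99, 3],
)
collection = scipy.sparse.csr_matrix(np.arange(15).reshape(5, 3))

# Map each catalogue label to its row position in the sparse matrix.
index_dict = dict(zip(catalogue.index.values.tolist(),
                      range(collection.shape[0])))

wanted_labels = [7, 3, 10]
positions = [index_dict[label] for label in wanted_labels]

# One indexing call on the CSR matrix returns all requested rows at once.
subset = collection[positions]

# Same rows, fetched one at a time and stacked (the approach in the question).
stacked = scipy.sparse.vstack([collection.getrow(p) for p in positions])

print(positions)
print(subset.toarray())
```

Each dictionary lookup takes constant time on average, whereas tuple.index scans the labels linearly, and the single indexing call avoids building and stacking one small matrix per row.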

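To get a subset of the collection, I should have used catalogue.loc[catalogue.some_column != ''] rather than catalogue.where(), because where() keeps every row and only masks the non-matching values. The correct call to collection_filter_row is therefore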

collection_sub = collection_filter_row(
    catalogue.loc[catalogue.some_column != ''],
    index_dict,
    collection)
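
The difference between where() and .loc matters here because the filtered index is what drives the row selection. A small sketch on assumed toy data, using the Series form of where() as in the question's code:

```python
import pandas as pd

# Assumed toy data: three entities, one with an empty value.
catalogue = pd.DataFrame(
    {"some_column": ["a", "", "b"]},
    index=[10, 42, 7],
)

# where() keeps the original shape and replaces non-matching values with NaN.
masked = catalogue["some_column"].where(catalogue.some_column != "")

# Boolean indexing with .loc drops the non-matching rows.
filtered = catalogue.loc[catalogue.some_column != ""]

print(len(masked), masked.index.tolist())      # every label is still present
print(len(filtered), filtered.index.tolist())  # only matching labels remain
```

With where(), the index passed on to collection_filter_row still contains every label, so no rows are actually filtered out of the sparse matrix.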

This is much faster than the original approach shown in the question.
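
Since the question states that the rows of catalogue and collection are already aligned, the label-to-position dictionary could arguably be skipped by turning the boolean mask directly into integer positions. This is a different technique from the answer above, sketched on assumed toy data. The second variant, Index.get_indexer, recovers positions from labels and assumes the index labels are unique.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Toy stand-ins (assumed data), aligned row by row.
catalogue = pd.DataFrame(
    {"some_column": ["a", "", "b", "", "c"]},
    index=[10, 42, 7, 99, 3],
)
collection = scipy.sparse.csr_matrix(np.arange(15).reshape(5, 3))

# Boolean mask -> integer row positions, valid for both objects.
mask = (catalogue.some_column != "").to_numpy()
positions = np.flatnonzero(mask)

catalogue_sub = catalogue.iloc[positions]
collection_sub = collection[positions]

# Alternative when only a filtered frame is available: labels -> positions.
positions_from_labels = catalogue.index.get_indexer(catalogue_sub.index)

print(positions, collection_sub.shape)
```

Both variants avoid a Python-level loop over labels, which may matter at the 2,500,000-row scale described in the question.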

【Discussion】: