Python: how best to discover common indices faster? (Answers)

【Question title】: Python: how best to discover common indices faster?
【Posted】: 2021-10-10 16:25:37
【Question description】:

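I came up with the following way to find all common indices at which a value is present in both of two equal-length vectors. I like its readability, but I need it to be faster...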

missingA = np.argwhere(np.isnan(vectorA)==True);
missingA = [missingA[ma][0] for ma in range(len(missingA))];

missingB = np.argwhere(np.isnan(vectorB)==True);
missingB = [missingB[mb][0] for mb in range(len(missingB))];

allmissidxs = set(missingA).union(set(missingB)); 
idxs = [idx for idx in range(len(vectorA))   if idx not in allmissidxs];

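It definitely works, but the vectors I need to use it on range from 1 million to 3 million elements each... and it may need to run many times. I use "...if idx not in allmissidxs" rather than, say, "...if idx in allpresidxs" because the missing values are certainly the much smaller subset to scan. Also, I'm sure that having to restructure missingA and missingB, given the structure np.argwhere() naturally returns, doesn't help either, but is that really the bottleneck?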

Any help would be greatly appreciated! Thanks
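On the np.argwhere() restructuring mentioned above: for a 1-D input, np.argwhere returns an array of shape (n, 1), which is why each index has to be unpacked. A minimal sketch, assuming a small made-up vector, of how np.flatnonzero yields the same positions as a flat array with no unpacking step:

```python
import numpy as np

# Small illustrative vector (an assumption, not the asker's data)
vectorA = np.array([np.nan, 1.0, 2.0, 3.0, np.nan])

# For 1-D input, np.argwhere returns one row per hit: shape (n, 1)
nested = np.argwhere(np.isnan(vectorA))

# np.flatnonzero returns the same positions as a flat 1-D array
flat = np.flatnonzero(np.isnan(vectorA))

print(nested.shape, flat.shape)  # (2, 1) (2,)
print(flat.tolist())             # [0, 4]
```

Whether this step is the actual bottleneck would still need to be measured on the real data.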

【Question discussion】:

Tags: python performance numpy bigdata processing-efficiency

【Solution 1】:

Assume the source vectors are the same as in the other solutions:

vectorA = np.array([np.nan, 1., 2., 3.,     np.nan, 5.,     np.nan, 7.,
    8., np.nan])
vectorB = np.array([0.,     1., 2., np.nan, 4.,     np.nan, 6.,     np.nan,
    8., np.nan])

You can use a pandas Index and its intersection method to do the job. It can even be written as the following one-liner:

result = pd.Index(vectorA).intersection(vectorB)

The result is:

Float64Index([1.0, 2.0, 8.0], dtype='float64')

If you want the result as a NumPy vector, append .values to the code above. The result will then be:

array([1., 2., 8.])

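The advantage of this approach is that it avoids any list comprehension, so this code should run considerably faster than yours. Check it yourself on a larger data sample.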
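For the larger-sample check, a position-based baseline may be useful, since Index.intersection compares values rather than positions. The following is only a sketch under stated assumptions: the data is synthetic (random normal values with roughly 5% NaN), the size n is reduced for a quick run, and the helper names idxs_with_sets and idxs_with_mask are hypothetical. The mask variant is a different technique from the one in this answer; it marks the positions where neither vector is NaN.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # assumed size for a quick run; the question mentions 1 to 3 million
vectorA = rng.normal(size=n)
vectorB = rng.normal(size=n)
# Assumption: roughly 5% of each vector is missing
vectorA[rng.random(n) < 0.05] = np.nan
vectorB[rng.random(n) < 0.05] = np.nan


def idxs_with_sets(a, b):
    """Same logic as the question's code, condensed: build a set of
    missing positions, then filter all positions against it."""
    missing = set(np.flatnonzero(np.isnan(a)).tolist())
    missing |= set(np.flatnonzero(np.isnan(b)).tolist())
    return [i for i in range(len(a)) if i not in missing]


def idxs_with_mask(a, b):
    """Boolean mask that is True where neither vector is NaN,
    converted to the positions of its True entries."""
    return np.flatnonzero(~(np.isnan(a) | np.isnan(b)))


for fn in (idxs_with_sets, idxs_with_mask):
    start = time.perf_counter()
    out = fn(vectorA, vectorB)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {len(out)} indices in {elapsed:.4f} s")
```

Both helpers return the same positions; the timings depend on the machine and on the share of missing values, so they are printed rather than assumed.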

【Discussion】:

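Thanks! This does seem to run faster, but it includes nan in the result. Any idea why, even though it doesn't do so in the smaller toy example? a=np.random.normal(0,1,3000000); b=imbue_missing(a); result = pd.Index(a).intersection(b); print(result) >>> Float64Index([ 0.5743935953457322, -0.7174387885462609, nan, 1.427427325840093, -0.13925936048882145], dtype='float64', length=2791570) Also, I would prefer to get these values themselves. Is there a built-in way to do that?
Note that my source data does contain NaNs, as "true" np.nan values, and the result does not. Perhaps your source vectors contain "nan" as a string? If that is the case, first replace the "nan" strings with "true" NaN values and then run my code.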
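If the vectors really did hold "nan" as text, the conversion suggested above could look like the following sketch. The array raw is a made-up example, not the commenter's data; casting a string array to float parses the text "nan" into a real np.nan.

```python
import numpy as np

# Hypothetical input where missing values arrived as the text "nan"
raw = np.array(["nan", "1.5", "2.0"])

# Casting to float turns the string "nan" into a true np.nan
vec = raw.astype(float)

print(np.isnan(vec))  # [ True False False]
```

Data produced by np.random.normal is already of float dtype, so this only applies when the vector's dtype is a string or object type.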