python pandas: How to get the index of the values from a series that have matching values in another series? (Answers)

【Question title】: python pandas : How to get the index of the values from a series that have matching values in other series?
【Posted】: 2014-03-07 01:45:34
【Question】:

I have these two series:

In [48]: serie1
Out[48]: 
0    A
1    B
2    C
3    A
4    D

In [49]: serie2
Out[49]: 
0    X
1    Y
2    A
3    Z
4    A
5    D
dtype: object

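For each value in serie1, I want to get the index (or indices) of the matching values in serie2. Is this possible without iterating over the values? One possible solution would be to build a DataFrame more or less like this: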

       A      B      C      D
X    False  False  False  False                 
Y    False  False  False  False
A    True   False  False  False
Z    False  False  False  False
A    True   False  False  False
D    False  False  False  True

...and then get the indices of the "True" values in each column.
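The idea described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the names `values`, `table` and `matches` are made up here, and the table is built by NumPy broadcasting, which is one of several possible ways to do it.

```python
import pandas as pd

serie1 = pd.Series(['A', 'B', 'C', 'A', 'D'])
serie2 = pd.Series(['X', 'Y', 'A', 'Z', 'A', 'D'])

# Distinct values of serie1, in order of first appearance (illustrative name).
values = serie1.drop_duplicates().to_numpy()

# One row per element of serie2, one column per distinct value of serie1.
# The comparison broadcasts to shape (len(serie2), len(values)).
table = pd.DataFrame(
    serie2.to_numpy()[:, None] == values,
    index=serie2.index,
    columns=values,
)

# Index labels of the True cells in each column.
matches = {col: table.index[table[col]].tolist() for col in table.columns}
print(matches)  # {'A': [2, 4], 'B': [], 'C': [], 'D': [5]}
```

The table holds len(serie2) × (number of distinct values) cells, so this is practical only when both are reasonably small.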

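【Comments】:

What is your desired output data structure?
I don't have a desired output data structure. I just need to be able to find the match(es) (indices) of the serie1 values in serie2. The long answer: serie2 belongs to a DataFrame that has another column (let's call it long_name), and I want to add the correct long_name to serie1. Unfortunately, serie1 and serie2 are not related by index and do not have the same length.
When you say "... and then get the indices of the 'True' values in each column", you need either the boolean matrix/table/crosstab, or the matching indices, or both. In what form do you want the matching indices? As multiple Series? As a dict? Etc.
It looks like you need groupby(), as in serie2.groupby(serie1).indices; see my answer.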

Tags: python pandas series

【Solution 1】:

I think serie2.index[(array(serie2)=='A').flatten()] might be useful (array here being numpy.array). 'A' is the value whose index you want to find.

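Or this, which may be less readable: serie2.index[(serie2=='A')[0]] (the trailing [0] selects column 0, so this form assumes serie2 is a one-column DataFrame; for a plain Series, serie2.index[serie2=='A'] is enough).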
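For a plain Series, the boolean-mask approach can be sketched as below. The variable names `idx_a` and `idx_any` are illustrative, and the `isin` line is an addition that is not part of the answer; it shows how to test against all values of serie1 in one pass.

```python
import pandas as pd

serie1 = pd.Series(['A', 'B', 'C', 'A', 'D'])
serie2 = pd.Series(['X', 'Y', 'A', 'Z', 'A', 'D'])

# Index labels where serie2 equals one particular value.
idx_a = serie2.index[serie2 == 'A'].tolist()

# Index labels where serie2 equals any value that occurs in serie1.
idx_any = serie2.index[serie2.isin(serie1)].tolist()

print(idx_a)    # [2, 4]
print(idx_any)  # [2, 4, 5]
```

The `isin` mask says where any match occurs, but not which value of serie1 was matched.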

【Comments】:

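Thanks, but this would mean iterating over the columns (close to a million in my real case :/). Isn't there a faster way to check where the values of serie1 occur in serie2?
Can I assume that the unique values in serie1 are a subset of the unique values in serie2 (I think this must be true)? If so, are there very many unique values in serie1, or only a few?
Yes. The unique values of serie1 are a subset of the unique values of serie2. The problem is that some values of serie1 may have more than one match in serie2 (and therefore more than one long_name; see my explanation to @DSM if you like), and I don't want to lose that information.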
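If the end goal is to attach long_name to serie1 while keeping every match, a left merge on the value may be enough. The sketch below is not from the thread. It uses a hypothetical frame `df2` with made-up long_name strings, and the column names `code`, `pos1` and `pos2` are assumptions chosen for illustration.

```python
import pandas as pd

serie1 = pd.Series(['A', 'B', 'C', 'A', 'D'], name='code')

# Hypothetical stand-in for the DataFrame that holds serie2 and long_name.
df2 = pd.DataFrame({
    'code': ['X', 'Y', 'A', 'Z', 'A', 'D'],
    'long_name': ['x-ray', 'yankee', 'alpha', 'zulu', 'able', 'delta'],
})

# Keep the original positions of both sides as ordinary columns.
left = serie1.to_frame().reset_index().rename(columns={'index': 'pos1'})
right = df2.reset_index().rename(columns={'index': 'pos2'})

# A left merge on the value yields one row per (serie1 element, match) pair;
# elements of serie1 with no match keep a single row with missing values.
merged = left.merge(right, on='code', how='left')
print(merged)
```

Each 'A' in serie1 produces two rows here, one per matching long_name, so no match is discarded.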

【Solution 2】:

1) For a boolean match table: build a crosstab (which shows only the unique values, without duplicates) and then convert it to boolean:

import numpy as np
import pandas as pd

serie1 = pd.Series(['A','B','C','A','D'])
serie2 = pd.Series(['X','Y','A','Z','A','D'])

pd.crosstab(serie2,serie1) > 0

col_0      A      B      C      D
row_0                            
A      False  False   True   True
X       True  False  False  False
Y      False   True  False  False
Z       True  False  False  False

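(Note that the row index is sorted by value automatically, rather than kept in the order in which the values appear in serie2. You can override this with .reindex(...). Note also that crosstab pairs the two series element by element on their shared index labels, so a True cell here means the two values sit at the same position, not that they are equal.)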

2) As for the matching indices, here they are as a dict of arrays...

serie2.groupby(serie1).indices

{'A': array([0, 3]), 'C': array([2]), 'B': array([1]), 'D': array([4])}

# ... or as a list of arrays (list() is needed on Python 3, where .values() returns a view)...
list(serie2.groupby(serie1).indices.values())

[array([0, 3]), array([2]), array([1]), array([4])]

# Here are alternatives with list comprehensions which are probably less efficient than `Series.groupby()` 
>>> [ np.flatnonzero(serie2.apply(lambda i2: i2==i1)) for i1 in serie1 ]
[array([2, 4]), array([], dtype=int64), array([], dtype=int64), array([2, 4]), array([5])]

>>> [ np.flatnonzero(serie2.apply(lambda i2: i2==i1)).tolist() for i1 in serie1 ]
[[2, 4], [], [], [2, 4], [5]]
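One caveat on the groupby() results above: serie2.groupby(serie1) aligns the two series on their index labels, so the arrays it returns (for example 'A': [0, 3]) are the positions where each key occurs in serie1, not in serie2. A sketch that groups serie2 by its own values and then looks up each element of serie1 reproduces the list-comprehension output with only dictionary lookups. The names `positions` and `matches` are illustrative.

```python
import pandas as pd

serie1 = pd.Series(['A', 'B', 'C', 'A', 'D'])
serie2 = pd.Series(['X', 'Y', 'A', 'Z', 'A', 'D'])

# Group serie2 by its own values: value -> array of integer positions in serie2.
positions = serie2.groupby(serie2).indices

# Look up each element of serie1; values absent from serie2 get an empty list.
matches = [positions[v].tolist() if v in positions else [] for v in serie1]
print(matches)  # [[2, 4], [], [], [2, 4], [5]]
```

The arrays hold integer positions, which coincide with the index labels only because serie2 has the default 0..n-1 index.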

【Comments】: