Dealing with arrays: how to avoid a "for" statement

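【Question title】: Dealing with arrays: how to avoid a "for" statement
【Posted】: 2012-09-17 09:59:43
【Question】:

I have a 100000000x2 array named "a", with an index in the first column and a related value in the second column. I need to get the median of the numbers in the second column for each index. This is how I would do it with a for statement: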

import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in xrange(1000000):
    b[i]=np.median(a[np.where(a[:,0]==i),1])

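Obviously the for iteration is too slow. Any suggestions? Thanks.

【Question comments】:

Tags: python arrays for-loop numpy

【Solution 1】:

This is called a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this: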

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])

# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])

# Form the groups.
grouped = df.groupby('index')

# `result` is the DataFrame containing the aggregated results.
result = grouped.aggregate(np.median)
print(result)

Output:

       value
index       
1        3.0
2        5.5
5        8.5

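There are many ways to create a DataFrame that holds the original data directly, so it is not necessary to create the numpy array a first.

More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html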
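
As a small sketch of that last point, the DataFrame can be built directly from two columns instead of from a 2-D array. The column names and sample values below simply mirror the example above, and `groupby(...).median()` is used here as a shorthand for the aggregate call; treat it as an illustration, not the only way to do it.

```python
import pandas as pd

# Illustrative data: the same index/value pairs as in the example above.
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 2, 1, 1, 1, 5, 2, 5],
    "value": [2.0, 3.0, 5.0, 6.0, 8.0, 4.0, 1.0, 3.5, 8.0, 1.0, 9.0],
})

# Group by the index column and take the median of each group's values.
medians = df.groupby("index")["value"].median()
print(medians)
```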

【Comments】:

This is definitely the best solution, even though I don't like using libraries when they are not strictly needed. Thanks!

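【Solution 2】:

This is a little annoying, but at least you can easily get rid of the annoying == (which is probably your speed killer) by sorting. Trying for more is probably not very useful, though it may be possible if you do the sorting yourself, etc.: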

# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:,0]) # sort by class.
a = a[sorter] # sorted version of a

# Now we need to find where there are changes in the class:
w = np.where(a[:-1,0] != a[1:,0])[0] + 1 # Where the class changes.
# for simplicity, append [0] and [len(a)] to have full slices...
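w = np.concatenate(([0], w, [len(a)]))  # concatenate takes one sequence of arrays
result = np.zeros(len(w)-1, dtype=float)  # medians may be fractional
for i in range(len(w)-1):
    # median of the value column within each class slice
    result[i] = np.median(a[w[i]:w[i+1], 1])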

# If the classes are not exactly 1, 2, ..., N we could add class information:
classes = a[w[:-1],0]

If all your classes have the same size (that is, there are exactly as many 1s as 2s, and so on), there are better ways.
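
For the equal-size case, one possible approach is to sort by class, reshape the values into one row per class, and take the median along each row. This is only a guess at what the "better ways" might look like, not necessarily what the answer had in mind, and the sample array is made up for illustration.

```python
import numpy as np

# Illustrative data: three classes, each with exactly four rows.
a = np.array([[1, 2], [2, 3], [3, 9], [1, 4], [2, 5], [3, 1],
              [1, 6], [2, 7], [3, 3], [1, 8], [2, 1], [3, 5]])

# Sort rows by class so that each class occupies a contiguous block.
a_sorted = a[np.argsort(a[:, 0], kind="stable")]

classes, counts = np.unique(a_sorted[:, 0], return_counts=True)
# This trick is only valid when every class has the same number of rows.
assert np.all(counts == counts[0])

# One row per class, then a single vectorized median along each row.
values = a_sorted[:, 1].reshape(len(classes), counts[0])
medians = np.median(values, axis=1)
```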

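EDIT: Check Bitwise's version for a solution that also avoids the last for loop (he also hides some of this code inside np.unique, which you may prefer, since speed should not matter for that part anyway).

【Comments】:

Sebastian, I believe you can use np.diff(a[:,0]) to find where the index changes...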
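
A minimal sketch of the np.diff idea from this comment, assuming `a` has already been sorted by its first column. The sample data and the name `bounds` are illustrative.

```python
import numpy as np

# Illustrative array, already sorted by the first column.
a = np.array([[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [2, 6], [5, 8]])

# np.diff is nonzero exactly where consecutive index values differ;
# adding 1 gives the position of the first row of each new group.
w = np.flatnonzero(np.diff(a[:, 0])) + 1

# Add the start and the end so that consecutive entries delimit each group.
bounds = np.concatenate(([0], w, [len(a)]))
```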

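【Solution 3】:

Here is my version, with no for loop and no additional modules. The idea is to sort the array once; after that, you can get the positions of the medians simply by counting the indices in the first column of a: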

# sort by first column and then by second
b=a[np.lexsort((a[:,1],a[:,0]))]

# find central value for each index
c=np.unique(b[:,0],return_index=True)[1]
d=np.r_[c,len(a)]
inds=(d[1:]+d[:-1]-1)/2.0
# final result (as suggested by seberg)
medians=np.mean(np.c_[b[np.floor(inds).astype(int),1],
                      b[np.ceil(inds).astype(int),1]],1)

# inds is the index of the median value for each key

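You can shorten the code if you wish.

【Comments】:

That is indeed less code than I originally thought. As an idea for continuing the code: inds = np.column_stack([np.floor(inds), np.ceil(inds)]) and then result = a[inds,1].mean(1) (or something like that).
Are you sure this code returns the median and not the mean?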
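
One way to check the question raised in the last comment is to compare the result with `np.median` computed group by group. The sketch below reuses the sample array from Solution 1 and repeats the code from this answer. The final `np.mean` only averages the two middle elements of each sorted group, which is how the median is defined for an even number of values.

```python
import numpy as np

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])

# Code from the answer: sort by index, then by value.
b = a[np.lexsort((a[:, 1], a[:, 0]))]
keys, c = np.unique(b[:, 0], return_index=True)
d = np.r_[c, len(a)]
inds = (d[1:] + d[:-1] - 1) / 2.0
medians = np.mean(np.c_[b[np.floor(inds).astype(int), 1],
                        b[np.ceil(inds).astype(int), 1]], 1)

# Reference: a plain per-group median using a boolean mask.
expected = np.array([np.median(a[a[:, 0] == k, 1]) for k in keys])
print(np.allclose(medians, expected))
```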

【Solution 4】:

If you find yourself wanting to do this a lot, I recommend taking a look at the pandas library, which makes this as easy as pie:

>>> import pandas
>>> df = pandas.DataFrame([["A", 1], ["B", 2], ["A", 3], ["A", 4], ["B", 5]], columns=["One", "Two"])
>>> print(df)
  One  Two
0   A    1
1   B    2
2   A    3
3   A    4
4   B    5
>>> df.groupby('One').median()
      Two
One     
A    3.0
B    3.5

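【Comments】:

Thanks! This is the best solution, even though I would prefer not to use a library when it is not strictly needed.

【Solution 5】:

A quick one-liner approach: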

result = [np.median(a[a[:,0]==ii,1]) for ii in np.unique(a[:,0])]

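I don't believe there is much you can do to speed this up without sacrificing accuracy. But here is another attempt, which might be faster if you can skip the sorting step: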

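num_in_ind = np.bincount(a[:,0])  # rows per index; needs a non-negative integer first column
# Middle element of each sorted group (for even-sized groups, the upper middle value, not the average)
results = [np.sort(a[a[:,0]==ii,1])[num_in_ind[ii]//2] for ii in np.unique(a[:,0])]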

For small arrays, the latter is slightly faster. I don't know whether it is fast enough.
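
The accuracy trade-off mentioned above shows up for groups with an even number of rows: picking a single middle element of the sorted group differs from `np.median`, which averages the two middle values. A small illustration with made-up data follows; the names `exact`, `approx`, and `counts` are illustrative, and floor division is used for the position so that the sketch also runs on Python 3.

```python
import numpy as np

# Illustrative data: index 1 has three rows, index 2 has four.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4], [2, 10]])

keys = np.unique(a[:, 0])

# Exact medians, one boolean mask per index.
exact = [np.median(a[a[:, 0] == k, 1]) for k in keys]

# Approximation: the element at position count // 2 of each sorted group.
counts = np.bincount(a[:, 0])
approx = [np.sort(a[a[:, 0] == k, 1])[counts[k] // 2] for k in keys]
```

With this data the two approaches agree for the odd-sized group and differ for the even-sized one.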

【Comments】:

I don't know why, but I found that the "quick and dirty" one-line solution is fast enough for my needs. Thanks!