Binned statistics with irregular and alternating bins: answers

【Question title】: Binned statistics with irregular and alternating bins
【Posted】: 2019-01-07 22:57:26
【Question】:

This is a short, self-contained example of a more complex real-world application.

Libraries used:

import numpy as np
import scipy as sp
import scipy.stats as scist
import matplotlib.pyplot as plt
from itertools import zip_longest

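Data:

I have arrays that define irregular bins by their start and end, for example like this (in the real case this format is a given, because it is the output of another process):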

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

which I combine with:

# list(...) is needed because np.stack expects a sequence, not an iterator
bns = np.stack(list(zip_longest(bin_starts, bin_ends))).flatten()
bns
>>> array([  0,  89,  93, 178, 184, 272, 277, 363, 368, 458])

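This gives a regularly alternating sequence of long and short intervals, all of irregular length. Here is a sketch of the given long and short intervals: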
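The alternation can also be checked numerically. The following is a minimal sketch, not part of the original post: the names `edges`, `widths`, `long_widths` and `short_widths` are illustrative, and `np.column_stack(...).ravel()` is just one possible way to interleave the two arrays.

```python
import numpy as np

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# Interleave starts and ends into one edge array (same values as bns above).
edges = np.column_stack((bin_starts, bin_ends)).ravel()

# Widths of consecutive intervals: even positions are the long intervals,
# odd positions are the short gaps between them.
widths = np.diff(edges)
long_widths = widths[::2]
short_widths = widths[1::2]

print(long_widths)   # [89 85 88 86 90]
print(short_widths)  # [4 6 5 5]
```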

I have a bunch of time series data, similar to the random data created below:

# make some random example data to bin
np.random.seed(45)
x = np.arange(0,460)
y = 5+np.random.randn(460).cumsum()
plt.plot(x,y);

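Objective:

I want to use the interval sequence to collect statistics of the data (mean, percentiles, etc.), but using only the long intervals, i.e. the yellow intervals in the sketch.

Assumptions and clarifications:

The edges of the long intervals never overlap; in other words, there is always a short interval between two long intervals. Also, the first interval is always a long one.

Current solution:

One way is to use scipy.stats.binned_statistic on all intervals and then slice the result to keep only every other interval (i.e. [::2]), like so (reading this SO answer by @ali_m was a great help for some statistics, such as np.percentile):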

ave = scist.binned_statistic(x, y, 
                         statistic = np.nanmean, 
                         bins=bns)[0][::2]

This gives me the result I want:

plt.plot(np.arange(0,5), ave);
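The same slicing approach carries over to statistics that need extra arguments, such as np.percentile, because binned_statistic also accepts a callable. The sketch below is only an illustration: the 90th percentile and the names `p90_all` and `p90` are arbitrary choices, not taken from the original post. Note that binned_statistic treats every bin except the last one as half-open, so x = 89 falls into the first short interval rather than the first long one.

```python
import numpy as np
import scipy.stats as scist

bns = np.array([0, 89, 93, 178, 184, 272, 277, 363, 368, 458])

np.random.seed(45)
x = np.arange(0, 460)
y = 5 + np.random.randn(460).cumsum()

# A callable statistic receives the y values of one bin at a time.
p90_all = scist.binned_statistic(
    x, y, statistic=lambda v: np.percentile(v, 90), bins=bns
)[0]

# Keep only the long intervals (every other bin, starting with the first).
p90 = p90_all[::2]
print(p90.shape)  # (5,)
```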

Question: is there a more Pythonic way to do this (using any of NumPy, SciPy or Pandas)?

【Question comments】:

Tags: python numpy scipy statistics binning

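【Solution 1】:

I think a combination of IntervalIndex, pd.cut, groupby and agg is a relatively simple and straightforward way to get what you want.

I would first make the DataFrame (not sure whether this is the best way to get one from np arrays):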

import pandas as pd

df = pd.DataFrame()
df['x'], df['y'] = x, y

Then you can define your bins as a list of tuples:

bins = list(zip(bin_starts, bin_ends))

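Use the from_tuples() method of the pandas IntervalIndex to create the bins for later use in cut. This is useful because you do not have to rely on slicing the bns array to untangle the "regularly alternating sequence of long and short intervals"; instead, you explicitly define the bins you are interested in: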

ii = pd.IntervalIndex.from_tuples(bins, closed='both')

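The closed kwarg specifies whether the end points are included in the interval. For example, for the tuple (0, 89), an interval with closed='both' includes both 0 and 89 (as opposed to left, right or neither).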
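As a small illustration of what closed changes, here is a sketch using scalar pd.Interval objects; the variable names `both` and `left` are only for this example.

```python
import pandas as pd

both = pd.Interval(0, 89, closed='both')
left = pd.Interval(0, 89, closed='left')

# closed='both' contains both end points.
print(0 in both, 89 in both)  # True True

# closed='left' contains the left end point only.
print(0 in left, 89 in left)  # True False
```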

Then use pd.cut(), a method for binning values into intervals, to create a category column in the DataFrame. An IntervalIndex object can be passed via the bins kwarg:

df['bin'] = pd.cut(df.x, bins=ii)
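One consequence worth seeing in isolation: values that fall in the short gaps between the bins (or outside all of them) are not assigned to any interval and come back as NaN, so groupby later ignores them. A minimal sketch, with a handful of hand-picked x values that are not from the original post:

```python
import numpy as np
import pandas as pd

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
ii = pd.IntervalIndex.from_tuples(list(zip(bin_starts, bin_ends)), closed='both')

# 0, 89 and 93 lie on bin edges; 90 and 92 lie in a gap; 459 is past the last bin.
sample = pd.Series([0, 89, 90, 92, 93, 459])
binned = pd.cut(sample, bins=ii)

print(binned.isna().tolist())  # [False, False, True, True, False, True]
```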

Finally, use df.groupby() and .agg() to get whatever statistics you want:

df.groupby('bin')['y'].agg(['mean', np.std])

Which outputs:

                 mean       std
bin                            
[0, 89]     -4.814449  3.915259
[93, 178]   -7.019151  3.912347
[184, 272]   7.223992  5.957779
[277, 363]  15.060402  3.979746
[368, 458]  -0.644127  3.361927
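Since the question also asks about percentiles, the same pattern can be extended with a custom aggregation function. This is a sketch under assumptions: the helper `p90` and the 90th percentile are hypothetical choices for illustration, and `observed=True` is passed only to make the grouping over the categorical column explicit.

```python
import numpy as np
import pandas as pd

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

np.random.seed(45)
df = pd.DataFrame({'x': np.arange(0, 460),
                   'y': 5 + np.random.randn(460).cumsum()})

ii = pd.IntervalIndex.from_tuples(list(zip(bin_starts, bin_ends)), closed='both')
df['bin'] = pd.cut(df['x'], bins=ii)


def p90(values):
    """90th percentile of one group's y values (illustrative helper)."""
    return values.quantile(0.9)


# One row per long interval; the column for the custom function is named 'p90'.
stats = df.groupby('bin', observed=True)['y'].agg(['mean', 'median', p90])
print(stats)
```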

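【Comments】:

Thank you for your answer. As it happens, in the real-world case I do have the data in a Pandas DataFrame, but I decided to keep this small example more agnostic. Your solution looks neat and it works on the small example (with plt.plot(np.arange(0,5), df.groupby('bin')['y'].agg(['mean', np.std]).loc[:, 'mean']); I get the same plot as my previous one), and I will be happy to accept the answer once I have tested it on the real case. If you have time, would you expand it with a bit more explanation of pandas.IntervalIndex and pandas.cut? Pandas can be a bit intimidating, and the reference is rather dry (even for non-novices, IMHO).
I think the key here is that pandas.IntervalIndex allows tuples (as defined by your bins = list(zip(bin_starts, bin_ends))) to be used to define the bin edges. Clever.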