Using rolling_apply with a function that requires 2 arguments in Pandas: answers

【Question title】: Using rolling_apply with a function that requires 2 arguments in Pandas
【Posted】: 2014-10-29 18:47:23
【Question】:

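I'm trying to use rolling_apply with a formula that requires 2 arguments. To my knowledge, the only way to calculate the Kendall tau correlation with the standard tie correction included (unless you write the formula from scratch) is: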

>>> import scipy.stats
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print(scipy.stats.kendalltau(x, y)[0])
0.333333333333
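For reference, the "from scratch" route could look roughly like the sketch below. It counts concordant and discordant pairs and applies the usual tau-b tie correction. The function name `kendall_tau_b` is hypothetical (it is not part of SciPy), and this is an O(n^2) illustration rather than a replacement for `scipy.stats.kendalltau`.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Tie-corrected Kendall tau (tau-b) by brute-force pair counting.

    Hypothetical helper for illustration only; O(n^2) in the sample size.
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: drops out of the formula
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    # If one input is constant the statistic is undefined.
    return (concordant - discordant) / denom if denom else float("nan")


x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]

print(kendall_tau_b(x, y))  # 4 concordant, 2 discordant pairs: 2/6
print(kendall_tau_b(x, z))  # z contains one tie, so the correction applies
```

For x and y there are no ties, so the value reduces to (4 - 2) / 6, which agrees with the SciPy result above.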

I'm also aware of the problem with rolling_apply and functions that take two arguments, which has been discussed elsewhere.

Nevertheless, I'm still struggling to find a way to do the kendalltau calculation on a rolling basis for a DataFrame with several columns.

My DataFrame looks something like this:

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

I'm trying to create a function that does this:

In [1]:function(A, 3)  # A is df, 3 is the rolling window
Out[2]:
   A  B  C     AB     AC     BC  
1  1  5  2    NaN    NaN    NaN
2  2  4  4    NaN    NaN    NaN
3  3  3  1  -0.99  -0.33   0.33
4  4  2  2  -0.99  -0.33   0.33
5  5  1  4  -0.99   0.99  -0.99

As a very preliminary approach, I entertained the idea of defining the function like this:

def tau1(x):
    y = np.array(A['A'])  # keep one column fixed and run it against the other two
    tau, p_value = sp.stats.kendalltau(x, y)
    return tau

A['AB'] = pd.rolling_apply(A['B'], 3, lambda x: tau1(x))

Of course it didn't work. I got:

ValueError: all keys need to be the same shape

I understand this is not a trivial problem. I'd appreciate any input.

【Question comments】:

Tags: python numpy pandas scipy dataframe

【Solution 1】:

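As of Pandas 0.14, rolling_apply only passes NumPy arrays to the function. A possible workaround is to pass np.arange(len(A)) as the first argument to rolling_apply, so that the tau function receives the positions of the rows you wish to use. Then, within the tau function,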

B = A[[col1, col2]].iloc[idx]

returns a DataFrame containing all the rows required.

import numpy as np
import pandas as pd
import scipy.stats as stats
import itertools as IT

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]], 
                 columns=['A', 'B', 'C'], index = [1, 2, 3, 4, 5])

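# Loop over every unordered pair of columns: (A, B), (A, C), (B, C).
for col1, col2 in IT.combinations(A.columns, 2):
    def tau(idx):
        # idx holds the row positions of the current window, not the data.
        B = A[[col1, col2]].iloc[idx]
        return stats.kendalltau(B[col1], B[col2])[0]
    # Roll over the row positions so that tau can look up both columns.
    A[col1+col2] = pd.rolling_apply(np.arange(len(A)), 3, tau)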

print(A)

yields

   A  B  C  AB        AC        BC
1  1  5  2 NaN       NaN       NaN
2  2  4  4 NaN       NaN       NaN
3  3  3  1  -1 -0.333333  0.333333
4  4  2  2  -1 -0.333333  0.333333
5  5  1  4  -1  1.000000 -1.000000

【Comments】:

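Awesome, thank you very much! Is there a limit on the number of columns that I should be mindful of? This itertools functionality is great and far beyond my level, so I can't come up with any other intelligent question.
The number of combinations grows like n**2, where n is the number of columns, so tau is called on the order of n**2 * m times, where m = len(A). So this may take a while, especially if you have a lot of columns. Using itertools is really fun; it's not hard to learn and well worth the time.
60K rows x 4 columns ~ 7 minutes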
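To make the scaling estimate above concrete, here is a rough sketch that counts how many times tau would be called. It assumes the function is only called on full windows (the first `window - 1` rows give NaN, as in the output above). The helper name `tau_call_count` is hypothetical, and the window size of 3 used for the 60K-row case is an assumption, since the comment does not state it.

```python
from math import comb


def tau_call_count(n_cols, n_rows, window):
    """Hypothetical helper: number of kendalltau calls for a rolling pairwise run."""
    pairs = comb(n_cols, 2)                      # n * (n - 1) / 2 column pairs
    full_windows = max(n_rows - window + 1, 0)   # windows that contain `window` rows
    return pairs * full_windows


# The 5-row, 3-column example from the answer: 3 pairs x 3 full windows.
print(tau_call_count(3, 5, 3))       # 9

# 60K rows x 4 columns, assuming a window of 3: 6 pairs x 59998 windows.
print(tau_call_count(4, 60000, 3))   # 359988
```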
This helped me a lot, thanks! Since rolling_apply is being deprecated, would you consider updating your solution to pd.Series(np.arange(len(A)), index=A.index).rolling(3).apply(tau)?
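Following that suggestion, a version using the `.rolling(...).apply(...)` API might look like the sketch below. It rests on a few assumptions. The wrapper name `rolling_kendall` is hypothetical. `raw=True` is passed so that tau receives a NumPy array. The positions are cast back to integers because rolling windows are handed over as floats. The columns are also pulled out as NumPy arrays up front instead of calling `.iloc` inside tau, which is a small deviation from the answer's code.

```python
import itertools

import numpy as np
import pandas as pd
from scipy import stats

A = pd.DataFrame(
    [[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
    columns=["A", "B", "C"],
    index=[1, 2, 3, 4, 5],
)


def rolling_kendall(df, window):
    """Hypothetical helper: rolling Kendall tau for every pair of columns."""
    out = df.copy()
    # Roll over row positions so each call can look up two columns at once.
    positions = pd.Series(np.arange(len(df)), index=df.index)
    for col1, col2 in itertools.combinations(df.columns, 2):
        x = df[col1].to_numpy()
        y = df[col2].to_numpy()

        def tau(idx, x=x, y=y):
            rows = idx.astype(int)  # window positions arrive as floats
            return stats.kendalltau(x[rows], y[rows])[0]

        out[col1 + col2] = positions.rolling(window).apply(tau, raw=True)
    return out


result = rolling_kendall(A, 3)
print(result)
```

On the example frame this should reproduce the AB, AC and BC columns shown in the answer, with NaN in the first two rows.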