OLS rolling regression in Python error - IndexError: index out of bounds

【Question title】: OLS Rolling regression in Python Error - IndexError: index out of bounds
【Posted】: 2017-07-03 13:23:48
【Question description】:

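For my evaluation, I want to run a rolling OLS regression estimation with a window of 3 on the dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the format shown below. The third column (Y) of my dataset is the true value, and it is the value I want to predict (estimate).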

 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

Using a simple OLS regression estimation, I tried the following script.

# /usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('estimated_pred.csv')

model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']], 
                               window_type='rolling', window=3, intercept=True)
df['Y_hat'] = model.y_predict

print(df['Y_hat'])
print (model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)

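However, statsmodels or scikit-learn seem to be good choices for going beyond simple regression. I tried to get the following script working with statsmodels, but it returns IndexError: index out of bounds for larger subsets of the attached dataset (for example, more than 1000 rows).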

# /usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm


df=pd.read_csv('estimated_pred.csv')    
df=df.dropna() # to drop nans in case there are any
window = 3
#print(df.index) # to print index
df['a']=None #constant
df['b1']=None #beta1
df['b2']=None #beta2
for i in range(window,len(df)):
    temp=df.iloc[i-window:i,:]
    RollOLS=sm.OLS(temp.loc[:,'Y'],sm.add_constant(temp.loc[:,['time','X']])).fit()
    df.iloc[i,df.columns.get_loc('a')]=RollOLS.params[0]
    df.iloc[i,df.columns.get_loc('b1')]=RollOLS.params[1]
    df.iloc[i,df.columns.get_loc('b2')]=RollOLS.params[2]

#The following line gives us predicted values in a row, given the PRIOR row's estimated parameters
df['predicted']=df['a'].shift(1)+df['b1'].shift(1)*df['time']+df['b2'].shift(1)*df['X']

print(df['predicted'])
#print(df['b2'])

#print(RollOLS.predict(sm.add_constant(predict_x)))

print(temp)

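Finally, I want to predict Y, that is, predict the current value of Y from the previous 3 rolling values of X. How can we do this with statsmodels or scikit-learn, given that pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I cannot find any reference for it?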

【Comments】:

Can you report the full trace of the error?
Sure. Here is the full trace of the error (each "File" marks a new line): Traceback (most recent call last): File "../Desktop/rolling_regression/rolling_regression2.py", line 26, in <module> df.iloc[i,df.columns.get_loc('b2')]=RollOLS.params[2] File "../anaconda/lib/python3.5/site-packages/pandas/indexes/base.py", line 1986, in get_value return tslib.get_value_box(s, key) File "pandas/tslib.pyx", line 777, in pandas.tslib.get_value_box (pandas/tslib.c:17017) File "pandas/tslib.pyx", line 793, in pandas.tslib.get_value_box (pandas/tslib.c:16774) IndexError: index out of bounds
It looks like the call to sm.OLS succeeded. Please check/show RollOLS.params to make sure it actually has 3 entries.
Yes. It actually works only for the first rows of the dataset (e.g., 500 rows); the IndexError: index out of bounds error occurs when I try a larger subset of the dataset (say 1000 rows).
May I suggest using %debug and "u" to walk up the trace, so you can see whether the error occurs in RollOLS.params[2] or in df.iloc[i,df.columns.get_loc('b2')]. In any case, the error occurs on the last line of the loop, and it is caused by accessing something with a bad index; it has nothing to do with sm.OLS.
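One way to make that check explicit is to read the coefficients by label rather than by position. This is only a minimal sketch: the params Series below is hypothetical (the numbers are made up) and stands in for a fit whose design matrix had no const column.

```python
import pandas as pd

# Hypothetical fitted parameters from a window where no intercept column was present.
params = pd.Series({"time": 123.59, "X": 0.0})

# Label-based lookup returns the default instead of raising when an entry is missing,
# whereas positional access such as params[2] would be out of bounds here.
a = params.get("const", float("nan"))
b1 = params.get("time", float("nan"))
b2 = params.get("X", float("nan"))

print(len(params), a, b1, b2)
```

With this kind of lookup, a window that yields only two coefficients leaves a NaN in the intercept column instead of stopping the loop, which makes the affected rows easy to find afterwards.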

Tags: python python-3.x numpy scikit-learn statsmodels

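【Answer 1】:

I think I found your problem: according to the documentation of sm.add_constant, there is a parameter named has_constant that needs to be set to 'add' (the default is 'skip').

has_constant : str {'raise', 'add', 'skip'} Behavior if ``data'' already has a constant. The default will return data without adding another constant. If 'raise', will raise an error if a constant is present. Using 'add' will duplicate the constant, if one is present. Has no effect for structured or recarrays. There is no checking for a constant in this case.

Basically, in one iteration of the loop your variable time is constant within the subset, so the function does not add a constant column, and RollOLS.params therefore has only 2 entries.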

temp
Out[12]: 
        time   X     Y      a           b1           b2
541  0.16182  13  20.0  19.49      3.15289 -1.26116e-05
542  0.16182  14  20.0     20            0  7.10543e-15
543  0.16182  15  20.0     20 -7.45058e-09            0

sm.add_constant(temp.loc[:,['time','X']])
Out[13]: 
        time   X
541  0.16182  13
542  0.16182  14
543  0.16182  15

sm.add_constant(temp.loc[:,['time','X']], has_constant = 'add')
Out[14]: 
     const     time   X
541      1  0.16182  13
542      1  0.16182  14
543      1  0.16182  15

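So, if you pass has_constant = 'add' to sm.add_constant, the error goes away, but you would have two linearly dependent columns in the explanatory variables, which makes the matrix not invertible, hence the regression would not make sense.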
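The rank deficiency can be checked numerically. The sketch below uses NumPy only and assumes the three-row window from rows 541-543; the second window, with a varying time, is made up for comparison.

```python
import numpy as np

const = np.ones(3)
x = np.array([13.0, 14.0, 15.0])

# Window where `time` is constant: the time column is just 0.16182 * const.
time_flat = np.array([0.16182, 0.16182, 0.16182])
design_flat = np.column_stack([const, time_flat, x])

# Made-up window where `time` varies and is not a linear function of x.
time_varying = np.array([0.10, 0.25, 0.30])
design_varying = np.column_stack([const, time_varying, x])

rank_flat = np.linalg.matrix_rank(design_flat)
rank_varying = np.linalg.matrix_rank(design_varying)
print(rank_flat, rank_varying)
```

A rank below the number of columns means the three coefficients cannot be separated from one another, whichever solver is used.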

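【Comments】:

Thanks FLab. I still don't understand why it works without error on, say, a 100-row dataset. And what does "but you would have two linearly dependent columns in the explanatory variables, which makes the matrix not invertible hence the regression would not make sense" mean? I thought df['a'] was the constant in my script.
I think indices 541-543 are the first place where time stays constant across 3 observations. On the second point, look at the last output of the code snippet. Basically you have time = 0.16182 * const, so the rank of your matrix is 2 (not 3). This problem is called multicollinearity (perfect multicollinearity, in this case): en.wikipedia.org/wiki/Multicollinearity
Aha, perfect, thanks. When we run print(temp), it only prints the last 3 predictions. What if we want to print all the predictions?
Do you mean all the predictions so far? You can retrieve them from df.
Glad it helped. In general, with OLS your regression is: y = constant + b1 * time + b2 * x. So if you have a new observation (t*, x*), you can estimate y* by plugging those values into the formula with the coefficients you estimated by OLS. Now, you are doing a rolling regression, so you re-estimate the coefficients for each window. You usually do this to see whether the relationship between Y, time, and X changes over time. As a side note, 3 observations look too short to me to capture a stable relationship.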
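The plug-in formula combined with per-window re-estimation can be sketched as follows. This is an illustration under stated assumptions, not the script from the question: it uses NumPy's least-squares solver instead of statsmodels, the helper name rolling_ols_predict is hypothetical, the data are synthetic, and collinear windows are skipped (left as NaN) as one possible policy.

```python
import numpy as np

def rolling_ols_predict(time, x, y, window):
    """Predict y[i] from an OLS fit on the `window` rows before row i."""
    n = len(y)
    pred = np.full(n, np.nan)
    for i in range(window, n):
        rows = slice(i - window, i)
        # Explicit intercept column, so the design always has three columns.
        design = np.column_stack([np.ones(window), time[rows], x[rows]])
        if np.linalg.matrix_rank(design) < design.shape[1]:
            continue  # collinear window: coefficients are not identified
        coef, *_ = np.linalg.lstsq(design, y[rows], rcond=None)
        # Plug the current row into y = a + b1 * time + b2 * x.
        pred[i] = coef[0] + coef[1] * time[i] + coef[2] * x[i]
    return pred

# Synthetic data that follows the linear relation exactly (assumed for illustration).
time = np.linspace(0.0, 1.0, 20)
x = np.arange(20.0) ** 2
y = 2.0 + 3.0 * time + 0.5 * x

pred = rolling_ols_predict(time, x, y, window=5)
print(pred[:8])
```

A window of 5 is used here only because 3 observations with 3 coefficients leave no degrees of freedom. Later statsmodels releases also provide a RollingOLS class in statsmodels.regression.rolling, which may be worth checking as a replacement for the removed MovingOLS.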