【Question title】:Rolling OLS using time as the independent variable with pandas
【Posted】:2019-08-21 21:47:50
【Question】:

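I am trying to build a rolling OLS model in pandas using a DataFrame that holds a time series of stock prices. What I want to do is run the OLS calculation over the past N days, return the predicted price and the slope, and add them to their own columns in the DataFrame. As far as I can tell, my only option is PandasRollingOLS from pyfinance, so I will use it in my example, but if there is another way I am happy to use that instead.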

For example, my DataFrame looks like this:

Date                     Price
....
2019-03-31 08:59:59.999  1660
2019-03-31 09:59:59.999  1657
2019-03-31 10:59:59.999  1656
2019-03-31 11:59:59.999  1652
2019-03-31 12:59:59.999  1646
2019-03-31 13:59:59.999  1645
2019-03-31 14:59:59.999  1650
2019-03-31 15:59:59.999  1669
2019-03-31 16:59:59.999  1674

I want to run a rolling regression using the Date column as the independent variable. Normally I would do this:

X = df['Date']
y = df['Price']
model = ols.PandasRollingOLS(y, X, window=250)

However, unsurprisingly, using df['Date'] as my X returns an error.

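So my first question is: what do I need to do to my Date column to get PandasRollingOLS to work? My next question is: what exactly do I need to call to get back the predicted value and the slope? With a regular OLS I would do something like model.predict and model.slope, but those options are obviously not available with PandasRollingOLS.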

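I actually want to add these values to new columns in my df, so I was thinking of something like df['Predict'] = model.predict, but obviously that is not the answer. The ideal resulting df would look something like this: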

Date                     Price  Predict  Slope
....
2019-03-31 08:59:59.999  1660   1665     0.10
2019-03-31 09:59:59.999  1657   1663     0.10
2019-03-31 10:59:59.999  1656   1661     0.09
2019-03-31 11:59:59.999  1652   1658     0.08
2019-03-31 12:59:59.999  1646   1651     0.07
2019-03-31 13:59:59.999  1645   1646     0.07
2019-03-31 14:59:59.999  1650   1643     0.07
2019-03-31 15:59:59.999  1669   1642     0.07
2019-03-31 16:59:59.999  1674   1645     0.08

Any help would be much appreciated, cheers.

【Question comments】:

    Tags: python pandas regression linear-regression


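    【Solution 1】:

    You can convert the dates to numeric timestamps using datetime.datetime.strptime and time.mktime, and then use statsmodels together with a custom function that handles the rolling windows to build a model for each desired subset of your DataFrame:

    Output: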

                             Price      Predict     Slope
    Date                                                 
    2019-03-31 10:59:59.999   1656  1657.670504  0.000001
    2019-03-31 11:59:59.999   1652  1655.003830  0.000001
    2019-03-31 12:59:59.999   1646  1651.337151  0.000001
    2019-03-31 13:59:59.999   1645  1647.670478  0.000001
    2019-03-31 14:59:59.999   1650  1647.003818  0.000001
    2019-03-31 15:59:59.999   1669  1654.670518  0.000001
    2019-03-31 16:59:59.999   1674  1664.337207  0.000001
    

    Code:

    #%%
    # imports
    import datetime, time
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm
    from collections import OrderedDict
    
    # your data in a more easily reproducible format
    data = {'Date': ['2019-03-31 08:59:59.999', '2019-03-31 09:59:59.999', '2019-03-31 10:59:59.999',
            '2019-03-31 11:59:59.999',  '2019-03-31 12:59:59.999', '2019-03-31 13:59:59.999',
            '2019-03-31 14:59:59.999', '2019-03-31 15:59:59.999', '2019-03-31 16:59:59.999'],
            'Price': [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674]}
    
    # function to make a useful time structure as independent variable
    def myTime(date_time_str):
        date_time_obj = datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S.%f')
        return(time.mktime(date_time_obj.timetuple()))
    
    # add time structure to dataset
    data['Time'] = [myTime(obs) for obs in data['Date']]
    
    # time for pandas
    df = pd.DataFrame(data)
    
    # Function for rolling OLS of a desired window size on a pandas dataframe
    
    def RegressionRoll(df, subset, dependent, independent, const, win):
        """
        RegressionRoll takes a dataframe, makes a subset of the data if you like,
        and runs a series of regressions with a specified window length, and
        returns a dataframe with the price, the prediction and the coefficients
        (slope, plus the constant if requested) for each window of the data.
    
        Parameters:
        ===========
        df -- pandas dataframe
        subset -- integer - has to be smaller than the size of the df or 0 if no subset.
        dependent -- string that specifies name of dependent variable
        independent -- LIST of strings that specifies name of independent variables
        const -- boolean - whether or not to include a constant term
        win -- integer - window length of each model
    
        Example:
        ========
        df_rolling = RegressionRoll(df=df, subset = 0, 
                                    dependent = 'Price', independent = ['Time'],
                                    const = False, win = 3)
    
        """
    
        # Data subset
        if subset != 0:
            df = df.tail(subset)
        else:
            df = df
    
        # Loopinfo
        end = df.shape[0]+1
        win = win
        rng = np.arange(start = win, stop = end, step = 1)
    
        # Subset and store dataframes
        frames = {}
        n = 1
    
        for i in rng:
            df_temp = df.iloc[:i].tail(win)
            newname = 'df' + str(n)
            frames.update({newname: df_temp})
            n += 1
    
        # Analysis on subsets
        df_results = pd.DataFrame()
        for frame in frames:
    
            # debug
            # print(frames[frame])
    
            # Rolling data frames
            dfr = frames[frame]
            y = dependent
            x = independent
    
            # Model with or without constant
            if const:
                x = sm.add_constant(dfr[x])
                model = sm.OLS(dfr[y], x).fit()
            else:
                model = sm.OLS(dfr[y], dfr[x]).fit()
    
            # Retrieve price and price prediction (fitted value of the last row in the window)
            Prediction = model.predict()[-1]
            d = {'Price': dfr['Price'].iloc[-1], 'Predict': Prediction}
            df_prediction = pd.DataFrame(d, index=dfr['Date'][-1:])
    
            # Retrieve parameters (constant and slope, or slope only)
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            df_temp.index = dfr['Date'][-1:]
    
            # Build dataframe with Price, Prediction and Slope (+constant if desired)
            df_temp2 = pd.concat([df_prediction, df_temp], axis=1)
            df_temp2 = df_temp2.rename(columns={'Time': 'Slope'})
            df_results = pd.concat([df_results, df_temp2], axis=0)
    
        return df_results
    
    # test run
    df_rolling = RegressionRoll(df=df, subset = 0, 
                                dependent = 'Price', independent = ['Time'],
                                const = False, win = 3)
    print(df_rolling)
    

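    The code could easily be shortened by defining fewer variables and putting more of the expressions directly into the dictionary and the function, but first we can check whether the output it produces is in fact the output you want. Also, you did not specify whether to include a constant term in the analysis, so I have included an option to handle that as well.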
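
As a shorter variant of the same idea, here is a minimal sketch that stays within pandas. It is not the code from the answer above and it makes different assumptions: the Date column is converted to elapsed hours since the first row (instead of epoch seconds from time.mktime, which depends on the local time zone), and every window is fitted with an intercept. The helper name rolling_linear_fit and the Hours column are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-03-31 08:59:59.999", "2019-03-31 09:59:59.999", "2019-03-31 10:59:59.999",
        "2019-03-31 11:59:59.999", "2019-03-31 12:59:59.999", "2019-03-31 13:59:59.999",
        "2019-03-31 14:59:59.999", "2019-03-31 15:59:59.999", "2019-03-31 16:59:59.999",
    ]),
    "Price": [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674],
})

# Assumption: measure time as hours elapsed since the first observation,
# so the slope reads as "price change per hour".
df["Hours"] = (df["Date"] - df["Date"].iloc[0]).dt.total_seconds() / 3600.0


def rolling_linear_fit(frame, x_col, y_col, win):
    """Simple linear regression (with intercept) over each trailing window.

    Uses the closed form slope = cov(x, y) / var(x) and
    intercept = mean(y) - slope * mean(x), evaluated per window.
    """
    x = frame[x_col]
    y = frame[y_col]
    slope = y.rolling(win).cov(x) / x.rolling(win).var()
    intercept = y.rolling(win).mean() - slope * x.rolling(win).mean()
    out = frame.copy()
    out["Slope"] = slope
    # Fitted value at the last row of each window
    out["Predict"] = intercept + slope * x
    return out


result = rolling_linear_fit(df, "Hours", "Price", win=3)
print(result[["Date", "Price", "Predict", "Slope"]].dropna())
```

The first win - 1 rows have no full window, so their Predict and Slope are NaN. Because this sketch fits an intercept and uses hours, its numbers are not comparable to the output table above, which was produced with const = False on epoch seconds.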

    【Discussion】:

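    • @top bantz Glad to help! Similar, but not identical, questions about rolling regressions come up from time to time. One of the most interesting parts of your question is how to build the desired output. You can also take a look at my post Statsmodels OLS with rolling window problem for a broader approach to your challenge, which includes options for other parameters such as R^2.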
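
The R^2 option mentioned in the comment above is not shown in this thread. As a rough sketch only, and not the approach from the linked post: for a single regressor fitted with an intercept, R^2 equals the squared correlation between x and y, so it can be computed per window with a rolling correlation. The Hours and R2 column names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2019-03-31 08:59:59.999", periods=9, freq=pd.Timedelta(hours=1)),
    "Price": [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674],
})

# Assumption: time expressed as elapsed hours since the first row
df["Hours"] = (df["Date"] - df["Date"].iloc[0]).dt.total_seconds() / 3600.0

win = 3
# Valid only for one regressor plus an intercept: R^2 = corr(x, y) ** 2
df["R2"] = df["Price"].rolling(win).corr(df["Hours"]) ** 2

print(df[["Date", "Price", "R2"]].dropna())
```

With more than one independent variable, or with no constant term, this shortcut no longer applies and the R^2 has to come from the fitted model for each window.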