Reading Notes on Practical Data Analysis Cookbook (Tomasz Drabas), Chapter 7: Time Series Techniques (ARMA and ARIMA Models)

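Chapter 7 explores how to process and understand time series data and how to build ARMA and ARIMA models. Note: this chapter took me longer than most, mainly because I was not yet comfortable with the DataFrame structure.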

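This chapter covers a range of techniques for processing, analyzing, and forecasting time series data. It includes the following recipes:
·Handling date objects in Python
·Understanding time series data
·Smoothing and transforming the observations
·Filtering the time series data
·Removing trend and seasonality
·Forecasting the future with ARMA and ARIMA models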

7.1 Introduction

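Time series are everywhere; if you analyze the stock market, sunspots, or river flows, you are observing a phenomenon that unfolds over time. Sooner or later, every data scientist has to deal with time series data. In this chapter, we go through several techniques for processing, analyzing, and modeling time series.
The datasets used in this chapter come from a web archive of river flow data: http://ftp.uni-bayreuth.de/math/statlib/datasets/riverflow. The archive is essentially a shell script that creates the datasets for this chapter. To create the raw files from the archive, you can use Cygwin on Windows or Terminal on Mac/Linux and run the following command (assuming you saved the archive as riverflows.webarchive):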

/*  
sh riverflows.webarchive
*/

Yaoyue's tip: installing Cygwin is a real hassle; it is easier to run the script in a CentOS virtual machine that is already set up.

7.2 Handling date objects in Python

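A time series is data sampled at some time interval, for example, the speed of a car recorded every second. With such data, we can easily estimate the distance travelled (summing the observations and dividing by 3600, assuming the speed is expressed per hour) or the car's acceleration (taking the difference between two consecutive observations). pandas can handle time series data directly.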
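The arithmetic in the car example can be sketched as follows. The speed readings, their unit (km/h), and the timestamps are made up for illustration; they are not from the book.

```python
import pandas as pd

# Hypothetical speed readings in km/h, one sample per second
speed_kmh = pd.Series(
    [36.0, 36.0, 39.6, 43.2, 43.2],
    index=pd.date_range(
        "2024-01-01 08:00:00", periods=5, freq=pd.Timedelta(seconds=1)
    ),
)

# Each sample covers one second, i.e. 1/3600 of an hour,
# so the distance in km is the sum of the readings divided by 3600
distance_km = speed_kmh.sum() / 3600

# Acceleration estimate: difference between consecutive observations
# (km/h per second); the first value is NaN because it has no predecessor
acceleration = speed_kmh.diff()

print(distance_km)
print(acceleration)
```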
Getting ready: you need pandas, NumPy, and Matplotlib installed.

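How to do it: starting from the web archive, we clean the data and form two datasets: the American River (http://www.theamericanriver.com) and the Columbia River (http://www.ecy.wa.gov/programs/wr/cwp/cwpfactmap.html). Reading a time series dataset with pandas is straightforward (the ts_handlingData.py file):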

import numpy as np
import pandas as pd
import pandas.tseries.offsets as ofst
import matplotlib
import matplotlib.pyplot as plt

# change the font size
matplotlib.rc('xtick', labelsize=9)
matplotlib.rc('ytick', labelsize=9)
matplotlib.rc('font', size=14)

# files we'll be working with
files=['american.csv', 'columbia.csv']

# folder with data
data_folder = '../../Data/Chapter07/'

# colors
colors = ['#FF6600', '#000000', '#29407C', '#660000']

# read the data
american = pd.read_csv(data_folder + files[0],
    index_col=0, parse_dates=[0],
    header=0, names=['','american_flow'])

columbia = pd.read_csv(data_folder + files[1],
    index_col=0, parse_dates=[0],
    header=0, names=['','columbia_flow'])

# combine the datasets
riverFlows = american.combine_first(columbia)

# periods aren't equal in the two datasets so find the overlap
# find the first month where the flow is missing for american
idx_american = riverFlows \
    .index[riverFlows['american_flow'].apply(np.isnan)].min()

# find the last month where the flow is missing for columbia
idx_columbia = riverFlows \
    .index[riverFlows['columbia_flow'].apply(np.isnan)].max()

# truncate the time series
riverFlows = riverFlows.truncate(
    before=idx_columbia + ofst.DateOffset(months=1),
    after=idx_american - ofst.DateOffset(months=1))

Tips:

/*
 Traceback (most recent call last):
  File "D:\Java2018\practicalDataAnalysis\Codes\Chapter07\ts_handlingData.py", line 49, in <module>
    o.write(riverFlows.to_csv(ignore_index=True))
TypeError: to_csv() got an unexpected keyword argument 'ignore_index'

D:\Java2018\practicalDataAnalysis\Codes\Chapter07\ts_handlingData.py:80: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).mean()
  year = riverFlows.resample('A', how='mean')
 
 */

Solution:

/*
# year = riverFlows.resample('A', how='mean')
year = riverFlows.resample('A').mean()

# o.write(riverFlows.to_csv(ignore_index=True))
o.write(riverFlows.to_csv(index=True))
 */
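The corrected resample call can be tried on a small synthetic frame. This is only a sketch: the monthly frame and its flow column are made up, and it passes an offset object (pd.offsets.YearEnd()) instead of the 'A' alias, on the assumption that alias strings may differ between pandas releases.

```python
import pandas as pd

# Hypothetical monthly data: 24 month-end dates starting January 1933
idx = pd.date_range("1933-01-31", periods=24, freq=pd.offsets.MonthEnd())
monthly = pd.DataFrame({"flow": [float(i) for i in range(1, 25)]}, index=idx)

# New-style syntax: resample(...) returns a resampler, and the
# aggregation is a method call instead of the removed how= argument
yearly_mean = monthly.resample(pd.offsets.YearEnd()).mean()

print(yearly_mean)
```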

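How it works: first, we import all the necessary modules, including pandas and NumPy. We work with two files, american.csv and columbia.csv, both located in data_folder.
We use the already familiar pandas .read_csv(...) method, starting with american.csv. Specifying index_col=0 makes the method use the first column as the index. To have pandas treat a column as dates, we explicitly tell .read_csv(...) to parse it as dates (parse_dates).
We also override the column names: header=0 tells the method that the first row is a header, and names supplies our own names to replace it. Note that the first column does not need a name, because it is converted into the index. We read the Columbia River data in the same way.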
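The effect of these read_csv arguments can be checked on an in-memory sample. The CSV text below is a hypothetical stand-in; the actual layout of american.csv is assumed, not verified.

```python
import io
import pandas as pd

# Hypothetical two-column CSV with a header row
raw = io.StringIO(
    "date,flow\n"
    "1933-01-31,10.7887\n"
    "1933-02-28,14.6115\n"
    "1933-03-31,19.6236\n"
)

sample = pd.read_csv(
    raw,
    index_col=0,                  # first column becomes the index
    parse_dates=[0],              # parse that column as dates
    header=0,                     # row 0 is the existing header ...
    names=["", "american_flow"],  # ... which these names replace
)

print(type(sample.index))
print(sample)
```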
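With both files read, we join them. The pandas .combine_first(...) method operates on the first dataset and adds the column from the Columbia River dataset.
If we had not changed the column names, .combine_first(...) would instead use the data from the DataFrame passed as the argument to fill the gaps in the calling DataFrame.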
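Both behaviours of .combine_first(...) can be seen on toy frames. The values and the shared column name flow are invented for this sketch.

```python
import numpy as np
import pandas as pd

idx_a = pd.to_datetime(["1932-11-30", "1932-12-31", "1933-01-31"])
idx_c = pd.to_datetime(["1933-01-31", "1933-02-28"])

a = pd.DataFrame({"american_flow": [1.0, 2.0, 3.0]}, index=idx_a)
c = pd.DataFrame({"columbia_flow": [10.0, 20.0]}, index=idx_c)

# Different column names: the result holds the union of rows and columns,
# with NaN wherever a river has no reading for a date
combined = a.combine_first(c)

# Same column name: gaps in the caller are filled from the argument,
# while existing values in the caller are kept
left = pd.DataFrame({"flow": [1.0, np.nan]}, index=idx_c)
right = pd.DataFrame({"flow": [9.0, 20.0]}, index=idx_c)
filled = left.combine_first(right)

print(combined)
print(filled)
```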
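The two files cover different periods that overlap: the American River data runs from 1906 to 1960, and the Columbia River data from 1933 to 1969. We keep only the overlap, so the combined dataset holds data from 1933 to 1960.
First, we find the earliest date for which the American River has no data (the american_flow column). We check the index of riverFlows and select all the dates where american_flow is NaN (not a number), using the .apply(...) method with NumPy's .isnan function to test the elements. We then take the minimum date of that selection.
For columbia_flow, we look for the latest date with no data. As with the American River, we first select all the dates where the value is NaN and then take the maximum.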
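The boundary search can be sketched on a six-row toy frame. The values are invented, and the sketch swaps in .isna(), which should select the same rows as .apply(np.isnan) on a float column.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["1932-11-30", "1932-12-31", "1933-01-31",
     "1933-02-28", "1933-03-31", "1933-04-30"]
)
flows = pd.DataFrame(
    {
        "american_flow": [5.0, 6.0, 7.0, 8.0, np.nan, np.nan],
        "columbia_flow": [np.nan, np.nan, 24.1, 20.8, 23.0, 37.7],
    },
    index=idx,
)

# A boolean mask on the column selects the matching index entries
first_missing_american = flows.index[flows["american_flow"].isna()].min()
last_missing_columbia = flows.index[flows["columbia_flow"].isna()].max()

print(first_missing_american)
print(last_missing_columbia)
```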
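The .truncate(...) method removes rows from a DataFrame based on its DatetimeIndex.
A DatetimeIndex is an immutable array of numbers. Internally it is stored as large integers, but it presents itself as date-time objects, that is, objects with both a date part and a time part.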

See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html.
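A quick way to see both properties of a DatetimeIndex is to look at its raw values and try to modify it. This is a minimal sketch with two arbitrary dates; the time unit of the integers is deliberately not assumed here.

```python
import pandas as pd

dates = pd.DatetimeIndex(["1933-01-31", "1933-02-28"])

# The underlying storage is 64-bit integers counting time units from the
# Unix epoch (1970-01-01); dates before the epoch are negative numbers
as_integers = dates.to_numpy().astype("int64")

# Elements still behave like date-time objects with date and time parts
first = dates[0]

# The index is immutable: item assignment raises TypeError
try:
    dates[0] = pd.Timestamp("2000-01-01")
    mutated = True
except TypeError:
    mutated = False

print(as_integers, first, mutated)
```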
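We pass two parameters to .truncate(...): before specifies the date before which all records are discarded, and after specifies the last date to keep.
The idx_... objects hold the earliest and latest dates on which at least one column has no data. If we passed these dates to .truncate(...) unchanged, we would also keep those boundary rows, which have missing data. To avoid this, we shift each date with .DateOffset(...), moving it by just one month.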
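The need for the one-month shift shows up on a toy frame. The values and the two boundary dates below are invented for the sketch; in the real script the boundaries come from the idx_... objects.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["1932-12-31", "1933-01-31", "1933-02-28",
     "1933-03-31", "1933-04-30", "1933-05-31"]
)
demo = pd.DataFrame(
    {
        "american_flow": [6.0, 7.0, 8.0, 9.0, 9.5, np.nan],
        "columbia_flow": [np.nan, 24.1, 20.8, 23.0, 37.7, 118.9],
    },
    index=idx,
)

last_gap_columbia = pd.Timestamp("1932-12-31")   # hypothetical boundary
first_gap_american = pd.Timestamp("1933-05-31")  # hypothetical boundary

# Both boundaries are inclusive, so the rows with gaps survive
inclusive = demo.truncate(before=last_gap_columbia, after=first_gap_american)

# Shifting each boundary by one month drops the incomplete rows
trimmed = demo.truncate(
    before=last_gap_columbia + pd.DateOffset(months=1),
    after=first_gap_american - pd.DateOffset(months=1),
)

print(inclusive)
print(trimmed)
```

One thing worth checking on your own data: DateOffset(months=1) keeps the day of the month where it can, so subtracting it from 30 April gives 30 March, which falls before a 31 March row. The boundaries in the book's data happen to land on a 31st.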
For more about .DateOffset(...), see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects.
Finally, we save the combined dataset to a file (see section 1.2 of the book for more information).

/*
Index of riverFlows
DatetimeIndex(['1933-01-31', '1933-02-28', '1933-03-31', '1933-04-30',
               '1933-05-31', '1933-06-30', '1933-07-31', '1933-08-31',
               '1933-09-30', '1933-10-31',
               ...
               '1960-03-31', '1960-04-30', '1960-05-31', '1960-06-30',
               '1960-07-31', '1960-08-31', '1960-09-30', '1960-10-31',
               '1960-11-30', '1960-12-31'],
              dtype='datetime64[ns]', name='', length=336, freq=None)

csv_read['1933':'1934-06']
            american_flow  columbia_flow
                                        
1933-01-31        10.7887          24.10
1933-02-28        14.6115          20.81
1933-03-31        19.6236          22.96
1933-04-30        21.9739          37.66
1933-05-31        28.0054         118.93
1933-06-30        66.0632         331.31
1933-07-31       113.4373         399.27
1933-08-31       162.0007         250.89
1933-09-30       156.6771         116.10
1933-10-31        17.9246          69.38
1933-11-30         7.0792          52.95
1933-12-31         4.0493          40.21
1934-01-31        11.5816          35.40
1934-02-28        18.5192          28.88
1934-03-31        53.8586          33.41
1934-04-30        75.8608         102.22
1934-05-31        89.3963         259.67
1934-06-30       116.2973         390.77

Shifting one month forward
            american_flow  columbia_flow
                                        
1933-02-28        10.7887          24.10
1933-03-31        14.6115          20.81
1933-04-30        19.6236          22.96
1933-05-31        21.9739          37.66
1933-06-30        28.0054         118.93
1933-07-31        66.0632         331.31

Shifting one year forward
            american_flow  columbia_flow
                                        
1934-01-31        10.7887          24.10
1934-02-28        14.6115          20.81
1934-03-31        19.6236          22.96
1934-04-30        21.9739          37.66
1934-05-31        28.0054         118.93
1934-06-30        66.0632         331.31

Averaging by quarter
            american_flow  columbia_flow
                                        
1933-03-31      15.007933      22.623333
1933-06-30      38.680833     162.633333

Averaging by half a year
            american_flow  columbia_flow
                                        
1933-01-31      10.788700      24.100000
1933-07-31      43.952483     155.156667

Averaging by year
            american_flow  columbia_flow
                                        
1933-12-31      51.852875     123.714167
1934-12-31      44.334742     128.226667 */
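The code that printed the shifted and averaged tables is not shown in these notes. The sketch below is one way such tables could be produced, not necessarily the book's code: it reuses the first six american_flow values from the output above and passes offset objects instead of alias strings.

```python
import pandas as pd

idx = pd.date_range("1933-01-31", periods=6, freq=pd.offsets.MonthEnd())
flows = pd.DataFrame(
    {"american_flow": [10.7887, 14.6115, 19.6236, 21.9739, 28.0054, 66.0632]},
    index=idx,
)

# Move the index one month-end forward; values stay on their rows
month_forward = flows.shift(1, freq=pd.offsets.MonthEnd())

# Move the index forward by twelve month-ends, i.e. one year
year_forward = flows.shift(12, freq=pd.offsets.MonthEnd())

# Average the monthly values within each calendar quarter
quarterly = flows.resample(pd.offsets.QuarterEnd()).mean()

print(month_forward.head(3))
print(year_forward.head(3))
print(quarterly)
```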
