Writing timeseries to a dataframe with an adjusting index: answers

【Question title】: Writing timeseries to a dataframe with an adjusting index
【Posted】: 2016-08-13 17:57:17
【Question description】:

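Suppose I have several csv files, each containing time series data indexed by date. Is there a way to create a single dataframe containing all the data, with the index adjusting for new dates that may not have been seen in previous files? For example, say I read in time series 1: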

03/01/2001  2.984
04/01/2001  3.016
05/01/2001  2.891
08/01/2001  2.527
09/01/2001  2.445
11/01/2001  2.648
12/01/2001  2.803
15/01/2001  2.943

The dataframe would look pretty much like the data above. But if I then read in another file, say time series 2:

02/01/2001  24.75
03/01/2001  24.35
04/01/2001  25.1
08/01/2001  23.5
09/01/2001  23.6
10/01/2001  24.5
11/01/2001  24.7
12/01/2001  24.4

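You can see that time series 1 has a value for 05/01/2001 that time series 2 does not. Time series 2 also has data points for 02/01/2001 and 10/01/2001. So is there a way to end up with the following: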

02/01/2001  null    24.75 ..etc
03/01/2001  2.984   24.35 ..etc
04/01/2001  3.016   25.1  ..etc
05/01/2001  2.891   null  ..etc
08/01/2001  2.527   23.5  ..etc
09/01/2001  2.445   23.6  ..etc
10/01/2001  null    24.5  ..etc
11/01/2001  2.648   24.7  ..etc
12/01/2001  2.803   24.4  ..etc
15/01/2001  2.943   null  ..etc

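where the index adjusts for the new dates, and any time series that has no data for a given day is set to null or some such value?

My code so far is fairly basic: I can walk through a directory, open the .csv files and read them into dataframes, but I don't know how to combine the dataframes in the manner described above.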

    def getTimeseriesData(DataPath,columnNum,startDate,endDate):
        #print('startDate: ',startDate,' endDate: ',endDate)
        colNames = ['date']

        path = DataPath
        print('DataPath: ',DataPath)
        filePath = path, "*.csv"
        allfiles = glob.glob(os.path.join(path, "*.csv"))
        for fname in allfiles:
            name = os.path.splitext(fname)[0]
            name = os.path.split(name)[1]

            colNames.append(name)

        dataframes = [pd.read_csv(fname, header=None,usecols=[0,columnNum]) for fname in allfiles]
        # not sure of the next bit

【Question discussion】:

Tags: python python-3.x pandas dataframe

【Solution 1】:

pd.concat can be used to concatenate DataFrames that have different indexes. For example,

df1 = pd.DataFrame({'A': list('ABCDE')}, index=range(5))
df2 = pd.DataFrame({'B': list('ABCDE')}, index=range(2,7))
pd.concat([df1, df2], axis=1)

yields

     A    B
0    A  NaN
1    B  NaN
2    C    A
3    D    B
4    E    C
5  NaN    D
6  NaN    E

Note that the indexes of df1 and df2 are aligned, and NaN is used wherever a value is missing.

So in your case, if you use

df = pd.read_csv(fname, header=None, usecols=[0, column_num], parse_dates=[0],
                 dayfirst=True, index_col=[0], names=['date', name])

then index_col=[0] makes the first column the index of the DataFrame, so that the later call

dfs = pd.concat(dfs, axis=1)

produces a single DataFrame in which all the DataFrames are aligned by date.

With data1.csv and data2.csv placed in ~/tmp/tmp,

import glob
import os
import pandas as pd

def get_timeseries_data(path, column_num):
    dfs = []
    # sort the file list so the column order is deterministic
    allfiles = sorted(glob.glob(os.path.join(path, "*.csv")))
    for fname in allfiles:
        # use the file name without directory or extension as the column name
        name = os.path.splitext(os.path.basename(fname))[0]
        df = pd.read_csv(fname, header=None, usecols=[0, column_num],
                         parse_dates=[0], dayfirst=True,
                         index_col=[0], names=['date', name])

        # aggregate rows with duplicate index by taking the mean
        df = df.groupby(level=0).agg('mean')

        # alternatively, drop rows with duplicate index
        # http://stackoverflow.com/a/34297689/190597 (n8yoder)
        # df = df[~df.index.duplicated(keep='first')]

        dfs.append(df)
    # align all frames on the date index; dates missing from a file become NaN
    return pd.concat(dfs, axis=1)

path = os.path.expanduser('~/tmp/tmp')
column_num = 1
dfs = get_timeseries_data(path, column_num)
print(dfs)

yields

            data1  data2
date                    
2001-01-02    NaN  24.75
2001-01-03  2.984  24.35
2001-01-04  3.016  25.10
2001-01-05  2.891    NaN
2001-01-08  2.527  23.50
2001-01-09  2.445  23.60
2001-01-10    NaN  24.50
2001-01-11  2.648  24.70
2001-01-12  2.803  24.40
2001-01-15  2.943    NaN

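【Discussion】:

I have implemented the code above, but I get the error 'InvalidIndexError: Reindexing only valid with uniquely valued Index objects'. Why is that?
The DataFrames in dfs may have more than one row associated with the same date. If so, you have to decide how to handle the duplicate dates. For example, you could simply drop all but one of the rows for each duplicated date. (I've edited the post above to show how.)
Alternatively, you could use groupby/agg to aggregate the rows with duplicate dates into a single row. Or, if you want the duplicate index entries to propagate across the multiple DataFrames, you would need pd.merge instead of pd.concat. That would require a solution that looks more like flyingmeatball's.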
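
To make the two de-duplication options concrete, here is a minimal sketch on made-up data (the frames s1 and s2 and their values are assumptions for illustration, not taken from the asker's files). Each option leaves a unique date index, which is what pd.concat needs in order to align the frames.

```python
import pandas as pd

# Hypothetical data: s1 has two rows for the same date (2001-01-03).
s1 = pd.DataFrame(
    {"data1": [2.0, 4.0, 3.0]},
    index=pd.to_datetime(["2001-01-03", "2001-01-03", "2001-01-04"]),
)
s2 = pd.DataFrame(
    {"data2": [24.35, 25.1]},
    index=pd.to_datetime(["2001-01-03", "2001-01-05"]),
)

# Option 1: keep only the first row for each duplicated date.
first = s1[~s1.index.duplicated(keep="first")]

# Option 2: collapse rows that share a date into their mean.
mean = s1.groupby(level=0).mean()

# Either result has a unique index, so the frames can be aligned by date.
combined_first = pd.concat([first, s2], axis=1).sort_index()
combined_mean = pd.concat([mean, s2], axis=1).sort_index()
print(combined_mean)
```

Which option is right depends on what the duplicates mean in your data: dropping rows discards information, while averaging assumes the repeated readings are comparable.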

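【Solution 2】:

Perhaps not the most elegant, but I would create a time series index running from the minimum date to the maximum date across all the csv files, call that dataframe df, and then do df['file1'] = pd.read_csv('file1.csv'). You would then have some rows that are all NaN, which you can filter out and drop.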
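
A rough sketch of this idea, under some assumptions: the in-memory strings in csv_texts stand in for real files, the names file1 and file2 are hypothetical, and each file is first turned into a date-indexed Series so that the column assignment can align on dates (assigning the raw result of pd.read_csv would not line up by date on its own).

```python
import io
import pandas as pd

# Hypothetical stand-ins for two CSV files (day-first dates, no header row).
csv_texts = {
    "file1": "03/01/2001,2.984\n04/01/2001,3.016\n08/01/2001,2.527\n",
    "file2": "02/01/2001,24.75\n03/01/2001,24.35\n08/01/2001,23.5\n",
}

# Read each "file" into a Series indexed by parsed dates.
series = {}
for name, text in csv_texts.items():
    raw = pd.read_csv(io.StringIO(text), header=None, names=["date", "value"])
    raw["date"] = pd.to_datetime(raw["date"], format="%d/%m/%Y")
    series[name] = raw.set_index("date")["value"]

# Build an empty frame spanning the earliest to the latest date seen.
start = min(s.index.min() for s in series.values())
end = max(s.index.max() for s in series.values())
df = pd.DataFrame(index=pd.date_range(start, end, freq="D"))

# Column assignment aligns each Series on the date index; gaps become NaN.
for name, s in series.items():
    df[name] = s

# Drop the days on which no file had any data.
df = df.dropna(how="all")
print(df)
```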

【Discussion】:

【Solution 3】:

Try something like this, using merge.

import pandas as pd

df1 = pd.DataFrame([['03/01/2001', 2.984], ['04/01/2001', 3.016], ['05/01/2001', 2.891], ['08/01/2001', 2.527],
                    ['09/01/2001', 2.445], ['11/01/2001', 2.648],
                    ['12/01/2001', 2.803], ['15/01/2001', 2.943]], columns=['date', 'field'])

df2 = pd.DataFrame([['02/01/2001', 24.75], ['03/01/2001', 24.35], ['04/01/2001', 25.1], ['08/01/2001', 23.5],
                    ['09/01/2001', 23.6], ['10/01/2001', 24.5], ['11/01/2001', 24.7], ['12/01/2001', 24.4]], columns=['date', 'field'])

# stand-ins for the files in your directory
files = [df1, df2]

fileNo = 1
for currFile in files:
    if fileNo == 1:
        df = currFile
    else:
        # rename returns a new DataFrame, so the result has to be assigned back
        currFile = currFile.rename(columns={'field': 'field_fromFile_' + str(fileNo)})
        # an outer merge keeps dates that appear in only one of the frames
        df = pd.merge(df, currFile, how='outer', on='date')
    fileNo = fileNo + 1
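
The same outer-merge idea can be written for any number of frames with functools.reduce. This is only a sketch under assumptions: the column names series1 and series2 are hypothetical, each frame is assumed to already have a uniquely named value column, and the dates are parsed first so that the final sort is chronological rather than alphabetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames, each with a 'date' column and its own value column.
df1 = pd.DataFrame({"date": ["03/01/2001", "04/01/2001", "05/01/2001"],
                    "series1": [2.984, 3.016, 2.891]})
df2 = pd.DataFrame({"date": ["02/01/2001", "03/01/2001", "04/01/2001"],
                    "series2": [24.75, 24.35, 25.1]})
frames = [df1, df2]

# Parse the day-first date strings so merging and sorting work on real dates.
for frame in frames:
    frame["date"] = pd.to_datetime(frame["date"], format="%d/%m/%Y")

# Fold the list into one frame, keeping every date seen in any input.
merged = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on="date"),
    frames,
)
merged = merged.sort_values("date").reset_index(drop=True)
print(merged)
```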

【Discussion】: