Split a dataframe into many sub-dataframes based on timestamp: answers

【Question title】: Split dataframe into many sub-dataframes based on timestamp
【Posted】: 2020-06-01 14:35:38
【Question】:

I have a large CSV in the following format:

timestamp,name,age
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
2020-03-01 00:00:10,nick
2020-03-01 00:00:12,john
2020-03-01 00:00:54,hank
2020-03-01 00:01:03,peter

I load the CSV into a dataframe:

df = pd.read_csv("/home/test.csv")

Then I want to create multiple dataframes, one for every 2 seconds. For example:

df1 contains:

2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john

df2 contains:

2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john

and so on.

I managed to split the timestamps with the following command:

full_idx = pd.date_range(start=df['timestamp'].min(), end = df['timestamp'].max(), freq ='0.2T')

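But how do I store these split dataframes? How can I split a dataset into multiple dataframes based on timestamp?

【Comments】:

How do you want to store it? In a dictionary?
I want to store them in a list.

Tags: python pandas python-2.7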

【Solution 1】:

This question may help: Pandas: Timestamp index rounding to the nearest 5th minute

import numpy as np
import pandas as pd

df = pd.read_csv("test.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])

ns2sec = 2 * 1000000000   # width of one bin: 2 seconds, in nanoseconds
# index of the 2-second bin each timestamp falls into (integer division rounds down)
timestamp_rounded = df['timestamp'].astype(np.int64) // ns2sec
# convert the bin index back to the timestamp at which that bin starts
df['full_idx'] = pd.to_datetime(timestamp_rounded * ns2sec)

# store one sub-dataframe for each unique bin start
store_array = []
for value in df['full_idx'].unique():
    store_array.append(df[df['full_idx'] == value][['timestamp', 'name', 'age']])

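【Discussion】:

Thanks for your answer. Why use .unique()?
@e7lT2P We have a new column whose values are 2020-03-01 00:00:00, 2020-03-01 00:00:02, 2020-03-01 00:00:04 and so on. We want to split your df exactly once per value, since your dict has only one df1, one df2 and so on. We do that by taking the unique values and filtering with df['full_idx']==value.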
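
As a hedged alternative to the unique-then-filter loop discussed above, the same split can be sketched with groupby. This is not the answerer's code: it assumes a pandas version that provides Series.dt.floor, and it rebuilds the question's rows inline (without the empty age column) so that it runs on its own.

```python
import pandas as pd

# Sample rows copied from the question; the empty "age" column is left out.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:04", "2020-03-01 00:00:05",
        "2020-03-01 00:00:10", "2020-03-01 00:00:12",
        "2020-03-01 00:00:54", "2020-03-01 00:01:03",
    ]),
    "name": ["nick", "john", "nick", "john", "peter",
             "john", "nick", "john", "hank", "peter"],
})

# Round every timestamp down to the start of its 2-second bin.
bin_start = df["timestamp"].dt.floor("2s")

# groupby visits each distinct bin start once, so no explicit unique() is needed.
store_array = [group for _, group in df.groupby(bin_start)]

print(len(store_array))
print(store_array[0])
```

Because groupby only produces groups that actually occur in the data, the resulting list contains no empty dataframes.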

【Solution 2】:

How about .resample()?

#first loading your data
>>> import pandas as pd
>>>
>>> df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True)
>>> df.head()
                      name  age
timestamp
2020-03-01 00:00:01   nick  NaN
2020-03-01 00:00:01   john  NaN
2020-03-01 00:00:02   nick  NaN
2020-03-01 00:00:02   john  NaN
2020-03-01 00:00:04  peter  NaN

#resampling it at a frequency of 2 seconds
>>> resampled = df.resample('2s')
>>> type(resampled)
<class 'pandas.core.resample.DatetimeIndexResampler'>

#iterating over the resampler object and storing the sliced dfs in a dictionary
>>> df_dict = {}
>>> for i, (timestamp, chunk) in enumerate(resampled):
...     df_dict[i] = chunk
...
>>> df_dict[0]
                     name  age
timestamp
2020-03-01 00:00:01  nick  NaN
2020-03-01 00:00:01  john  NaN

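Now for the explanation...

resample() is great for regrouping DataFrames by time (I use it a lot to downsample time series data), but it can also be used simply to cut up a DataFrame, as you want to do. Iterating over the resampler object produced by df.resample() yields tuples of (name of the bin, df corresponding to that bin): for example, the first tuple is (timestamp of the first bin, data for the first 2 seconds). So to get the DataFrames, we can loop over this object and store them somewhere, such as a dict.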

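Note that this produces one bin for every 2-second interval from the start to the end of the data, so many of the resulting dataframes are empty. You can add a step to filter those out if needed.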
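
A minimal sketch of such a filtering step, assuming the question's sample rows (rebuilt inline here, without the empty age column) and a list as the target container, which is what the asker requested:

```python
import pandas as pd

# Sample rows copied from the question, indexed by timestamp.
df = pd.DataFrame(
    {"name": ["nick", "john", "nick", "john", "peter",
              "john", "nick", "john", "hank", "peter"]},
    index=pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:04", "2020-03-01 00:00:05",
        "2020-03-01 00:00:10", "2020-03-01 00:00:12",
        "2020-03-01 00:00:54", "2020-03-01 00:01:03",
    ]),
)
df.index.name = "timestamp"

# Keep only the 2-second bins that actually contain rows.
non_empty = [chunk for _, chunk in df.resample("2s") if not chunk.empty]

print(len(non_empty))
```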

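Also, you could manually assign each sliced DataFrame to its own variable, but that would be cumbersome (you would need one line per 2-second bin instead of one small loop). With a dictionary, you can still associate each DataFrame with a name to look it up by. You could also use an OrderedDict, a list, or any other collection.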
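
One way to get meaningful lookup names, sketched here as an assumption rather than as part of the original answer, is to key the dictionary by the bin's start timestamp instead of a running counter. The variable name df_by_bin is hypothetical, and the sample rows are rebuilt inline.

```python
import pandas as pd

# Sample rows copied from the question, indexed by timestamp.
df = pd.DataFrame(
    {"name": ["nick", "john", "nick", "john", "peter",
              "john", "nick", "john", "hank", "peter"]},
    index=pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:04", "2020-03-01 00:00:05",
        "2020-03-01 00:00:10", "2020-03-01 00:00:12",
        "2020-03-01 00:00:54", "2020-03-01 00:01:03",
    ]),
)

# Map each non-empty bin's start timestamp to the rows that fall into it.
df_by_bin = {
    bin_start: chunk
    for bin_start, chunk in df.resample("2s")
    if not chunk.empty
}

# Look up one bin by the time at which it starts.
print(df_by_bin[pd.Timestamp("2020-03-01 00:00:04")])

# A plain list, if the names are not needed.
df_list = list(df_by_bin.values())
```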

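A few points about your script:

Setting freq to "0.2T" gives 12 seconds (0.2 * 60); you could use freq="2s" instead.
Your example df1 and df2 are "out of phase": one bins 2 seconds starting on an odd second (1-2 s), while the other starts on an even second (4-5 s). So the date_range you mention won't create these bins; it will create dfs for 0-1 s, 2-3 s, 4-5 s ... or for 1-2 s, 3-4 s, 5-6 s ... depending on the timestamp it starts from.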
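
Both points can be checked with a short sketch. It uses only the first and last timestamps from the question, and it writes the 0.2-minute step as pd.Timedelta(minutes=0.2) rather than with the "T" alias, which is an assumption made for portability across pandas versions.

```python
import pandas as pd

# "T" is the minute alias, so 0.2 of a minute is 12 seconds, not 2.
step = pd.Timedelta(minutes=0.2)
print(step)

start, end = "2020-03-01 00:00:01", "2020-03-01 00:01:03"

# Bin edges every 12 seconds versus every 2 seconds over the same span.
edges_12s = pd.date_range(start=start, end=end, freq="12s")
edges_2s = pd.date_range(start=start, end=end, freq="2s")
print(len(edges_12s), len(edges_2s))

# The edges start at the first timestamp, so here they fall on odd seconds.
print(edges_2s[:3])
```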

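For the latter point, you can use the base argument of .resample() to set the "phase" of the resampling. In the example above, base=0 starts the bins on even seconds and base=1 starts them on odd seconds.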
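
Note that base was deprecated in pandas 1.1 and removed in pandas 2.0 in favour of the offset and origin arguments. A minimal sketch of the same phase shift with offset, assuming pandas 1.1 or later and the question's sample rows rebuilt inline:

```python
import pandas as pd

# Sample rows copied from the question, indexed by timestamp.
df = pd.DataFrame(
    {"name": ["nick", "john", "nick", "john", "peter",
              "john", "nick", "john", "hank", "peter"]},
    index=pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:04", "2020-03-01 00:00:05",
        "2020-03-01 00:00:10", "2020-03-01 00:00:12",
        "2020-03-01 00:00:54", "2020-03-01 00:01:03",
    ]),
)

# offset="1s" shifts the 2-second bins so that they start on odd seconds.
chunks = [
    (label, chunk)
    for label, chunk in df.resample("2s", offset="1s")
    if not chunk.empty
]

first_label, first_chunk = chunks[0]
print(first_label)
print(first_chunk)
```

With this phase the first bin covers seconds 1 and 2, which matches the asker's df1, but seconds 4 and 5 then land in different bins, so it does not reproduce df2.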

This assumes you are fine with this type of binning. If you really want 1-2 s and 4-5 s in separate bins, I believe you will have to do something more complicated.

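【Discussion】:

df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True) seems to drop the timestamp column. How can I prevent that?
You can remove the index_col='timestamp' argument; that keeps the column in the df. But the data isn't gone: with index_col, timestamp becomes the DataFrame's index (the row labels), and you can still access the values with df.index. It is very common to keep time series data in the index, and resample() acts on the index automatically (though you can change that with the on parameter).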
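
A minimal sketch of the on parameter mentioned in this reply, assuming the timestamp is kept as an ordinary datetime column. The sample rows are rebuilt inline instead of being read from dates.csv.

```python
import pandas as pd

# Sample rows copied from the question, with timestamp as a regular column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:04", "2020-03-01 00:00:05",
        "2020-03-01 00:00:10", "2020-03-01 00:00:12",
        "2020-03-01 00:00:54", "2020-03-01 00:01:03",
    ]),
    "name": ["nick", "john", "nick", "john", "peter",
             "john", "nick", "john", "hank", "peter"],
})

# on="timestamp" tells resample() to bin by that column instead of the index.
chunks = [
    chunk
    for _, chunk in df.resample("2s", on="timestamp")
    if not chunk.empty
]

# Each chunk still carries the timestamp column.
print(chunks[0])
```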