Convert/reshape a dataset from 'wide to long' format and convert the time column into a time format for time-series analysis

【Question title】: Convert/reshape a dataset from 'wide to long' format and convert the time column into a time format for time-series analysis
【Posted】: 2021-05-17 23:04:55
【Question description】:

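I have a dataset with 7 columns: level, Time_30, Time_60, Time_90, Time_120, Time_150 and Time_180.

My main goal is to run time-series anomaly detection on cell counts recorded every 30 minutes.

I want to do the following data-preparation steps:

(I) Melt/reshape df into a proper time-series format (wide to long): merge the columns Time_30, Time_60, ..., Time_180 into a single column time with 6 levels (30, 60, ..., 180).

(II) Since the result of (I) is 30, 60, ..., 180, I want to convert the time column into a proper time or date format for time series (something like '%H:%M:%S').

(III) Plot a time-series chart for each level (A, B, ..., F) using a for loop, for comparison.

(IV) Anomaly detection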

# generate/import dataset
import pandas as pd 

df = pd.DataFrame({'level':[A,B,C,D,E,F], 
       'Time_30':[1993.05,1999.45, 2001.11, 2007.39, 2219.77],
       'Time_60':[2123.15,2299.59, 2339.19, 2443.37, 2553.15],
       'Time_90':[2323.56,2495.99,2499.13, 2548.71, 2656.0],
       'Time_120':[2355.52,2491.19,2519.92,2611.81, 2753.11],
       'Time_150':[2425.31,2599.51, 2539.9, 2713.77, 2893.58],
       'Time_180':[2443.35,2609.92, 2632.49, 2774.03, 2901.25]} )

Desired result

# first series
level, time, count
A, 30, 1993.05
B, 60, 2123.15
C, 90, 2323.56
D, 120, 2355.52
E, 150, 2425.31
F, 180, 2443.35 

# 2nd series 
level,time,count 
A,30,1999.45
B,60,2299.59
C,90,2495.99
D,120,2491.19
E,150,2599.51
F,180,2609.92

.
.
.
.
# up until the last series

My attempt is below


# (I)
df1 = pd.melt(df,id_vars = ['level'],var_name = 'time',value_name = 'count') #

# (II)

df1['time'] = pd.to_datetime(df1['time'],format= '%H:%M:%S' ).dt.time

OR

df1['time'] = pd.to_timedelta(df1['time'], unit='m')


# (III)

plt.figure(figsize=(10,5))
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.3)

# Perform STL Decomp
stl = STL(df1)
result = stl.fit()

seasonal, trend, resid = result.seasonal, result.trend, result.resid

plt.figure(figsize=(8,6))

plt.subplot(4,1,1)
plt.plot(df1)
plt.title('Original Series', fontsize=16)

plt.subplot(4,1,2)
plt.plot(trend)
plt.title('Trend', fontsize=16)

plt.subplot(4,1,3)
plt.plot(seasonal)
plt.title('Seasonal', fontsize=16)

plt.subplot(4,1,4)
plt.plot(resid)
plt.title('Residual', fontsize=16)

plt.tight_layout()

estimated = trend + seasonal
plt.figure(figsize=(12,4))
plt.plot(df1)
plt.plot(estimated)

plt.figure(figsize=(10,4))
plt.plot(resid)

# Anomaly detection 

resid_mu = resid.mean()
resid_dev = resid.std()

lower = resid_mu - 3*resid_dev
upper = resid_mu + 3*resid_dev

anomalies = df1[(resid < lower) | (resid > upper)] # returns the datapoints with the anomalies
anomalies


plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.6)
    
plt.scatter(anomalies.index, anomalies.count, color='r', marker='D')

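Please note: even if you can only attempt I and/or II, it would be greatly appreciated.

【Question comments】:

There seem to be a few errors above. In your example dataframe, A, B, C... need to be 'A', 'B', 'C', ..., and you have six items in the level column but only 5 in the other columns. Second, should your desired result for the first series run from A to F, or should every row be A?

Tags: python pandas plot time-series

【Solution 1】:

Following my comment above, I made a few small modifications to your example dataframe: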

import pandas as pd 

df = pd.DataFrame({'level':['A','B','C','D','E'], 
       'Time_30':[1993.05,1999.45, 2001.11, 2007.39, 2219.77],
       'Time_60':[2123.15,2299.59, 2339.19, 2443.37, 2553.15],
       'Time_90':[2323.56,2495.99,2499.13, 2548.71, 2656.0],
       'Time_120':[2355.52,2491.19,2519.92,2611.81, 2753.11],
       'Time_150':[2425.31,2599.51, 2539.9, 2713.77, 2893.58],
       'Time_180':[2443.35,2609.92, 2632.49, 2774.03, 2901.25]} )

First, convert the Time_* column names into integer values:

timecols = [int(c.replace("Time_","")) for c in df.columns if c != 'level']
df.columns = ['level'] + timecols

After that, you can use pd.melt() just as you intended, producing a dataframe in which all of the "series" you mentioned above are concatenated together:

df1 = df.melt(id_vars=['level'], value_vars=timecols, var_name='time', value_name='count').sort_values(['level','time']).reset_index(drop=True)

print(df1.head(10))
  level time    count
0     A   30  1993.05
1     A   60  2123.15
2     A   90  2323.56
3     A  120  2355.52
4     A  150  2425.31
5     A  180  2443.35
6     B   30  1999.45
7     B   60  2299.59
8     B   90  2495.99
9     B  120  2491.19
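As an alternative to renaming the columns by hand, the same reshape can be sketched with pd.wide_to_long, which strips the "Time_" prefix and parses the numeric suffix in one step. This is only a sketch under the assumption that every wide column follows the Time_<minutes> pattern and that level uniquely identifies each row; the name long_df is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

# Same corrected example data as in the answer above.
df = pd.DataFrame({
    "level": ["A", "B", "C", "D", "E"],
    "Time_30": [1993.05, 1999.45, 2001.11, 2007.39, 2219.77],
    "Time_60": [2123.15, 2299.59, 2339.19, 2443.37, 2553.15],
    "Time_90": [2323.56, 2495.99, 2499.13, 2548.71, 2656.0],
    "Time_120": [2355.52, 2491.19, 2519.92, 2611.81, 2753.11],
    "Time_150": [2425.31, 2599.51, 2539.9, 2713.77, 2893.58],
    "Time_180": [2443.35, 2609.92, 2632.49, 2774.03, 2901.25],
})

# stubnames="Time" with sep="_" matches Time_30, Time_60, ...;
# the numeric suffix becomes the new integer column "time".
long_df = (
    pd.wide_to_long(df, stubnames="Time", i="level", j="time", sep="_")
    .rename(columns={"Time": "count"})   # values land in a column named after the stub
    .reset_index()                       # turn the (level, time) index back into columns
    .sort_values(["level", "time"])
    .reset_index(drop=True)
)

print(long_df.head(6))
```

The melt approach in the answer gives the same long table; wide_to_long mainly saves the manual column renaming when the column names share a common prefix.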

If you want to loop over the levels, select them like this:

for level in df1['level'].unique():
    tmp = df1[df1['level']==level]

or

for level in df1['level'].unique():
    tmp = df1[df1['level']==level].copy()

...if you intend to modify or add data to the tmp dataframe.
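For step (III) of the question, the per-level loop can also be written with groupby, which yields each level together with its rows. The following is a minimal sketch, not part of the original answer: it uses a small stand-in for df1 with only two levels, and the Agg backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the long-format df1 (two levels, three time points each).
df1 = pd.DataFrame({
    "level": ["A"] * 3 + ["B"] * 3,
    "time": [30, 60, 90] * 2,
    "count": [1993.05, 2123.15, 2323.56, 1999.45, 2299.59, 2495.99],
})

fig, ax = plt.subplots(figsize=(10, 5))

# groupby yields (level value, sub-dataframe) pairs, one per level.
for level, tmp in df1.groupby("level"):
    # Use tmp["count"], not tmp.count: the attribute is the DataFrame.count method.
    ax.plot(tmp["time"], tmp["count"], marker="o", label=level)

ax.set_xlabel("time (minutes)")
ax.set_ylabel("count")
ax.legend(title="level")
```

In an interactive session you would drop the matplotlib.use line and call plt.show() at the end to display the figure.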

As for creating timestamps, you can do this:

df1['time'] = pd.to_timedelta(df1['time'], unit='min')

...just as you tried, but it depends on how you plan to use it. If you only want strings that look like "00:30:00" and so on, you can try the following:

df1['time'] = pd.to_timedelta(df1['time'], unit='min').apply(lambda x:str(x)[-8:])
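If a real datetime column is needed rather than elapsed time or strings, one possible route is to add the timedeltas to a reference timestamp. The sketch below assumes a start of 2021-01-01 00:00:00, which is purely hypothetical since the question does not say when the measurements began; the names minutes, elapsed, stamps and labels are also introduced here for illustration only.

```python
import pandas as pd

# The six time points from the question, in minutes.
minutes = pd.Series([30, 60, 90, 120, 150, 180], name="time")

# Elapsed time as timedelta64 values (0 days 00:30:00, 0 days 01:00:00, ...).
elapsed = pd.to_timedelta(minutes, unit="min")

# Hypothetical experiment start; replace with the real start time if known.
start = pd.Timestamp("2021-01-01 00:00:00")

# Timestamp + timedelta Series gives a datetime64 Series.
stamps = start + elapsed

# Optional: '%H:%M:%S' strings for display only.
labels = stamps.dt.strftime("%H:%M:%S")

print(labels.tolist())
```

Keeping the datetime64 (or timedelta64) values in the dataframe and formatting to strings only for display tends to work better with plotting and time-series tools than storing the strings themselves.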

In any case, I hope this puts you on the right track.

【Comments】:

@Rick M Thank you very much.