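【Posted】: 2021-12-29 15:39:06
【Question】:
I have a time-series dataset with multiple IDs and multiple variables. Each variable has 3 time-series entries: "Baseline", "3 Month", and "6 Month". The data frame is structured like this, df =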
import pandas as pd
data = {'Patient ID': [11111, 11111, 11111, 11111, 22222, 22222, 22222, 22222, 33333, 33333, 33333, 33333, 44444, 44444, 44444, 44444, 55555, 55555, 55555, 55555],
'Lab Attribute': ['% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)'],
'Baseline': [46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0],
'3 Month': [23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0],
'6 Month': [34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0]}
df = pd.DataFrame(data)
    Patient ID       Lab Attribute  Baseline  3 Month  6 Month
0        11111  % Saturation- Iron      46.0     23.0     34.0
1        11111            ALK PHOS      94.0     82.0     65.0
2        11111           ALT(SGPT)      21.0     13.0     10.0
3        11111          AST (SGOT)      18.0     17.0     14.0
4        22222  % Saturation- Iron      46.0     23.0     34.0
5        22222            ALK PHOS      94.0     82.0     65.0
6        22222           ALT(SGPT)      21.0     13.0     10.0
7        22222          AST (SGOT)      18.0     17.0     14.0
8        33333  % Saturation- Iron      46.0     23.0     34.0
9        33333            ALK PHOS      94.0     82.0     65.0
10       33333           ALT(SGPT)      21.0     13.0     10.0
11       33333          AST (SGOT)      18.0     17.0     14.0
12       44444  % Saturation- Iron      46.0     23.0     34.0
13       44444            ALK PHOS      94.0     82.0     65.0
14       44444           ALT(SGPT)      21.0     13.0     10.0
15       44444          AST (SGOT)      18.0     17.0     14.0
16       55555  % Saturation- Iron      46.0     23.0     34.0
17       55555            ALK PHOS      94.0     82.0     65.0
18       55555           ALT(SGPT)      21.0     13.0     10.0
19       55555          AST (SGOT)      18.0     17.0     14.0
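What I want to do is group by ID and lab attribute and create one plot per "Lab Attribute" (% Saturation- Iron, ALK PHOS, etc.), with each plot including all patient IDs.
So, based on the sample data, there would be 4 plots (% Saturation- Iron, ALK PHOS, etc.), and each plot would contain 5 traces (1 per ID).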
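For illustration, the layout described above could be sketched roughly as follows. This is only one possible approach under stated assumptions, not a confirmed solution: it uses a two-patient, two-attribute subset of the sample data, the `time_cols` and `wide` names are made up for the example, and the headless `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Subset of the sample data above (2 patients, 2 lab attributes).
df = pd.DataFrame({
    "Patient ID": [11111, 11111, 22222, 22222],
    "Lab Attribute": ["ALK PHOS", "ALT(SGPT)", "ALK PHOS", "ALT(SGPT)"],
    "Baseline": [94.0, 21.0, 94.0, 21.0],
    "3 Month": [82.0, 13.0, 82.0, 13.0],
    "6 Month": [65.0, 10.0, 65.0, 10.0],
})
time_cols = ["Baseline", "3 Month", "6 Month"]  # hypothetical helper name

# One subplot per lab attribute.
groups = df.groupby("Lab Attribute")
fig, axes = plt.subplots(1, groups.ngroups,
                         figsize=(5 * groups.ngroups, 4), squeeze=False)

for ax, (attribute, group) in zip(axes.ravel(), groups):
    # Index by patient, keep only the time columns, then transpose so the
    # time points run along the x-axis and each patient becomes one line.
    wide = group.set_index("Patient ID")[time_cols].T
    wide.plot(ax=ax, marker="o", title=attribute)
    ax.set_xlabel("Time point")
    ax.set_ylabel("Value")

fig.tight_layout()
```

Because the sample values are identical for every patient, the lines in each subplot overlap exactly; with real data each patient would show up as a separate trace.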
I tried using groupby as described in this post - Creating a time-series plot with data in long format in python?
However, it just plots everything on a single chart.
Here is my current code:
df_labs = pd.read_csv("/Users/johnconor/Documents/Python/gut_microbiome/out/nw_labs_up_to_6mon.csv")
df_labs = df_labs.fillna(method='ffill')
dfl = df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().plot()
The result looks like this:
[![enter image description here][1]][1]
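Part of my problem is that every example I can find uses long-format data with only 1 value column, not values spread across several time-point columns.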
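For context, a wide table like this one can usually be reshaped into that single-value-column form with `DataFrame.melt`. The snippet below is a minimal sketch on a subset of the sample data; the column names `Time Point` and `Value` are hypothetical choices made for the example.

```python
import pandas as pd

# Subset of the sample data above (2 patients, 2 lab attributes).
df = pd.DataFrame({
    "Patient ID": [11111, 11111, 22222, 22222],
    "Lab Attribute": ["ALK PHOS", "ALT(SGPT)", "ALK PHOS", "ALT(SGPT)"],
    "Baseline": [94.0, 21.0, 94.0, 21.0],
    "3 Month": [82.0, 13.0, 82.0, 13.0],
    "6 Month": [65.0, 10.0, 65.0, 10.0],
})

# Wide -> long: the three time columns collapse into one label column
# and one value column, giving one row per (patient, attribute, time point).
long_df = df.melt(
    id_vars=["Patient ID", "Lab Attribute"],
    value_vars=["Baseline", "3 Month", "6 Month"],
    var_name="Time Point",  # hypothetical column name
    value_name="Value",     # hypothetical column name
)
print(long_df.head())
```

Once the data is in this shape, the long-format examples that expect a single value column should apply more directly.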
I also tried adapting the multiple-plot approach from this post - Creating a time-series plot with data in long format in python?
n_ids = df_labs.Patient_ID.unique().size
n_cols = int(n_ids ** 0.5)
n_rows = n_cols + (1 if n_ids % n_cols else 0)
fig, axes = plt.subplots(n_rows, n_cols)
axes = axes.ravel()
for i, (id, att, base,three,six) in enumerate(df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().reset_index()):
print(idx)
series.plot(ax=axes[i], title=f"ID:{idx}")
fig.tight_layout()
However, I ran into a problem, because that approach is again designed for only a single set of values. It produces this error:
ValueError: too many values to unpack (expected 5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-40-42cf5bc14bdb> in <module>
4 fig, axes = plt.subplots(n_rows, n_cols)
5 axes = axes.ravel()
----> 6 for i, (id, att, base,three,six) in enumerate(df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().reset_index()):
7 print(idx)
8 series.plot(ax=axes[i], title=f"ID:{idx}")
ValueError: too many values to unpack (expected 5)
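A likely explanation for this error is that iterating over a DataFrame yields its column labels rather than its rows, so the loop tries to unpack the 10-character string 'Patient_ID' into five names. The sketch below illustrates this on a two-row subset of the sample data (using the underscored column names from the CSV code above) and shows `itertuples` as one possible way to iterate row by row; the variable names are made up for the example.

```python
import pandas as pd

# Two rows of the sample data, with the column names used in the CSV code.
df = pd.DataFrame({
    "Patient_ID": [11111, 11111],
    "Lab_Attribute": ["ALK PHOS", "ALT(SGPT)"],
    "Baseline": [94.0, 21.0],
    "3 Month": [82.0, 13.0],
    "6 Month": [65.0, 10.0],
})

# Iterating a DataFrame directly walks over column labels, not rows.
labels = list(df)

# itertuples(index=False, name=None) yields one plain tuple per row,
# and each tuple has exactly five fields to unpack.
rows = []
for patient_id, attribute, base, three, six in df.itertuples(index=False, name=None):
    rows.append((patient_id, attribute, base, three, six))

print(labels)
print(rows)
```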
【Comments】:
Tags: python pandas matplotlib time-series seaborn