How to convert data from wide to long so values are plotted against time: answers

【Question title】: How to convert data from wide to long so values are plotted against time
【Posted】: 2021-12-29 15:39:06
【Question description】:

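I have a time-series dataset with multiple IDs and multiple variables. Each variable has 3 time-series entries: "Baseline", "3 Month" and "6 Month". The dataframe is structured like this, df =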

import pandas as pd

data = {'Patient ID': [11111, 11111, 11111, 11111, 22222, 22222, 22222, 22222, 33333, 33333, 33333, 33333, 44444, 44444, 44444, 44444, 55555, 55555, 55555, 55555],
        'Lab Attribute': ['% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)'],
        'Baseline': [46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0],
        '3 Month': [23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0],
        '6 Month': [34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0]}
df = pd.DataFrame(data)

    Patient ID       Lab Attribute  Baseline  3 Month  6 Month
0        11111  % Saturation- Iron      46.0     23.0     34.0
1        11111            ALK PHOS      94.0     82.0     65.0
2        11111           ALT(SGPT)      21.0     13.0     10.0
3        11111          AST (SGOT)      18.0     17.0     14.0
4        22222  % Saturation- Iron      46.0     23.0     34.0
5        22222            ALK PHOS      94.0     82.0     65.0
6        22222           ALT(SGPT)      21.0     13.0     10.0
7        22222          AST (SGOT)      18.0     17.0     14.0
8        33333  % Saturation- Iron      46.0     23.0     34.0
9        33333            ALK PHOS      94.0     82.0     65.0
10       33333           ALT(SGPT)      21.0     13.0     10.0
11       33333          AST (SGOT)      18.0     17.0     14.0
12       44444  % Saturation- Iron      46.0     23.0     34.0
13       44444            ALK PHOS      94.0     82.0     65.0
14       44444           ALT(SGPT)      21.0     13.0     10.0
15       44444          AST (SGOT)      18.0     17.0     14.0
16       55555  % Saturation- Iron      46.0     23.0     34.0
17       55555            ALK PHOS      94.0     82.0     65.0
18       55555           ALT(SGPT)      21.0     13.0     10.0
19       55555          AST (SGOT)      18.0     17.0     14.0

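What I want to do is group by ID and Lab Attribute and create one plot per "Lab Attribute" (% Saturation- Iron, ALK PHOS, and so on), with each plot including all Patient IDs.

So, with the sample data, there would be 4 plots (% Saturation- Iron, ALK PHOS, etc.), and each plot would contain 5 traces (1 per ID).

I tried using groupby as described in this post - Creating a time-series plot with data in long format in python?

However, it just plots everything on a single chart.

This is my code so far: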

df_labs = pd.read_csv("/Users/johnconor/Documents/Python/gut_microbiome/out/nw_labs_up_to_6mon.csv")
df_labs = df_labs.fillna(method='ffill')

dfl = df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().plot()

This is the result:

[![enter image description here][1]][1]

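Part of my problem is that every example I can find has long-format data with only 1 value column, not values across several time points.

I also tried adapting the multiple-plot approach from the same post - Creating a time-series plot with data in long format in python?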

n_ids = df_labs.Patient_ID.unique().size
n_cols = int(n_ids ** 0.5)
n_rows = n_cols + (1 if n_ids % n_cols else 0)                   
fig, axes = plt.subplots(n_rows, n_cols)
axes = axes.ravel()
for i, (id, att, base,three,six) in enumerate(df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().reset_index()):
    print(idx)
    series.plot(ax=axes[i], title=f"ID:{idx}")
fig.tight_layout()

However, I ran into a problem because, again, it is designed for a single set of values. It produces this error:

ValueError: too many values to unpack (expected 5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-42cf5bc14bdb> in <module>
      4 fig, axes = plt.subplots(n_rows, n_cols)
      5 axes = axes.ravel()
----> 6 for i, (id, att, base,three,six) in enumerate(df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().reset_index()):
      7     print(idx)
      8     series.plot(ax=axes[i], title=f"ID:{idx}")

ValueError: too many values to unpack (expected 5)

【Question discussion】:

Tags: python pandas matplotlib time-series seaborn

【Solution 1】:

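The dataframe should be reshaped to long format with .melt, which allows the months to be used as the time axis.
The easiest way to create the visualization is seaborn.relplot with kind='line'.
- Change col, row and/or hue to adjust how the data is grouped. Don't change x and y.
To prevent the y-axis from being shared, see Prevent Sharing of Y Axes in Seaborn Relplot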

import pandas as pd
import seaborn as sns

# reshape the dataframe
dfm = df.melt(id_vars=['Patient ID', 'Lab Attribute'], var_name='Months')

# change the Months values to numeric
dfm.Months = dfm.Months.map({'Baseline': 0, '3 Month': 3, '6 Month': 6})

# display(dfm.head())
   Patient ID       Lab Attribute  Months  value
0       11111  % Saturation- Iron       0   46.0
1       11111            ALK PHOS       0   94.0
2       11111           ALT(SGPT)       0   21.0
3       11111          AST (SGOT)       0   18.0
4       22222  % Saturation- Iron       0   46.0

# plot a figure level line plot with seaborn
p = sns.relplot(data=dfm, col='Lab Attribute', x='Months', y='value', hue='Patient ID', kind='line', col_wrap=4, marker='o', palette='husl')

Because the sample values are identical for every patient, the lines are stacked on top of one another.

Use seaborn.catplot with kind='bar' for a bar plot visualization.

p = sns.catplot(data=dfm, col='Lab Attribute', x='Months', y='value', hue='Patient ID', kind='bar', col_wrap=4)
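For readers who would rather stay with the groupby approach from the question, the following is a minimal matplotlib-only sketch, not the method used in this answer: it melts the frame the same way, then loops over `groupby('Lab Attribute')` to draw one subplot per lab with one line per patient. The sample dataframe and the figure size are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical stand-in for the question's frame: 2 patients, 2 lab attributes
df = pd.DataFrame({
    'Patient ID': [11111, 11111, 22222, 22222],
    'Lab Attribute': ['ALK PHOS', 'ALT(SGPT)', 'ALK PHOS', 'ALT(SGPT)'],
    'Baseline': [94.0, 21.0, 90.0, 25.0],
    '3 Month': [82.0, 13.0, 85.0, 19.0],
    '6 Month': [65.0, 10.0, 70.0, 12.0],
})

# wide -> long, then map the time labels to numeric months
dfm = df.melt(id_vars=['Patient ID', 'Lab Attribute'], var_name='Months')
dfm['Months'] = dfm['Months'].map({'Baseline': 0, '3 Month': 3, '6 Month': 6})

labs = dfm['Lab Attribute'].unique()
fig, axes = plt.subplots(1, len(labs), figsize=(4 * len(labs), 3), squeeze=False)

# outer loop: one subplot per lab attribute; inner loop: one line per patient
for ax, (lab, lab_df) in zip(axes.ravel(), dfm.groupby('Lab Attribute')):
    for pid, pid_df in lab_df.groupby('Patient ID'):
        pid_df = pid_df.sort_values('Months')
        ax.plot(pid_df['Months'], pid_df['value'], marker='o', label=str(pid))
    ax.set_title(lab)
    ax.set_xlabel('Months')
    ax.set_ylabel('value')

axes.ravel()[0].legend(title='Patient ID')
fig.tight_layout()
```

The key difference from the attempt in the question is that the loop iterates over the groups of the long-format frame, so each iteration yields a group key and a sub-dataframe rather than column names.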

【Discussion】: