Stackplot with matplotlib and a grouped Pandas dataframe: answers

【Question title】: Stackplot with matplotlib and a grouped Pandas dataframe
【Posted】: 2020-04-14 09:21:15
【Question description】:

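The data I am working with is a log of conversation messages. I have a Pandas dataframe with a datestamp as the index and two columns: one for "sender" and one for "message".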

I simply want to plot a stackplot of the messages over time. I don't actually need the content of the messages, so I have cleaned the data as follows:

Dummy data:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'date': [
    Timestamp('2019-07-29 19:58:00'), Timestamp('2019-07-29 20:03:00'),
    Timestamp('2019-08-01 19:22:00'), Timestamp('2019-08-01 19:23:00'),
    Timestamp('2019-08-01 19:25:00'), Timestamp('2019-08-04 11:28:00'),
    Timestamp('2019-08-04 11:29:00'), Timestamp('2019-08-04 11:29:00'),
    Timestamp('2019-08-04 12:43:00'), Timestamp('2019-08-04 12:49:00'),
    Timestamp('2019-08-04 12:51:00'), Timestamp('2019-08-04 12:51:00'),
    Timestamp('2019-08-25 22:33:00'), Timestamp('2019-08-27 11:55:00'),
    Timestamp('2019-08-27 18:35:00'), Timestamp('2019-11-06 18:53:00'),
    Timestamp('2019-11-06 18:54:00'), Timestamp('2019-11-06 20:42:00'),
    Timestamp('2019-11-07 00:16:00'), Timestamp('2019-11-07 15:24:00'),
    Timestamp('2019-11-07 16:06:00'), Timestamp('2019-11-08 11:48:00'),
    Timestamp('2019-11-08 11:53:00'), Timestamp('2019-11-08 11:55:00'),
    Timestamp('2019-11-08 11:55:00'), Timestamp('2019-11-08 11:59:00'),
    Timestamp('2019-11-08 12:03:00'), Timestamp('2019-12-24 13:40:00'),
    Timestamp('2019-12-24 13:42:00'), Timestamp('2019-12-24 13:43:00'),
    Timestamp('2019-12-24 13:44:00'), Timestamp('2019-12-24 13:44:00')],
    'sender': [
    'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2',
    'Person 1', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2',
    'Person 2', 'Person 2', 'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2',
    'Person 2', 'Person 1', 'Person 2', 'Person 2', 'Person 1', 'Person 2', 'Person 2',
    'Person 1', 'Person 2', 'Person 1', 'Person 2'],
    'message': [
    'Hello', 'Hi there', "How's things", 'good', 'I am glad', 'Me too.',
    'Then we are both glad', 'Indeed we are.',
    'I sure hope this is enough fake conversation for stackoverflow.',
    'Better write a few more messages just in case',
    "But the message content isn't relevant", 'Oh yeah.', "I'm going to stop now.",
    'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted',
    'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted',
    'redacted', 'redacted', 'redacted', 'redacted', 'redacted']})

# resample() needs a DatetimeIndex, so use the date column as the index
df = df.set_index('date')

dfgrouped = df.groupby(["sender"])
dfgrouped[["sender"]].resample("D").count()

This gives a dataframe grouped by each sender in the conversation, with the DateTime as the index and the number of messages sent on each date.

dfgrouped[["sender"]].get_group("Joe Bloggs").resample("D").count()

... will give a dataframe with just one user and their message count per day.

I would like to know how to plot a stackplot with matplotlib in which each "sender" is a different line. I have not been able to get it to work either with

ax.stackplot(dfgrouped[["sender"]].resample("D").count())

or with a loop:

for sender in df["sender"].unique():
    axs[i].stackplot(dfgrouped[["sender"]].get_group(sender).resample("D").count())

【Question comments】:

It would help if you provided some mock data; in particular, have a look at How to make good reproducible pandas examples
Thanks, I have added some dummy data.

Tags: python pandas matplotlib

【Solution 1】:

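You can use pandas' own stackplot function, df.plot.area(). It is a wrapper around the Matplotlib function that is available as a method on DataFrames. You only need to get the data into the right shape, with one column per sender and one row per day. With your groupby and count operations you were almost there: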

import pandas as pd

df = pd.DataFrame({'sender': [
    'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 
    'Person 1', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 
    'Person 2', 'Person 2', 'Person 2', 'Person 1', 'Person 2', 'Person 1', 'Person 2', 
    'Person 2', 'Person 1', 'Person 2', 'Person 2', 'Person 1', 'Person 2', 'Person 2', 
    'Person 1', 'Person 2', 'Person 1', 'Person 2'], 
    'message': [
    'Hello', 'Hi there', "How's things", 'good', 'I am glad', 'Me too.', 
    'Then we are both glad', 'Indeed we are.', 
    'I sure hope this is enough fake conversation for stackoverflow.', 
    'Better write a few more messages just in case', 
    "But the message content isn't relevant", 'Oh yeah.', "I'm going to stop now.", 
    'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 
    'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 'redacted', 
    'redacted', 'redacted', 'redacted', 'redacted', 'redacted']}, 
    index = pd.DatetimeIndex([
    pd.Timestamp('2019-07-29 19:58:00'), pd.Timestamp('2019-07-29 20:03:00'), 
    pd.Timestamp('2019-08-01 19:22:00'), pd.Timestamp('2019-08-01 19:23:00'),
    pd.Timestamp('2019-08-01 19:25:00'), pd.Timestamp('2019-08-04 11:28:00'), 
    pd.Timestamp('2019-08-04 11:29:00'), pd.Timestamp('2019-08-04 11:29:00'), 
    pd.Timestamp('2019-08-04 12:43:00'), pd.Timestamp('2019-08-04 12:49:00'), 
    pd.Timestamp('2019-08-04 12:51:00'), pd.Timestamp('2019-08-04 12:51:00'), 
    pd.Timestamp('2019-08-25 22:33:00'), pd.Timestamp('2019-08-27 11:55:00'), 
    pd.Timestamp('2019-08-27 18:35:00'), pd.Timestamp('2019-11-06 18:53:00'), 
    pd.Timestamp('2019-11-06 18:54:00'), pd.Timestamp('2019-11-06 20:42:00'), 
    pd.Timestamp('2019-11-07 00:16:00'), pd.Timestamp('2019-11-07 15:24:00'), 
    pd.Timestamp('2019-11-07 16:06:00'), pd.Timestamp('2019-11-08 11:48:00'), 
    pd.Timestamp('2019-11-08 11:53:00'), pd.Timestamp('2019-11-08 11:55:00'), 
    pd.Timestamp('2019-11-08 11:55:00'), pd.Timestamp('2019-11-08 11:59:00'), 
    pd.Timestamp('2019-11-08 12:03:00'), pd.Timestamp('2019-12-24 13:40:00'), 
    pd.Timestamp('2019-12-24 13:42:00'), pd.Timestamp('2019-12-24 13:43:00'), 
    pd.Timestamp('2019-12-24 13:44:00'), pd.Timestamp('2019-12-24 13:44:00')]))

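# Group by sender, then count each sender's messages per calendar day
df_group = df.groupby(["sender"])
df_count = df_group[["sender"]].resample("D").count()

# Put each sender's daily counts side by side as columns, aligned on the date
df_plot = pd.concat([df_count.loc['Person 1', :],
                     df_count.loc['Person 2', :]],
                    axis=1)
df_plot.columns = ['Sender 1', 'Sender 2']

# Stacked area chart with one layer per column
df_plot.plot.area()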
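
If you would rather call Matplotlib's ax.stackplot directly, as in the question, the same reshaping idea can be sketched as below. The attempts in the question pass only a dataframe, whereas stackplot expects the x values first, followed by one row of y values per layer. This is a rough sketch rather than a tested drop-in: the five-row frame and the names log, daily and layers are made up for illustration, and unstack is used here in place of the concat above so that the sender names do not have to be typed out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the message log, same layout as the data above.
log = pd.DataFrame(
    {"sender": ["Person 2", "Person 1", "Person 2", "Person 1", "Person 2"],
     "message": ["a", "b", "c", "d", "e"]},
    index=pd.DatetimeIndex([
        "2019-07-29 19:58", "2019-07-29 20:03", "2019-08-01 19:22",
        "2019-08-01 19:23", "2019-08-01 19:25"]),
)

# Count messages per (day, sender), then pivot the senders into columns.
daily = (
    log.groupby([log.index.floor("D"), "sender"])
       .size()
       .unstack("sender", fill_value=0)
       .asfreq("D", fill_value=0)  # add zero rows for days without messages
)

fig, ax = plt.subplots()
# x values first, then a 2D array with one row per sender (hence the transpose).
layers = ax.stackplot(daily.index, daily.to_numpy().T, labels=list(daily.columns))
ax.legend(loc="upper left")
ax.set_ylabel("messages per day")
```

With the real data, replace log by the dataframe from the question; in an interactive session, drop the matplotlib.use line and finish with plt.show().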

【Comments】: