Pandas grouping and resampling for a bar plot: answers

【Question title】: Pandas grouping and resampling for a bar plot
【Posted】: 2020-02-03 21:15:19
【Question description】:

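I have a dataframe that records concentrations at several different locations across different years, at a high temporal frequency (

To calculate the mean concentration, I have to apply quality-control filters to the daily and monthly data.

My approach is to apply the filters first and resample to yearly values, then group by location and year.

Also, out of all the locations (in the column titled location), I only need a few. So I am slicing the original dataframe and creating a new dataframe that contains only the selected rows.

I have not been able to achieve this with the following code: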

date=df['date']
location = df['location']
df.date = pd.to_datetime(df.date)
year=df.date.dt.year
df=df.set_index(date)


df['Year'] = df['date'].map(lambda x: x.year )

#Location name selection/correction in each city:
#Changing all stations:
df['location'] = df['location'].map(lambda x: "M" if x == "mm" else x)

#New dataframe:
df_new = df[(df['location'].isin(['K', 'L', 'M']))]


#Data filtering:
df_new = df_new[df_new['value'] >= 0]

df_new.drop(df_new[df_new['value'] > 400].index, inplace = True)

df_new.drop(df_new[df_new['value'] <2].index, inplace = True)

diurnal = df_new[df_new['value']].resample('12h')

diurnal_mean = diurnal.mean()[diurnal.count() >= 9]

daily_mean=diurnal_mean.resample('d').mean()

df_month=daily_mean.resample('m').mean()

df_yearly=df_month[df_month['value']].resample('y')

#For plotting:

df_grouped = df_new.groupby(['location', 'Year']).agg({'value':'sum'}).reset_index()

sns.barplot(x='location',y='value',hue='Year',data= df_grouped)

This is one of the many errors that come up:

"None of [Float64Index([22.73, 64.81,  8.67, 19.98, 33.12, 37.81, 39.87, 42.29, 37.81,\n              36.51,\n              ...\n               11.0,  40.0,  23.0,  80.0,  50.0,  60.0,  40.0,  80.0,  80.0,\n               17.0],\n             dtype='float64', length=63846)] are in the [columns]"
ERROR:root:Invalid alias: The name clear can't be aliased because it is another magic command.

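Here is a sample dataframe showing what I need to plot; after the quality-control operations and resampling, the value column should ideally hold the resampled values.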

                           Unnamed: 0 location  value
date
2017-10-21 08:45:00+05:30        8335        M  339.3
2017-08-18 17:45:00+05:30        8344        M   45.1
2017-11-08 13:15:00+05:30        8347        L  594.4
2017-10-21 13:15:00+05:30        8659        N  189.9
2017-08-18 15:45:00+05:30        8662        N   46.5

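This is what part of the actual data looks like after selecting the chosen locations. I am a new user, so I cannot attach a screenshot of the plot I need. This question extends one I posted earlier, with the added requirement of plotting resampled data rather than simple value counts: Iteration over years to plot different group values as bar plot in pandas

Any help would be much appreciated.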

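【Comments】:

Please post the data of df for a reproducible example, and ideally show the desired result.
df before or after? As a test of a minimal reproducible example, try running exactly what you posted (data + code) in an empty Python environment and make sure it reproduces the error or the undesired result.
This is a dummy dataframe that shows what I want to plot; ideally, the value column should contain the final resampled, quality-controlled concentrations.

Tags: python-3.x pandas dataframe pandas-groupby timeserieschart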

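【Solution 1】:

Fundamentally, your errors come from this unclear indexing, where you pass a column of continuous float values as the selector on a dataframe whose index is currently of datetime type. pandas treats those floats as labels to look up, which is why the message lists them as not being in the columns.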

df_new[df_new['value']]           # INDEXING DATETIME USING FLOAT VALUES
...
df_month[df_month['value']]       # COLUMN value DOES NOT EXIST
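To make the failure concrete, here is a minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration, not taken from the question). It contrasts passing a float Series as the selector with selecting the column by name and with a boolean mask.

```python
import pandas as pd

# Hypothetical toy frame: datetime index, one float column.
idx = pd.to_datetime(["2017-01-01 00:00", "2017-01-01 06:00", "2017-01-02 00:00"])
toy = pd.DataFrame({"value": [1.5, 2.5, 4.0]}, index=idx)

# A float Series used as the selector is read as a list of column labels
# (1.5, 2.5, 4.0), none of which exist, so pandas raises KeyError.
try:
    toy[toy["value"]]
    raised = False
except KeyError:
    raised = True

# Selecting by column name returns the Series that resample can work on.
series = toy["value"]

# A boolean Series, by contrast, is a valid row filter.
filtered = toy[toy["value"] >= 2]

print(raised, len(series), len(filtered))  # True 3 2
```

Only a boolean mask or a list of existing column labels is a valid selector inside the square brackets.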

You probably intended to select the column value (out of the other columns) for the resampling.

diurnal = df_new['value'].resample('12h')

# KEEP ONLY 12h BINS WITH AT LEAST 9 READINGS
diurnal_mean = diurnal.mean()[diurnal.count() >= 9]

daily_mean = diurnal_mean.resample('d').mean()
df_month = daily_mean.resample('m').mean()       # NO ['value'] NEEDED: ALREADY A SERIES
df_yearly = df_month.resample('y').mean()
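As a small check of the count-based filter, the sketch below uses made-up readings (15 hourly values, chosen so that one 12-hour bin has 12 readings and the other only 3). The threshold of 9 comes from the question; everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 12 hourly values in the first half-day, 3 in the second.
start = pd.Timestamp("2017-01-01")
idx = start + pd.to_timedelta(np.arange(15), unit="h")
values = pd.Series([10.0] * 12 + [100.0] * 3, index=idx, name="value")

bins = values.resample("12h")

# Keep a 12h mean only when at least 9 readings back it.
half_day_mean = bins.mean()[bins.count() >= 9]

# The sparse second bin is gone, so it cannot skew the daily mean.
daily_mean = half_day_mean.resample("D").mean()

print(half_day_mean.tolist())  # [10.0]
print(daily_mean.tolist())     # [10.0]
```

Without the filter, the three outlying readings would pull the daily mean of the raw values up to 28.0.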

However, the code above does not keep the location for plotting. So instead of resample, use groupby(pd.Grouper(...)):

# AGGREGATE TO KEEP LOCATION AND 12h
diurnal = (df_new.groupby(["location", pd.Grouper(freq='12h')])["value"]
                 .agg(["count", "mean"])
                 .reset_index().set_index(['date'])
           )
# FILTER
diurnal_sub = diurnal[diurnal["count"] >= 9]

# MULTIPLE DATE TIME LEVEL MEANS
daily_mean = diurnal_sub.groupby(["location", pd.Grouper(freq='d')])["mean"].mean()
df_month = diurnal_sub.groupby(["location", pd.Grouper(freq='m')])["mean"].mean()
df_yearly = diurnal_sub.groupby(["location", pd.Grouper(freq='y')])["mean"].mean()

print(df_yearly)
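The difference between a plain resample and grouping by location plus a time Grouper can be seen on a tiny example. The frame below is hypothetical (two locations, four readings) and only meant to illustrate the idea; the index is named date, as in the question.

```python
import pandas as pd

# Hypothetical readings for two locations; the datetime index is named "date".
toy = pd.DataFrame(
    {"location": ["K", "L", "K", "L"], "value": [1.0, 10.0, 3.0, 30.0]},
    index=pd.DatetimeIndex(
        ["2017-01-01 01:00", "2017-01-01 01:00",
         "2017-01-01 02:00", "2017-01-02 01:00"],
        name="date",
    ),
)

# Plain resample pools both locations into each daily bin.
pooled = toy["value"].resample("D").mean()

# Grouping by location plus a time Grouper keeps the location in the result.
per_location = toy.groupby(["location", pd.Grouper(freq="D")])["value"].mean()

print(pooled)
print(per_location)
```

In `per_location` the daily mean for K on 2017-01-01 is 2.0 and for L it is 10.0, whereas `pooled` mixes the three readings of that day into a single number.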

Demonstration with random, reproducible data:

Data

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(242020)
random_df = pd.DataFrame({'date': (np.random.choice(pd.date_range('2017-01-01', '2019-12-31'), 5000) + 
                                   pd.to_timedelta(np.random.randint(60*60, 60*60*24, 5000), unit='s')),
                          'location': np.random.choice(list("KLM"), 5000),
                          'value': np.random.uniform(10, 1000, 5000)                          
                         })

Aggregation

loc_list = list("KLM")

# NEW DATA FRAME WITH DATA FILTERING
df_new = (random_df.set_index(random_df['date'])
                   .assign(Year = lambda x: x['date'].dt.year,
                           location = lambda x: x['location'].where(x["location"] != "mm", "M"))
                   .query('(location == @loc_list) and (value >= 2 and value <= 400)')
          )

# 12h AGGREGATION
diurnal = (df_new.groupby(["location", pd.Grouper(freq='12h')])["value"]
                 .agg(["count", "mean"])
                 .reset_index().set_index(['date'])
                 .query("count >= 2")
          )


# d, m, y AGGREGATION
daily_mean = diurnal.groupby(["location", pd.Grouper(freq='d')])["mean"].mean()
df_month = diurnal.groupby(["location", pd.Grouper(freq='m')])["mean"].mean()
df_yearly = (diurnal.groupby(["location", pd.Grouper(freq='y')])["mean"].mean()
                    .reset_index()
                    .assign(Year = lambda x: x["date"].dt.year)
            )

print(df_yearly)
#   location       date        mean  Year
# 0        K 2017-12-31  188.984592  2017
# 1        K 2018-12-31  199.521702  2018
# 2        K 2019-12-31  216.497268  2019
# 3        L 2017-12-31  214.347873  2017
# 4        L 2018-12-31  199.232711  2018
# 5        L 2019-12-31  177.689221  2019
# 6        M 2017-12-31  222.412711  2017
# 7        M 2018-12-31  241.597977  2018
# 8        M 2019-12-31  215.554228  2019

Plotting

sns.set()
fig, axs = plt.subplots(figsize=(12,5))
sns.barplot(x='location', y='mean', hue='Year', data= df_yearly, ax=axs)

plt.title("Location Value Yearly Aggregation", weight="bold", size=16)
plt.show()
plt.clf()
plt.close()

【Comments】: