让我们先生成一些随机数据(生成取自here的随机日期):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
frmt = '%d-%m-%Y %H:%M:%S'
stime = datetime.datetime.strptime(start, frmt)
etime = datetime.datetime.strptime(end, frmt)
td = etime - stime
dtimes = [np.random.random() * td + stime for _ in range(n)]
return [d.strftime(frmt) for d in dtimes]
# Recreat some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
现在我们可以将时间戳列转换为 pandas 时间戳,以提取每个时间戳的月份和年份:
df.timestamp = pd.to_datetime(df.timestamp)
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
我们生成一个布尔列是否signal_value 大于某个阈值(此处为 5):
df["is_larger5"] = df.signal_value > 5
最后,我们可以使用 pandas.groupby 得到每个月的平均值:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64