Interpolate annual data to hourly as a function in Python (answers)

【Question title】: Interpolate annual data to hourly as a function in Python
【Posted】: 2021-11-22 18:05:57
【Question】:

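I want to take annual population data and interpolate it to an hourly time series. I am trying to write a function that, for a given range of sample years, produces an hourly population time series for each unique name. My code and sample data are below: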

import pandas as pd
import random
from scipy.interpolate import interp1d

name = ['RI', 'NH', 'MA', 'RI', 'NH', 'MA','RI', 'NH', 'MA','RI', 'NH', 'MA']
year = [2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018]
population = random.sample(range(10000, 300000), 12)

df_pop = pd.DataFrame(list(zip(name, year, population)))

start_year = 2015 
end_year = 2018 

def pop_sum(df_pop, start_year, end_year):

    names = df_pop['name'].unique()

    df = pd.DataFrame([])
    for i in names):

        t = df_pop['year']
        y1 = df_pop['population']
        x = pd.DataFrame({'Hours': pd.date_range(f'{start_year}-01-01', f'{end_year}-12-31',
                                                 freq='1H', closed='left')})

        pop_interp = interp1d(t, y1, x, 'linear')
    
        df = df.append(pop_interp)

    return df

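However, this script does not work, and it fails to loop over the names. I looked for resources online, but converting a yearly time series to hourly is far less common than going from hourly to yearly. I tried SciPy's interp1d, but I am open to suggestions for other packages that can do the same job. Thanks in advance for any advice.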

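【Question comments】:

Please create a small reproducible DataFrame with the expected output.
A year has 8,760 hours. Are you sure you want that level of granularity?
@ddejohn, yes, it will later need to be combined with other hourly datasets that will be used in an ML model.

Tags: python pandas scipy time-series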

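【Answer 1】:

You can convert the years to datetimes, set them as the index, reindex to an hourly frequency, and interpolate with df.interpolate (which wraps SciPy), using whichever method makes sense for your purpose: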

import random
import pandas as pd
import matplotlib.pyplot as plt

# Ensure reproducibility
random.seed(123)

# Your example data
name = ['RI', 'NH', 'MA', 'RI', 'NH', 'MA','RI', 'NH', 'MA','RI', 'NH', 'MA']
year = [2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018]
population = random.sample(range(10000, 300000), 12)

# Build DataFrame
df = pd.DataFrame({'name': name,
                   'year': pd.to_datetime(year, format='%Y'),
                   'pop': population})

# Reshape
df = df.pivot(index='year', columns='name', values='pop')
print(df)

name            MA      NH      RI
year                              
2015-01-01   55710  150339   37453
2016-01-01   66465  149750  223511
2017-01-01  291124  208770   30003
2018-01-01   37211  188676  184167

# Build an hourly DatetimeIndex
idx = pd.date_range(df.index.min(), df.index.max(), freq='H')
print(len(idx))

26305

# Reindex and interpolate with cubicspline as an example
res = df.reindex(idx).interpolate('cubicspline')

# Inspect
print(res.head().round(1))

name                      MA        NH       RI
2015-01-01 00:00:00  55710.0  150339.0  37453.0
2015-01-01 01:00:00  55672.8  150330.3  37523.4
2015-01-01 02:00:00  55635.6  150321.6  37593.9
2015-01-01 03:00:00  55598.4  150312.9  37664.3
2015-01-01 04:00:00  55561.3  150304.2  37734.7
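
The question's comments mention combining the hourly result with other hourly datasets. One way this might be done is to reshape the wide result into long format and merge on timestamp and name. The sketch below is not part of the original answer: the tiny `res` frame stands in for the interpolated result, and the `other` frame with its `load` column is a hypothetical second dataset.

```python
import pandas as pd

# Stand-in for the interpolated wide result (one column per name)
idx = pd.date_range("2015-01-01", periods=3, freq=pd.Timedelta(hours=1))
res = pd.DataFrame({"MA": [1.0, 2.0, 3.0], "RI": [10.0, 20.0, 30.0]}, index=idx)
res.columns.name = "name"

# Wide -> long: one row per (hour, name) pair
long = (
    res.rename_axis("hour")
       .reset_index()
       .melt(id_vars="hour", var_name="name", value_name="pop")
)

# Hypothetical second hourly dataset, available only for MA
other = pd.DataFrame({"hour": idx, "name": "MA", "load": [5.0, 6.0, 7.0]})

# A left merge keeps every population row; 'load' is NaN where no match exists
merged = long.merge(other, on=["hour", "name"], how="left")
print(merged)
```

A left merge is used here so that no interpolated rows are dropped; an inner merge would keep only the names present in both datasets.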

# Plot and visually check if interpolation makes sense
# for your data and purpose
fig, ax = plt.subplots()

color = ['C0', 'C1', 'C2']
res.plot(ax=ax, color=color, legend=False)
df.plot(ax=ax, color=color, marker='o', linewidth=0, clip_on=False)
ax.set_xlabel(None);
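
Since the question asked for a function, the steps above could be wrapped roughly as follows. This is a minimal sketch, not part of the original answer: the function name `annual_to_hourly`, its `method` parameter, and the assumption that `df_pop` has columns named 'name', 'year', and 'population' are all illustrative. It defaults to `method="time"` (linear in time) rather than the cubic spline used above.

```python
import pandas as pd


def annual_to_hourly(df_pop, start_year, end_year, method="time"):
    """Interpolate annual values to an hourly series, one column per name.

    Assumes df_pop has columns 'name', 'year' (int) and 'population'.
    """
    # Keep only the requested years and turn each year into a Jan 1 timestamp
    sel = df_pop[df_pop["year"].between(start_year, end_year)].copy()
    sel["year"] = pd.to_datetime(sel["year"].astype(str), format="%Y")

    # One column per name, one row per annual timestamp
    wide = sel.pivot(index="year", columns="name", values="population")

    # Hourly index from the first to the last annual observation;
    # a Timedelta avoids depending on a particular frequency alias
    idx = pd.date_range(wide.index.min(), wide.index.max(),
                        freq=pd.Timedelta(hours=1))

    # Annual rows fall exactly on the hourly grid, so reindex keeps them
    return wide.reindex(idx).interpolate(method=method)


# Small made-up example: two names, two years
df_pop = pd.DataFrame({
    "name": ["RI", "NH", "RI", "NH"],
    "year": [2015, 2015, 2016, 2016],
    "population": [100.0, 200.0, 465.0, 200.0],
})
hourly = annual_to_hourly(df_pop, 2015, 2016)
print(hourly.head())
```

The result covers Jan 1 of the first year through Jan 1 of the last year, because those are the only points where annual values are known; extending it to Dec 31 of the last year would require extrapolation.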

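【Comments】:

【Answer 2】:

I noticed that even though you loop over an array of names, you never use the name inside the loop. You write for i in names, but i is not used in the loop body. Each iteration therefore produces the same result as the previous one, because no variable changes between iterations.

Because each iteration is appended to the bottom of the new DataFrame, all results end up in the same column. Instead, you can extract a small DataFrame for each name and run the calculation on that data. You will also want to either use the name as the index or add a column called 'name' to the final DataFrame.

Something like: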

import pandas as pd
from scipy.interpolate import interp1d

# Assumes df_pop has columns 'name', 'year' and 'population'
names = df_pop['name'].unique()
rows = []  # one record per name

for i in names:
    mini_df = df_pop[df_pop['name'] == i]  # all rows for this name
    t = mini_df['year']
    y1 = mini_df['population']

    # interp1d(x, y, kind=...) returns a callable; the points to evaluate
    # are passed when calling it, not to the constructor
    pop_interp = interp1d(t, y1, kind='linear')
    rows.append({'name': i, 'function': pop_interp})  # collect a new row

# DataFrame.append was removed in pandas 2.0, so build the frame once
df = pd.DataFrame(rows, columns=['name', 'function'])

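Assuming interp1d is what you want (I am not very familiar with it), I think this structure will do a better job of producing a distinct result for each name.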
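
To get hourly values out of an interp1d callable, the hourly timestamps have to be expressed in the same units as the original x values, here years. The sketch below is one possible way to do that and is not from the original answer: the sample numbers are made up, and the fractional-year conversion (year plus the elapsed fraction of that year, accounting for leap years) is an assumption about how the hours should map onto the annual axis.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Made-up annual data for a single name
years = np.array([2015, 2016, 2017])
population = np.array([1000.0, 2000.0, 1500.0])

# Build the interpolant over the annual points
f = interp1d(years, population, kind="linear")

# Hourly timestamps from the first to the last annual observation
hours = pd.date_range("2015-01-01", "2017-01-01", freq=pd.Timedelta(hours=1))

# Express each timestamp as a fractional year
days_in_year = np.where(hours.is_leap_year, 366, 365)
elapsed_days = (hours.dayofyear.to_numpy() - 1) + hours.hour.to_numpy() / 24
frac_year = hours.year.to_numpy() + elapsed_days / days_in_year

# Evaluate the interpolant on the hourly grid
hourly = pd.Series(f(frac_year), index=hours, name="population")
print(hourly.head())
```

By default interp1d raises an error for points outside the range of the original x values, which is why the hourly grid stops at Jan 1 of the last year.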

【Comments】: