Transform a pandas dataframe: need for a more efficient solution

【Question title】: Transform a pandas dataframe: need for a more efficient solution
【Posted】: 2021-01-14 23:44:40
【Question】:

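I have a dataframe indexed by dates over a certain period. Each column holds predictions of a variable's value at the end of a given year. My original dataframe looks like this: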

            2016  2017  2018
2016-01-01   0.0     1   NaN
2016-07-01   1.0     1   4.1
2017-01-01   NaN     5   3.0
2017-07-01   NaN     2   2.0

where NaN means that no prediction exists for that year.

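Since my data covers more than 20 years and most predictions are made for the next 2-3 years, my real dataframe has more than 20 columns, most of which contain NaN values. For example, the 2005 column has predictions made in 2003-2005 but is NaN for rows in the 2006-2020 range.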

I would like to transform my dataframe into something like this:

            Y_0  Y_1  Y_2
2016-01-01    0    1  NaN
2016-07-01    1    1  4.1
2017-01-01    5    3  NaN
2017-07-01    2    2  NaN

where Y_j is the prediction for year = index.year + j. This way, I would end up with a dataframe of only 4 columns (Y_0, Y_1, Y_2, Y_3).

I have actually achieved this, but in what I believe is a very inefficient way:


import numpy

for i in range(4):
    df[f'Y_{i}'] = numpy.nan  # create columns [Y_0, Y_1, Y_2, Y_3]

for index, row in df.iterrows():  # iterate through each row of df
    for year in row.dropna().index:  # iterate through each year where a prediction exists
        # difference between the year the prediction is for and the year it was made (possible values: 0, 1, 2 or 3)
        year_diff = int(year) - index.year
        # set the values for the columns 'Y_0', 'Y_1', 'Y_2' and 'Y_3' cell by cell
        df.loc[index, f'Y_{year_diff}'] = df.loc[index, year]

df = df.iloc[:, -4:]  # once both loops are done, delete all but the new columns

For a dataframe with only 1000 rows, this takes almost 3 seconds to run. Can anyone think of a better solution?

【Comments】:

Tags: python pandas dataframe

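【Solution 1】:

You can use melt to convert the dataframe to long format, then pivot back to wide format based on the year difference.

Taking your DataFrame as an example: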

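import datetime

import numpy as np
import pandas as pd

# rebuild the example with the dates as a regular column
df = pd.DataFrame({'date': [datetime.date(2016, 1, 1), datetime.date(2016, 7, 1),
                            datetime.date(2017, 1, 1), datetime.date(2017, 7, 1)],
                   2016: [0, 1, np.nan, np.nan],
                   2017: [1, 1, 5, 2],
                   2018: [np.nan, 4.1, 3, 2]})
# one row per (date, prediction_year) pair
df = df.melt(id_vars='date', value_vars=[2016, 2017, 2018], var_name='prediction_year', value_name='prediction')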

Long format:

          date  prediction_year  prediction
0   2016-01-01             2016         0.0
1   2016-07-01             2016         1.0
2   2017-01-01             2016         NaN
3   2017-07-01             2016         NaN
4   2016-01-01             2017         1.0
5   2016-07-01             2017         1.0
6   2017-01-01             2017         5.0
7   2017-07-01             2017         2.0
8   2016-01-01             2018         NaN
9   2016-07-01             2018         4.1
10  2017-01-01             2018         3.0
11  2017-07-01             2018         2.0

Convert back to the desired wide format:

df['year'] = pd.to_datetime(df['date']).dt.year
df['dt'] = df['prediction_year'] - df['year']
df = df.pivot(index = 'date', columns='dt', values='prediction').dropna(axis = 1, how = 'all').add_prefix('Y_')

            Y_0 Y_1 Y_2
date            
2016-01-01  0.0 1.0 NaN
2016-07-01  1.0 1.0 4.1
2017-01-01  5.0 3.0 NaN
2017-07-01  2.0 2.0 NaN
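
The question's dataframe keeps the dates in the index rather than in a 'date' column, so the two steps above can be sketched as a single function that starts from that layout. This is only a sketch under assumptions: the helper name melt_to_horizons and the column name 'offset' are made up for illustration, the index is assumed to be a DatetimeIndex, and each (date, offset) pair is assumed to be unique.

```python
import numpy as np
import pandas as pd


def melt_to_horizons(wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape year-named columns into Y_<offset> columns (hypothetical helper)."""
    long = (
        wide.rename_axis("date")
        .reset_index()  # move the dates from the index into a 'date' column
        .melt(id_vars="date", var_name="prediction_year", value_name="prediction")
        .dropna(subset=["prediction"])  # keep only predictions that exist
        .assign(offset=lambda d: d["prediction_year"].astype(int) - d["date"].dt.year)
    )
    out = long.pivot(index="date", columns="offset", values="prediction")
    out.columns = [f"Y_{j}" for j in out.columns]
    return out


# The example frame from the question, with the dates in the index.
wide = pd.DataFrame(
    {2016: [0, 1, np.nan, np.nan], 2017: [1, 1, 5, 2], 2018: [np.nan, 4.1, 3, 2]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)
result = melt_to_horizons(wide)
print(result)
```

Dropping the missing predictions before the pivot means offsets that never occur (such as -1 here) produce no column at all, so the trailing dropna(axis=1, how='all') is not needed in this variant.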

【Comments】:

【Solution 2】:

Let's try stack and then compute the year difference:

# if the index is not already datetime
df.index = pd.to_datetime(df.index)

df = (df.stack().reset_index()
   .assign(date_diff=lambda x: x['level_1'].astype(int) - x['level_0'].dt.year)
   .pivot(index='level_0', columns='date_diff', values=0)
   .add_prefix('Y_')
)

Output:

date_diff   Y_0  Y_1  Y_2
level_0                  
2016-01-01  0.0  1.0  NaN
2016-07-01  1.0  1.0  4.1
2017-01-01  5.0  3.0  NaN
2017-07-01  2.0  2.0  NaN
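
The question asks for exactly four columns (Y_0 to Y_3), while the output above only contains the offsets that occur in the data. One possible way to pin the column set is to reindex before adding the prefix. The sketch below is a variant of the stack approach rather than the answer's exact code: the helper name stack_to_horizons and the n_horizons parameter are hypothetical, and the explicit dropna() is there because whether stack drops missing values by default depends on the pandas version.

```python
import numpy as np
import pandas as pd


def stack_to_horizons(wide: pd.DataFrame, n_horizons: int = 4) -> pd.DataFrame:
    """Stack-based reshape with a fixed set of Y_<offset> columns (hypothetical helper)."""
    stacked = wide.stack().dropna()  # Series indexed by (date, prediction year)
    dates = stacked.index.get_level_values(0)
    years = stacked.index.get_level_values(1).astype(int)
    # Replace the prediction-year level with the offset from the row's own year.
    stacked.index = pd.MultiIndex.from_arrays(
        [dates, years - dates.year], names=["date", "offset"]
    )
    # reindex adds all-NaN columns for offsets that never occur in the data.
    out = stacked.unstack("offset").reindex(columns=range(n_horizons))
    return out.add_prefix("Y_")


wide = pd.DataFrame(
    {2016: [0, 1, np.nan, np.nan], 2017: [1, 1, 5, 2], 2018: [np.nan, 4.1, 3, 2]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)
result = stack_to_horizons(wide)
print(result)
```

With the example data, Y_3 comes out entirely NaN because no prediction is made three years ahead. Note that reindex also silently discards any offset outside 0 to n_horizons - 1, so it is worth checking the data for such values first.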

【Comments】: