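【Question title】: Reshaping data with dates as column values
【Posted】: 2020-08-19 12:56:24
【Question description】:

I'm trying to reshape some data with pandas and am having a hard time getting it into the right format. Roughly, the data looks like this*: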

import numpy as np
import pandas as pd

df = pd.DataFrame({'PRODUCT':['1','2'],
          'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
          'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
          'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
          'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]})
print(df)

  PRODUCT DESIGN_START DESIGN_COMPLETE PRODUCTION_START PRODUCTION_COMPLETE
0       1   2020-01-05      2020-01-22       2020-02-07                 NaT
1       2   2020-01-17      2020-03-04       2020-03-15          2020-04-28

I'd like to reshape the data so that it looks like this:

reshaped_df = pd.DataFrame({'DATE':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17'),
                          pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04'),
                          pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15'),
                          np.nan,pd.Timestamp('2020-04-28')],
                  'STAGE':['design','design','design','design','production','production','production','production'],
                  'STATUS':['started','started','completed','completed','started','started','completed','completed']})

print(reshaped_df)

        DATE       STAGE     STATUS
0 2020-01-05      design    started
1 2020-01-17      design    started
2 2020-01-22      design  completed
3 2020-03-04      design  completed
4 2020-02-07  production    started
5 2020-03-15  production    started
6        NaT  production  completed
7 2020-04-28  production  completed

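How can I do this? Is there a better format to reshape it into?

Eventually I'd like to run some grouped summaries on the data, such as how many times each step occurred, e.g.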

reshaped_df.groupby(['STAGE','STATUS'])['DATE'].count()

STAGE       STATUS   
design      completed    2
            started      2
production  completed    1
            started      2
Name: DATE, dtype: int64

Thanks

  • The data actually contains many start/stop date columns for the different stages of the manufacturing pipeline

【Comments】:

  • Do you need that null (row 6)?

Tags: python pandas pivot reshape group-summaries


【Solution 1】:

Melt it!!!

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'PRODUCT':['1','2'],
    'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
    'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]
})

df = df.melt(id_vars=['PRODUCT'])
df_split = df['variable'].str.split('_', n=1, expand=True)
df['STAGE'] = df_split[0]
df['STATUS'] = df_split[1]
df.drop(columns=['variable'], inplace=True)
df = df.rename(columns={'value': 'DATE'})

print(df)

Output:

  PRODUCT       DATE       STAGE    STATUS
0       1 2020-01-05      DESIGN     START
1       2 2020-01-17      DESIGN     START
2       1 2020-01-22      DESIGN  COMPLETE
3       2 2020-03-04      DESIGN  COMPLETE
4       1 2020-02-07  PRODUCTION     START
5       2 2020-03-15  PRODUCTION     START
6       1        NaT  PRODUCTION  COMPLETE
7       2 2020-04-28  PRODUCTION  COMPLETE

Mwahahahaha!!! Feel the power of the melt!!!
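To get from that long frame to the summary the question asks for, one possible follow-up is sketched here. The label mapping (START to started, COMPLETE to completed, stage names lowercased) is an assumption read off the desired output in the question, and `long_df`, `parts`, and `counts` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

# Wide to long: one row per (PRODUCT, original column name) pair.
long_df = df.melt(id_vars=['PRODUCT'], value_name='DATE')

# Split 'DESIGN_START' into 'DESIGN' and 'START', removing the helper column.
parts = long_df.pop('variable').str.split('_', n=1, expand=True)

# Assumed label mapping, taken from the desired output in the question.
long_df['STAGE'] = parts[0].str.lower()
long_df['STATUS'] = parts[1].map({'START': 'started', 'COMPLETE': 'completed'})

# count() skips NaT, so the unfinished production run is not counted.
counts = long_df.groupby(['STAGE', 'STATUS'])['DATE'].count()
print(counts)
```

With the sample data this gives 2 for every stage/status pair except production/completed, which is 1, matching the summary in the question.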

Melt is basically an unpivot: it turns the wide date columns into rows.
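A small round trip may make the "unpivot" point concrete. This is only a sketch on a cut-down, two-column version of the sample data; `wide`, `long_df`, and `back` are illustrative names.

```python
import pandas as pd

wide = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': pd.to_datetime(['2020-01-05', '2020-01-17']),
    'DESIGN_COMPLETE': pd.to_datetime(['2020-01-22', '2020-03-04']),
})

# melt: every date column becomes rows of (PRODUCT, variable, DATE).
long_df = wide.melt(id_vars='PRODUCT', var_name='variable', value_name='DATE')

# pivot: spread the 'variable' values back out into columns again.
back = (
    long_df.pivot(index='PRODUCT', columns='variable', values='DATE')
    .reset_index()
    .rename_axis(columns=None)[wide.columns]  # restore the original column order
)

print(long_df)
print(back)
```

Here `back` holds the same columns and values as `wide`, which is what "melt undoes a pivot" amounts to when each (PRODUCT, variable) pair appears only once.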

【Comments】:

    【Solution 2】:

    Drop PRODUCT, turn the columns into a MultiIndex, and stack them:

    new_cols = pd.MultiIndex.from_product([['design', 'production'], ['started', 'completed']], names=['STAGE', 'STATUS'])
    df.drop(columns='PRODUCT') \
        .set_axis(new_cols, axis=1) \
        .stack([0,1]) \
        .groupby(['STAGE', 'STATUS']) \
        .count()
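
The hard-coded from_product call only lines up if the date columns appear in exactly that order, and the question mentions many more stages. A minimal sketch that derives the two levels from the column names instead follows. It assumes every date column is named STAGE_STATUS with a single underscore between the two parts, and the labels come out as lowercased name fragments (start/complete) rather than the started/completed wording in the question. `dates` and `counts` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

dates = df.drop(columns='PRODUCT')

# Build the (STAGE, STATUS) pairs from the names instead of hard-coding them.
dates.columns = pd.MultiIndex.from_tuples(
    [tuple(name.lower().split('_', 1)) for name in dates.columns],
    names=['STAGE', 'STATUS'],
)

# Stack both column levels into the index, then count non-missing dates.
counts = (
    dates.stack(['STAGE', 'STATUS'])
    .groupby(level=['STAGE', 'STATUS'])
    .count()
)
print(counts)
```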
    

    【Comments】:

      【Solution 3】:

      We can combine pd.wide_to_long with stack and then reorder the df:

      s=pd.wide_to_long(df,['DESIGN','PRODUCTION'],i='PRODUCT',j='STATUS',suffix=r'\w+',sep='_').\
           stack(dropna=False).reset_index(level=[1,2]).sort_values('level_2').\
             reset_index(drop=True).rename(columns={'level_2':'STAGE',0:'DATE'})
      print(s)

           STATUS       STAGE       DATE
      0     START      DESIGN 2020-01-05
      1     START      DESIGN 2020-01-17
      2  COMPLETE      DESIGN 2020-01-22
      3  COMPLETE      DESIGN 2020-03-04
      4     START  PRODUCTION 2020-02-07
      5     START  PRODUCTION 2020-03-15
      6  COMPLETE  PRODUCTION        NaT
      7  COMPLETE  PRODUCTION 2020-04-28
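
The one-liner above does two things at once, so here is a sketch of the same idea in two visible steps, with melt standing in for stack in the second step. `half_long` and `long_df` are illustrative names, and unlike the answer this version keeps the PRODUCT column.

```python
import pandas as pd

df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

# Step 1: wide_to_long peels the suffix (START / COMPLETE) off into STATUS,
# leaving one column per stub name (DESIGN, PRODUCTION).
half_long = pd.wide_to_long(
    df,
    stubnames=['DESIGN', 'PRODUCTION'],
    i='PRODUCT',
    j='STATUS',
    sep='_',
    suffix=r'\w+',  # the suffixes are words, not the default digits
)
print(half_long)

# Step 2: fold the remaining stage columns into rows; melt keeps the NaT row.
long_df = half_long.reset_index().melt(
    id_vars=['PRODUCT', 'STATUS'], var_name='STAGE', value_name='DATE'
)
print(long_df)
```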
      

      【Comments】:

        【Solution 4】:

        Convert the columns to lowercase and split on '_'; setting expand=True turns them into a MultiIndex:

        df.columns = df.columns.str.lower().str.split('_',expand=True)
        df.columns = df.columns.set_names(['stage','status'])
        
        print(df)
        
        product              design             production
        NaN       start     complete    start      complete
        0   1   2020-01-05  2020-01-22  2020-02-07  NaT
        1   2   2020-01-17  2020-03-04  2020-03-15  2020-04-28
        

        The next stage is a combination of stack, sort_values, droplevel, reset_index, and reindex:

        res = (df
               .stack([0,1])
               .sort_values()
               .droplevel(0)
               .reset_index(name='Date')
               .reindex(['Date','stage','status'],axis=1)
              )
        
        res
        
        
              Date      stage       status
        0   2020-01-05  design      start
        1   2020-01-17  design      start
        2   2020-01-22  design      complete
        3   2020-02-07  production  start
        4   2020-03-04  design      complete
        5   2020-03-15  production  start
        6   2020-04-28  production  complete
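
The sorted result above no longer says which product each date belongs to, because droplevel(0) discards the row label. A sketch that keeps it, by moving the product column into the index before the column names are split, is shown below. It assumes the original wide frame from the question; `wide` and `res` are illustrative names, and the explicit dropna() is there so the missing completion date is dropped regardless of how a given pandas version's stack treats missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

# Keep PRODUCT out of the columns so only date columns get split and stacked.
wide = df.set_index('PRODUCT')
wide.columns = (
    wide.columns.str.lower()
    .str.split('_', expand=True)
    .set_names(['stage', 'status'])
)

res = (
    wide.stack(['stage', 'status'])  # Series indexed by (PRODUCT, stage, status)
    .dropna()                        # drop the unfinished production run
    .sort_values()                   # chronological order
    .reset_index(name='date')
)
print(res)
```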
        

        If all you are after is the grouping and aggregation, you can skip the long route and go straight from the stack:

        df.stack([0,1]).groupby(['stage','status']).count()
        
        
          stage       status  
        design      complete    2
                    start       2
        production  complete    1
                    start       2
        Name: Date, dtype: int64
        

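        Update 2021/06/01:

        You can use the pivot_longer function from pyjanitor to abstract the reshaping; at the time of this update you have to install the latest development version from GitHub: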

        # install the latest dev version of pyjanitor
        # pip install git+https://github.com/ericmjl/pyjanitor.git
        import janitor
        df.rename(columns=str.lower).pivot_longer(
            index="product",
            names_sep="_",
            names_to=("stage", "status"),
            values_to="date",
        )
        
          product   stage      status       date
        0   1       design      start       2020-01-05
        1   2       design      start       2020-01-17
        2   1       design      complete    2020-01-22
        3   2       design      complete    2020-03-04
        4   1       production  start       2020-02-07
        5   2       production  start       2020-03-15
        6   1       production  complete    NaT
        7   2       production  complete    2020-04-28
        

        【Comments】:
