【Question title】: Unstack Groupby does not group the data in proper dataset using Pandas
【Posted】: 2021-02-06 02:39:42
【Question】:

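Hello data scientists and Pandas experts,

I need some help because I cannot get my data organized correctly.

When I use unstack after a groupby, the columns are not grouped together properly. Here is my DataFrame: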

import pandas as pd

data = [
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'aemp', 'Department': 'dep1'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'aemp', 'Department': 'dep1'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'bemp', 'Department': 'dep1'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'bemp', 'Department': 'dep1'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'cemp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'cemp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store1', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'eemp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'eemp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'femp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'eemp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'femp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'femp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'aemp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'aemp', 'Department': 'dep1'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'gemp', 'Department': 'dep2'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-05 00:00:00'), 'Employee': 'gemp', 'Department': 'dep2'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'gemp', 'Department': 'dep2'},\
{'Store': 'Store2', 'Date': pd.Timestamp('2020-08-09 00:00:00'), 'Employee': 'cemp', 'Department': 'dep2'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'eemp', 'Department': 'dep1'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-05 00:00:00'), 'Employee': 'eemp', 'Department': 'dep1'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'bemp', 'Department': 'dep1'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-05 00:00:00'), 'Employee': 'bemp', 'Department': 'dep1'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'bemp', 'Department': 'dep1'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-07 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'},\
{'Store': 'Store3', 'Date': pd.Timestamp('2020-08-01 00:00:00'), 'Employee': 'demp', 'Department': 'dep2'}]
df = pd.DataFrame(data)

I would like my output to be organized as follows:

Store      Store1                Store2                          Store3
Department   dep1      dep2        dep1           dep2             dep1      dep2
Employee     aemp bemp cemp demp   aemp eemp femp cemp demp gemp   bemp eemp demp
Date
2020-08-03    1.0  1.0  2.0  3.0    1.0  1.0  2.0  0.0  1.0  1.0    2.0  1.0  1.0
2020-08-10    1.0  1.0  0.0  4.0    1.0  2.0  1.0  1.0  1.0  2.0    1.0  1.0  1.0

I used the following groupby expression (I don't know how to sort the frame by level):

df = df.groupby([pd.Grouper(key='Date', freq='W-MON'), 'Store', 'Department', 'Employee'])\
       .size().unstack(['Store', 'Department', 'Employee']).fillna(0)

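This is the result I get with the groupby expression above. Note that the Store2 / dep2 / cemp column ends up at the far right instead of with the other Store2 columns: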

Store      Store1                Store2                     Store3           Store2
Department   dep1      dep2        dep1           dep2        dep1      dep2   dep2
Employee     aemp bemp cemp demp   aemp eemp femp demp gemp   bemp eemp demp   cemp
Date
2020-08-03    1.0  1.0  2.0  3.0    1.0  1.0  2.0  1.0  1.0    2.0  1.0  1.0    0.0
2020-08-10    1.0  1.0  0.0  4.0    1.0  2.0  1.0  1.0  2.0    1.0  1.0  1.0    1.0

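Please help me fix my output so that all the columns are grouped correctly.

Thank you, I really appreciate your help.

This is a follow-up to my previous question: How to show only column with Values in Pandas Groupby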

【Discussion】:

    Tags: python pandas pandas-groupby


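    【Solution 1】:

    You are almost there. You just need to do one of the following:

    1. Change the order of the .groupby columns. The columns are unstacked in the order in which the grouped keys appear, so Date needs to come last rather than first; or
    2. Sort by index before unstacking. Rearranging the keys as in option 1 saves you this extra step.

    Rearranging the .groupby columns: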

    df = (df.groupby(['Store', 'Department', 'Employee', pd.Grouper(key='Date', freq='W-MON'), ])
            .size()
            .unstack(['Store', 'Department', 'Employee']).fillna(0))
    

    Sorting by index with sort_index() before unstacking:

    df = (df.groupby([pd.Grouper(key='Date', freq='W-MON'), 'Store', 'Department', 'Employee'])
            .size()
            .sort_index(level=['Store', 'Department', 'Employee', 'Date'])
            .unstack(['Store', 'Department', 'Employee']).fillna(0))
    Out[1]: 
    Store      Store1                Store2                          Store3       \
    Department   dep1      dep2        dep1           dep2             dep1        
    Employee     aemp bemp cemp demp   aemp eemp femp cemp demp gemp   bemp eemp   
    Date                                                                           
    2020-08-03    1.0  1.0  2.0  3.0    1.0  1.0  2.0  0.0  1.0  1.0    2.0  1.0   
    2020-08-10    1.0  1.0  0.0  4.0    1.0  2.0  1.0  1.0  1.0  2.0    1.0  1.0   
    
    Store            
    Department dep2  
    Employee   demp  
    Date             
    2020-08-03  1.0  
    2020-08-10  1.0
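
The two options above can be tried side by side on a much smaller frame. The sketch below is only an illustration: the `toy` data, the two-level key list, and the variable names are made up, not taken from the question. It also shows a third possibility that the answer does not mention, sorting the columns after unstacking with `sort_index(axis=1)`.

```python
import pandas as pd

# Made-up data: employee "b" of store "S1" only appears in the second week,
# which is the situation that pushed a column to the far right in the question.
toy = pd.DataFrame({
    "Store": ["S1", "S1", "S2", "S1", "S2"],
    "Employee": ["a", "c", "a", "b", "a"],
    "Date": pd.to_datetime(
        ["2020-08-01", "2020-08-01", "2020-08-01", "2020-08-07", "2020-08-07"]
    ),
})
keys = ["Store", "Employee"]

# Option 1: put the weekly Date grouper last, so groupby sorts by the column keys first.
by_order = (toy.groupby(keys + [pd.Grouper(key="Date", freq="W-MON")])
               .size()
               .unstack(keys)
               .fillna(0))

# Option 2: keep Date first, but sort the index by the column keys before unstacking.
by_sort = (toy.groupby([pd.Grouper(key="Date", freq="W-MON")] + keys)
              .size()
              .sort_index(level=keys + ["Date"])
              .unstack(keys)
              .fillna(0))

# Alternative (not from the answer): unstack first, then sort the column MultiIndex.
by_cols = (toy.groupby([pd.Grouper(key="Date", freq="W-MON")] + keys)
              .size()
              .unstack(keys)
              .fillna(0)
              .sort_index(axis=1))

print(by_order)
print(by_order.columns.tolist() == by_sort.columns.tolist() == by_cols.columns.tolist())
```

In all three variants the ("S1", "b") column should sit between ("S1", "a") and ("S1", "c"), with a 0.0 in the week where that employee has no rows.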
    

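    【Discussion】:

    • Thank you, David. I tried both approaches and both work as I wanted. I really appreciate your help. I chose the first approach because it should perform better. Thanks.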
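
The performance remark in the comment above can be checked rather than assumed. The following is a rough timing sketch on synthetic data; the frame size, the random values, and the function names are arbitrary choices for illustration, and the measured times will differ between machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, made up purely for timing purposes.
rng = np.random.default_rng(0)
n = 20_000
frame = pd.DataFrame({
    "Store": rng.choice(["Store1", "Store2", "Store3"], n),
    "Department": rng.choice(["dep1", "dep2"], n),
    "Employee": rng.choice(list("abcdefg"), n),
    "Date": pd.Timestamp("2020-08-01")
            + pd.to_timedelta(rng.integers(0, 60, n), unit="D"),
})
keys = ["Store", "Department", "Employee"]


def reorder_keys():
    # Approach 1: Date grouper last, no explicit sort.
    week = pd.Grouper(key="Date", freq="W-MON")
    return frame.groupby(keys + [week]).size().unstack(keys).fillna(0)


def sort_then_unstack():
    # Approach 2: Date grouper first, explicit sort_index before unstacking.
    week = pd.Grouper(key="Date", freq="W-MON")
    return (frame.groupby([week] + keys)
                 .size()
                 .sort_index(level=keys + ["Date"])
                 .unstack(keys)
                 .fillna(0))


for fn in (reorder_keys, sort_then_unstack):
    print(fn.__name__, timeit.timeit(fn, number=20))
```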