【Question title】: Inserting missing categories and dates in a pandas DataFrame
【Posted】: 2021-08-27 10:36:44
【Question】:

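I have the following DataFrame. I want to add every score level (high, mid, low) for each group (a, b, c, d) and for every date (there are two dates: 2020-06-01 and 2020-06-02).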

from itertools import product

import numpy as np
import pandas as pd

x = pd.DataFrame(data={ 'date'  : ['2020-06-01','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],
                        'group' : ['a','a','a','b','b','c','c','c','d'],
                        'score' : ['high','low','mid','low','high','high','high','mid','high'],
                        'count' : [12,13,2,19,22,3,4,49,12]})

I can add all score categories for every group as shown below, but I can't work out how to add the dates as well.

cats = ['high', 'mid','low'] 
x_re = pd.DataFrame(list(product(x['group'].unique(), cats)),columns=['group', 'score'])
x_re.merge(x, how='left').fillna(0)

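The expected output looks like this: 6 rows per group, that is 3 rows per date, one for each score category. Wherever a data point is missing, count should be filled with np.nan (zero would also be fine).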

pd.DataFrame(data={ 'date'  : ['2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],                        
                        'group' : ['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','c','d','d','d','d','d','d'],
                        'score' : ['high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid'],
                        'count' : [12, 13, np.nan, np.nan, np.nan, 2, np.nan, 19, np.nan, 22, np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan, 49, np.nan, np.nan, np.nan, 12, np.nan, np.nan]})

Any suggestions would be great, thanks.

【Comments】:

    Tags: python pandas dataframe grouping


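    【Solution 1】:

    Your solution can be adapted by also including the unique values of the date column. Because it relies on merge, it works even when the (date, group, score) triples are not unique in the input data: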

    cats = ['high', 'mid','low'] 
    x_re = pd.DataFrame(list(product(x['date'].unique(), 
                                     x['group'].unique(), 
                                     cats)),columns=['date','group', 'score'])
    x = x_re.merge(x, how='left').fillna(0)
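
For reference, here is a self-contained sketch of the merge approach above, using the sample data from the question. The names grid, keys, full, full_zero and dup are illustrative and not part of the original answer. Passing on=keys is optional (merge joins on the shared columns by default) but makes the join explicit. The last lines illustrate the point about non-unique triples: a duplicated input row simply shows up twice in the result.

```python
from itertools import product

import pandas as pd

x = pd.DataFrame({
    'date':  ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
              '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})
cats = ['high', 'mid', 'low']
keys = ['date', 'group', 'score']

# Full grid: every date x every group x every score level.
grid = pd.DataFrame(
    list(product(x['date'].unique(), x['group'].unique(), cats)),
    columns=keys,
)

# A left merge keeps every grid row; combinations absent from x get NaN.
full = grid.merge(x, on=keys, how='left')

# Optional: zeros instead of NaN, converting count back to integers.
full_zero = full.fillna({'count': 0}).astype({'count': 'int64'})

# With a repeated (date, group, score) row the merge still works;
# the repeated row appears twice in the output.
dup = pd.concat([x, x.iloc[[0]]], ignore_index=True)
full_dup = grid.merge(dup, on=keys, how='left')
```

Filling only the count column keeps the fill from touching any other columns that might be added later.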
    

    A similar solution uses reindex with a 3-level MultiIndex (this one needs the date, group, score triples to be unique):

    cats = ['high', 'mid','low'] 
    x_re = pd.MultiIndex.from_product([x['date'].unique(), 
                                       x['group'].unique(),
                                       cats],names=['date','group', 'score'])
    
    x = x.set_index(['date','group','score']).reindex(x_re).reset_index()
    print (x)
              date group score  count
    0   2020-06-01     a  high   12.0
    1   2020-06-01     a   mid    NaN
    2   2020-06-01     a   low   13.0
    3   2020-06-01     b  high    NaN
    4   2020-06-01     b   mid    NaN
    5   2020-06-01     b   low   19.0
    6   2020-06-01     c  high    3.0
    7   2020-06-01     c   mid    NaN
    8   2020-06-01     c   low    NaN
    9   2020-06-01     d  high    NaN
    10  2020-06-01     d   mid    NaN
    11  2020-06-01     d   low    NaN
    12  2020-06-02     a  high    NaN
    13  2020-06-02     a   mid    2.0
    14  2020-06-02     a   low    NaN
    15  2020-06-02     b  high   22.0
    16  2020-06-02     b   mid    NaN
    17  2020-06-02     b   low    NaN
    18  2020-06-02     c  high    4.0
    19  2020-06-02     c   mid   49.0
    20  2020-06-02     c   low    NaN
    21  2020-06-02     d  high   12.0
    22  2020-06-02     d   mid    NaN
    23  2020-06-02     d   low    NaN
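
A possible variation on the reindex approach, sketched under two assumptions that are not in the original answer: zeros are wanted instead of NaN, and repeated (date, group, score) rows should be summed. Since reindex needs unique index labels, the counts are aggregated first. The names full_index, counts and filled are illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    'date':  ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
              '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})
cats = ['high', 'mid', 'low']
keys = ['date', 'group', 'score']

# Every combination of date, group and score level.
full_index = pd.MultiIndex.from_product(
    [x['date'].unique(), x['group'].unique(), cats],
    names=keys,
)

# Collapse repeated key triples (assumption: summing is the right aggregation),
# so that the index passed to reindex is unique.
counts = x.groupby(keys)['count'].sum()

# fill_value=0 writes zeros for the combinations missing from the data.
filled = counts.reindex(full_index, fill_value=0).reset_index()
```

The rows follow the order of full_index, so the score levels appear as high, mid, low within each group.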
    

    A single call to unstack followed by a single call to stack also works, but then every value in cats must already be present in the input data:

    x = (x.set_index(['date', 'group', 'score'])
          .unstack(['group','score'])
          .stack([1, 2], dropna=False)
          .reset_index())
    print (x)
              date group score  count
    0   2020-06-01     a  high   12.0
    1   2020-06-01     a   low   13.0
    2   2020-06-01     a   mid    NaN
    3   2020-06-01     b  high    NaN
    4   2020-06-01     b   low   19.0
    5   2020-06-01     b   mid    NaN
    6   2020-06-01     c  high    3.0
    7   2020-06-01     c   low    NaN
    8   2020-06-01     c   mid    NaN
    9   2020-06-01     d  high    NaN
    10  2020-06-01     d   low    NaN
    11  2020-06-01     d   mid    NaN
    12  2020-06-02     a  high    NaN
    13  2020-06-02     a   low    NaN
    14  2020-06-02     a   mid    2.0
    15  2020-06-02     b  high   22.0
    16  2020-06-02     b   low    NaN
    17  2020-06-02     b   mid    NaN
    18  2020-06-02     c  high    4.0
    19  2020-06-02     c   low    NaN
    20  2020-06-02     c   mid   49.0
    21  2020-06-02     d  high   12.0
    22  2020-06-02     d   low    NaN
    23  2020-06-02     d   mid    NaN
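
If some score level might be missing from the input entirely, one alternative (a different technique from the unstack/stack answer, shown only as a sketch) is to declare the levels with a Categorical and group with observed=False, which returns the Cartesian product of the grouper values. In this illustration the 'low' rows are deliberately dropped to simulate such input; sub and out are illustrative names, and the missing combinations come back as 0 because an empty group sums to zero.

```python
import pandas as pd

x = pd.DataFrame({
    'date':  ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
              '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})
cats = ['high', 'mid', 'low']

# Simulate input in which one level ('low') never occurs.
sub = x[x['score'] != 'low'].copy()

# Declare the full set of levels, including the one absent from the data.
sub['score'] = pd.Categorical(sub['score'], categories=cats)

# observed=False keeps unobserved categories in the result.
out = (sub.groupby(['date', 'group', 'score'], observed=False)['count']
          .sum()
          .reset_index())
```

Note that all four groups still appear here only because each of them has at least one remaining row in sub.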
    

    【Comments】:

      【Solution 2】:

      When you don't have too many levels, a simple approach is unstack/stack:

      (x.set_index(['date', 'group', 'score'])
        .unstack('group').stack(dropna=False)
        .unstack('score').stack(dropna=False)
        .reset_index()
      )
      

      Output:

                date group score  count
      0   2020-06-01     a  high   12.0
      1   2020-06-01     a   low   13.0
      2   2020-06-01     a   mid    NaN
      3   2020-06-01     b  high    NaN
      4   2020-06-01     b   low   19.0
      5   2020-06-01     b   mid    NaN
      6   2020-06-01     c  high    3.0
      7   2020-06-01     c   low    NaN
      8   2020-06-01     c   mid    NaN
      9   2020-06-01     d  high    NaN
      10  2020-06-01     d   low    NaN
      11  2020-06-01     d   mid    NaN
      12  2020-06-02     a  high    NaN
      13  2020-06-02     a   low    NaN
      14  2020-06-02     a   mid    2.0
      15  2020-06-02     b  high   22.0
      16  2020-06-02     b   low    NaN
      17  2020-06-02     b   mid    NaN
      18  2020-06-02     c  high    4.0
      19  2020-06-02     c   low    NaN
      20  2020-06-02     c   mid   49.0
      21  2020-06-02     d  high   12.0
      22  2020-06-02     d   low    NaN
      23  2020-06-02     d   mid    NaN
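
When there are many key columns, chaining one unstack/stack pair per level becomes unwieldy. A small generic helper built on reindex is one way around that. This is only a sketch: complete_grid and its levels parameter are hypothetical names introduced here, not part of pandas or of the answer above, and the helper assumes the key combinations are unique.

```python
import pandas as pd


def complete_grid(df, keys, levels=None):
    """Return df with one row per combination of the key columns.

    levels optionally maps a key column to the full list of values it
    should take; other keys use the values observed in df.
    Assumes the key combinations in df are unique.
    """
    levels = levels or {}
    full = pd.MultiIndex.from_product(
        [levels[k] if k in levels else df[k].unique() for k in keys],
        names=keys,
    )
    # Missing combinations are filled with NaN by reindex.
    return df.set_index(keys).reindex(full).reset_index()


x = pd.DataFrame({
    'date':  ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
              '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})

out = complete_grid(x, ['date', 'group', 'score'],
                    levels={'score': ['high', 'mid', 'low']})
```

Adding another key column then only means adding its name to the keys list.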
      

      【Comments】:

        【Solution 3】:

        If I understand correctly, you can abstract this away with the complete function from pyjanitor:

        # pip install pyjanitor
        import pandas as pd
        import janitor
        x.complete(['date', 'group', 'score'])
        
                  date group score  count
        0   2020-06-01     a  high   12.0
        1   2020-06-01     a   low   13.0
        2   2020-06-01     a   mid    NaN
        3   2020-06-01     b  high    NaN
        4   2020-06-01     b   low   19.0
        5   2020-06-01     b   mid    NaN
        6   2020-06-01     c  high    3.0
        7   2020-06-01     c   low    NaN
        8   2020-06-01     c   mid    NaN
        9   2020-06-01     d  high    NaN
        10  2020-06-01     d   low    NaN
        11  2020-06-01     d   mid    NaN
        12  2020-06-02     a  high    NaN
        13  2020-06-02     a   low    NaN
        14  2020-06-02     a   mid    2.0
        15  2020-06-02     b  high   22.0
        16  2020-06-02     b   low    NaN
        17  2020-06-02     b   mid    NaN
        18  2020-06-02     c  high    4.0
        19  2020-06-02     c   low    NaN
        20  2020-06-02     c   mid   49.0
        21  2020-06-02     d  high   12.0
        22  2020-06-02     d   low    NaN
        23  2020-06-02     d   mid    NaN
        

        【Comments】:
