Inserting missing categories and dates in a pandas DataFrame

【Question title】: inserting missing categories and dates in pandas dataframe
【Posted】: 2021-08-27 10:36:44
【Question description】:

I have the following DataFrame. For every group (a, b, c, d) and every date (there are two dates: 2020-06-01 and 2020-06-02), I want to add all score levels (high, mid, low).

import numpy as np
import pandas as pd

x = pd.DataFrame(data={ 'date'  : ['2020-06-01','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],
                        'group' : ['a','a','a','b','b','c','c','c','d'],
                        'score' : ['high','low','mid','low','high','high','high','mid','high'],
                        'count' : [12,13,2,19,22,3,4,49,12]})

I can add all score categories for every group as shown below, but I can't work out how to add the dates as well.

from itertools import product

cats = ['high', 'mid','low']
x_re = pd.DataFrame(list(product(x['group'].unique(), cats)),columns=['group', 'score'])
x_re.merge(x, how='left').fillna(0)

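The expected output looks like this: 6 rows per group, that is, 3 rows per date, one for each score category. Wherever a data point is missing, count should be filled with np.nan (zero would also be fine).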

pd.DataFrame(data={ 'date'  : ['2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],
                        'group' : ['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','c','d','d','d','d','d','d'],
                        'score' : ['high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid'],
                        'count' : [12, 13, np.nan, np.nan, np.nan, 2, np.nan, 19, np.nan, 22, np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan, 49, np.nan, np.nan, np.nan, 12, np.nan, np.nan]})

Any suggestions would be great, thanks.

【Question discussion】:

Tags: python pandas dataframe grouping

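【Solution 1】:

Your solution can be adapted by adding the unique values of the date column to the product. This approach also works when the (date, group, score) triples are not unique in the input data: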

from itertools import product

cats = ['high', 'mid','low']
x_re = pd.DataFrame(list(product(x['date'].unique(),
                                 x['group'].unique(),
                                 cats)),columns=['date','group', 'score'])
x = x_re.merge(x, how='left').fillna(0)
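For reference, here is a self-contained sketch of the merge approach on the question's data. The names `full` and `merged` are illustrative and not part of the original answer, and restricting `fillna` to the `count` column (then casting back to integers) is an assumption about what is wanted rather than something the answer prescribes.

```python
from itertools import product

import pandas as pd

x = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
             '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})

keys = ['date', 'group', 'score']
cats = ['high', 'mid', 'low']

# Every (date, group, score) combination, in the order product() yields them.
full = pd.DataFrame(
    list(product(x['date'].unique(), x['group'].unique(), cats)),
    columns=keys,
)

# A left merge keeps every row of `full`; combinations missing from x get NaN.
merged = full.merge(x, on=keys, how='left')

# Fill only the count column, so other columns are never touched by fillna.
merged['count'] = merged['count'].fillna(0).astype(int)
print(merged)
```

Passing `on=keys` explicitly makes the join columns visible instead of relying on the default of joining on all shared column names.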

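A similar solution, starting again from the original x, uses reindex with a 3-level MultiIndex. Here the (date, group, score) triples must be unique, because reindex cannot handle duplicate index labels: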

cats = ['high', 'mid','low']
x_re = pd.MultiIndex.from_product([x['date'].unique(),
                                   x['group'].unique(),
                                   cats],names=['date','group', 'score'])

x = x.set_index(['date','group','score']).reindex(x_re).reset_index()
print(x)
          date group score  count
0   2020-06-01     a  high   12.0
1   2020-06-01     a   mid    NaN
2   2020-06-01     a   low   13.0
3   2020-06-01     b  high    NaN
4   2020-06-01     b   mid    NaN
5   2020-06-01     b   low   19.0
6   2020-06-01     c  high    3.0
7   2020-06-01     c   mid    NaN
8   2020-06-01     c   low    NaN
9   2020-06-01     d  high    NaN
10  2020-06-01     d   mid    NaN
11  2020-06-01     d   low    NaN
12  2020-06-02     a  high    NaN
13  2020-06-02     a   mid    2.0
14  2020-06-02     a   low    NaN
15  2020-06-02     b  high   22.0
16  2020-06-02     b   mid    NaN
17  2020-06-02     b   low    NaN
18  2020-06-02     c  high    4.0
19  2020-06-02     c   mid   49.0
20  2020-06-02     c   low    NaN
21  2020-06-02     d  high   12.0
22  2020-06-02     d   mid    NaN
23  2020-06-02     d   low    NaN
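Since the question says zeros are also acceptable, the reindex variant can fill them directly through the `fill_value` parameter of `DataFrame.reindex`. The sketch below is one way to do that, with an explicit uniqueness check added as a precaution; the names `full_index` and `filled` are illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-02', '2020-06-01', '2020-06-02',
             '2020-06-01', '2020-06-02', '2020-06-02', '2020-06-02'],
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'],
    'score': ['high', 'low', 'mid', 'low', 'high', 'high', 'high', 'mid', 'high'],
    'count': [12, 13, 2, 19, 22, 3, 4, 49, 12],
})

keys = ['date', 'group', 'score']
cats = ['high', 'mid', 'low']

# reindex fails on duplicate index labels, so check the key triples first.
if x.duplicated(keys).any():
    raise ValueError('duplicate (date, group, score) rows; aggregate them first')

full_index = pd.MultiIndex.from_product(
    [x['date'].unique(), x['group'].unique(), cats],
    names=keys,
)

# fill_value=0 writes zeros for combinations that were absent from x.
filled = x.set_index(keys).reindex(full_index, fill_value=0).reset_index()
print(filled)
```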

It is also possible with a single call to unstack and a single call to stack, but then every value in cats must be present in the input data:

x = (x.set_index(['date', 'group', 'score'])
      .unstack(['group','score'])
      .stack([1, 2], dropna=False)
      .reset_index())
print(x)
          date group score  count
0   2020-06-01     a  high   12.0
1   2020-06-01     a   low   13.0
2   2020-06-01     a   mid    NaN
3   2020-06-01     b  high    NaN
4   2020-06-01     b   low   19.0
5   2020-06-01     b   mid    NaN
6   2020-06-01     c  high    3.0
7   2020-06-01     c   low    NaN
8   2020-06-01     c   mid    NaN
9   2020-06-01     d  high    NaN
10  2020-06-01     d   low    NaN
11  2020-06-01     d   mid    NaN
12  2020-06-02     a  high    NaN
13  2020-06-02     a   low    NaN
14  2020-06-02     a   mid    2.0
15  2020-06-02     b  high   22.0
16  2020-06-02     b   low    NaN
17  2020-06-02     b   mid    NaN
18  2020-06-02     c  high    4.0
19  2020-06-02     c   low    NaN
20  2020-06-02     c   mid   49.0
21  2020-06-02     d  high   12.0
22  2020-06-02     d   low    NaN
23  2020-06-02     d   mid    NaN
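The caveat about cats comes from the fact that unstack/stack can only combine values that already occur in the data. The sketch below illustrates the same limitation with reindex rather than with stack itself: it compares a product built from observed values against one built from an explicit category list. The frame `small` is hypothetical, chosen so that 'low' never occurs.

```python
import pandas as pd

# Hypothetical input in which the 'low' category never occurs.
small = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-02'],
    'group': ['a', 'b'],
    'score': ['high', 'mid'],
    'count': [1, 2],
})

keys = ['date', 'group', 'score']
indexed = small.set_index(keys)

# Product of the values observed in the data: 'low' cannot appear.
observed = pd.MultiIndex.from_product(
    [small[k].unique() for k in keys], names=keys
)

# Product with an explicit category list: 'low' is included.
cats = ['high', 'mid', 'low']
explicit = pd.MultiIndex.from_product(
    [small['date'].unique(), small['group'].unique(), cats], names=keys
)

from_observed = indexed.reindex(observed).reset_index()
from_explicit = indexed.reindex(explicit).reset_index()

print(len(from_observed), len(from_explicit))
```

With two dates, two groups and two observed scores, the first result has 8 rows, while the explicit list gives 12.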

【Discussion】:

【Solution 2】:

When you don't have too many levels, a simple approach is unstack/stack:

(x.set_index(['date', 'group', 'score'])
  .unstack('group').stack(dropna=False)
  .unstack('score').stack(dropna=False)
  .reset_index()
)

Output:

          date group score  count
0   2020-06-01     a  high   12.0
1   2020-06-01     a   low   13.0
2   2020-06-01     a   mid    NaN
3   2020-06-01     b  high    NaN
4   2020-06-01     b   low   19.0
5   2020-06-01     b   mid    NaN
6   2020-06-01     c  high    3.0
7   2020-06-01     c   low    NaN
8   2020-06-01     c   mid    NaN
9   2020-06-01     d  high    NaN
10  2020-06-01     d   low    NaN
11  2020-06-01     d   mid    NaN
12  2020-06-02     a  high    NaN
13  2020-06-02     a   low    NaN
14  2020-06-02     a   mid    2.0
15  2020-06-02     b  high   22.0
16  2020-06-02     b   low    NaN
17  2020-06-02     b   mid    NaN
18  2020-06-02     c  high    4.0
19  2020-06-02     c   low    NaN
20  2020-06-02     c   mid   49.0
21  2020-06-02     d  high   12.0
22  2020-06-02     d   low    NaN
23  2020-06-02     d   mid    NaN
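Note that this output lists the scores alphabetically (high, low, mid). If the order high, mid, low is preferred, one possible follow-up step is to convert score to an ordered categorical and sort. The frame `out` below is a hypothetical three-row slice of the result, used only to keep the sketch short.

```python
import pandas as pd

# Hypothetical slice of the completed frame, in alphabetical score order.
out = pd.DataFrame({
    'date': ['2020-06-01'] * 3,
    'group': ['a'] * 3,
    'score': ['high', 'low', 'mid'],
    'count': [12.0, 13.0, float('nan')],
})

# A categorical sorts by its category order rather than alphabetically.
out['score'] = pd.Categorical(
    out['score'], categories=['high', 'mid', 'low'], ordered=True
)
out = out.sort_values(['date', 'group', 'score'], ignore_index=True)
print(out)
```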

【Discussion】:

【Solution 3】:

If I understand correctly, you can abstract this away with the complete function from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor
x.complete(['date', 'group', 'score'])

          date group score  count
0   2020-06-01     a  high   12.0
1   2020-06-01     a   low   13.0
2   2020-06-01     a   mid    NaN
3   2020-06-01     b  high    NaN
4   2020-06-01     b   low   19.0
5   2020-06-01     b   mid    NaN
6   2020-06-01     c  high    3.0
7   2020-06-01     c   low    NaN
8   2020-06-01     c   mid    NaN
9   2020-06-01     d  high    NaN
10  2020-06-01     d   low    NaN
11  2020-06-01     d   mid    NaN
12  2020-06-02     a  high    NaN
13  2020-06-02     a   low    NaN
14  2020-06-02     a   mid    2.0
15  2020-06-02     b  high   22.0
16  2020-06-02     b   low    NaN
17  2020-06-02     b   mid    NaN
18  2020-06-02     c  high    4.0
19  2020-06-02     c   low    NaN
20  2020-06-02     c   mid   49.0
21  2020-06-02     d  high   12.0
22  2020-06-02     d   low    NaN
23  2020-06-02     d   mid    NaN

【Discussion】: