【Question Title】: Adding values for missing data combinations in Pandas
【Posted】: 2022-11-25 16:29:48
【Question】:

I have a pandas dataframe that contains something like this:

person_id   status    year    count
0           'pass'    1980    4
0           'fail'    1982    1
1           'pass'    1981    2

If I know that all the possible values for each field are:

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]

I would like to pad the original dataframe with count=0 for every missing combination of person_id, status and year, i.e. I want the new dataframe to contain:

person_id   status    year    count
0           'pass'    1980    4
0           'pass'    1981    0
0           'pass'    1982    0
0           'fail'    1980    0
0           'fail'    1981    0
0           'fail'    1982    1
1           'pass'    1980    0
1           'pass'    1981    2
1           'pass'    1982    0
1           'fail'    1980    0
1           'fail'    1981    0
1           'fail'    1982    0
2           'pass'    1980    0
2           'pass'    1981    0
2           'pass'    1982    0
2           'fail'    1980    0
2           'fail'    1981    0
2           'fail'    1982    0

Is there an efficient way to achieve this in pandas?

【Question Comments】:

    Tags: python pandas


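    【Solution 1】:

    You can use itertools.product to generate all the combinations, construct a df from them, merge it with your original df, and then use fillna to fill the missing count values with 0: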

    In [77]:
    import itertools
    import pandas as pd
    all_person_ids = [0, 1, 2]
    all_statuses = ['pass', 'fail']
    all_years = [1980, 1981, 1982]
    combined = [all_person_ids, all_statuses, all_years]
    df1 = pd.DataFrame(columns = ['person_id', 'status', 'year'], data=list(itertools.product(*combined)))
    df1
    
    Out[77]:
        person_id status  year
    0           0   pass  1980
    1           0   pass  1981
    2           0   pass  1982
    3           0   fail  1980
    4           0   fail  1981
    5           0   fail  1982
    6           1   pass  1980
    7           1   pass  1981
    8           1   pass  1982
    9           1   fail  1980
    10          1   fail  1981
    11          1   fail  1982
    12          2   pass  1980
    13          2   pass  1981
    14          2   pass  1982
    15          2   fail  1980
    16          2   fail  1981
    17          2   fail  1982
    
    In [82]:    
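    # the left merge leaves NaN (float) in unmatched rows, so cast back to int after filling
    df1 = df1.merge(df, how='left').fillna(0).astype({'count': int})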
    df1
    
    Out[82]:
        person_id status  year  count
    0           0   pass  1980      4
    1           0   pass  1981      0
    2           0   pass  1982      0
    3           0   fail  1980      0
    4           0   fail  1981      0
    5           0   fail  1982      1
    6           1   pass  1980      0
    7           1   pass  1981      2
    8           1   pass  1982      0
    9           1   fail  1980      0
    10          1   fail  1981      0
    11          1   fail  1982      0
    12          2   pass  1980      0
    13          2   pass  1981      0
    14          2   pass  1982      0
    15          2   fail  1980      0
    16          2   fail  1981      0
    17          2   fail  1982      0
    

    【Comments】:

      【Solution 2】:

      Build a MultiIndex with MultiIndex.from_product(), then chain set_index(), reindex() and reset_index():

      import pandas as pd
      import io
      
      all_person_ids = [0, 1, 2]
      all_statuses = ['pass', 'fail']
      all_years = [1980, 1981, 1982]
      # io.StringIO rather than io.BytesIO, because the literal below is a str
      df = pd.read_csv(io.StringIO("""person_id   status    year    count
      0           pass    1980    4
      0           fail    1982    1
      1           pass    1981    2"""), sep=r"\s+")
      names = ["person_id", "status", "year"]
      
      mind = pd.MultiIndex.from_product(
          [all_person_ids, all_statuses, all_years], names=names)
      df.set_index(names).reindex(mind, fill_value=0).reset_index()
      

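      【Comments】:

      • Works great. Could you roughly explain what each of the steps above does? (I haven't had to use reindex or reset_index before, but I'll read up on them shortly.)
      • reindex() aligns the rows to the new index, and fill_value=0 fills the rows that would otherwise be NaN with 0. I think you could keep the MultiIndex, since you can use it to select elements quickly. reset_index() converts the index levels back into columns.
      • This is mostly a static approach. Is there any way to do it dynamically by date?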
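
The steps described in the comments above can be sketched one at a time. The sketch below also shows one possible reading of the "dynamic" question: instead of hard-coding the years and statuses, it derives them from the data (a contiguous range from the smallest to the largest observed year). That choice is an assumption, not part of the original answer, and the person ids are still listed by hand because id 2 never appears in the data.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})
names = ["person_id", "status", "year"]

# Assumption: ids are supplied explicitly, statuses and years come from the data.
all_person_ids = [0, 1, 2]
all_statuses = sorted(df["status"].unique())
all_years = range(df["year"].min(), df["year"].max() + 1)

# Step 1: move the key columns into a MultiIndex.
indexed = df.set_index(names)

# Step 2: build the full grid of combinations and align the rows to it.
# Combinations that are absent from df get count=0 instead of NaN.
full_index = pd.MultiIndex.from_product(
    [all_person_ids, all_statuses, all_years], names=names
)
completed = indexed.reindex(full_index, fill_value=0)

# Step 3: turn the index levels back into ordinary columns.
result = completed.reset_index()
```

Because fill_value=0 is applied during reindex(), no NaN is ever introduced and count keeps its integer dtype. If a year could be missing from the data entirely at either end of the range, the bounds would have to be supplied explicitly.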
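      【Solution 3】:

      You can use pyjanitor's complete method.

      It accepts column names as input, as well as {name: values} dictionaries containing the exhaustive list of values to complete: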

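      import janitor  # pyjanitor; importing it registers DataFrame.complete
      # downcast='infer' turns the filled floats back into ints (the argument is deprecated in recent pandas releases)
      df.complete({'person_id': [0,1,2]}, 'status', 'year').fillna(0, downcast='infer')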
      

      Output:

          person_id  status  year  count
      0           0  'fail'  1980      0
      1           0  'fail'  1981      0
      2           0  'fail'  1982      1
      3           0  'pass'  1980      4
      4           0  'pass'  1981      0
      5           0  'pass'  1982      0
      6           1  'fail'  1980      0
      7           1  'fail'  1981      0
      8           1  'fail'  1982      0
      9           1  'pass'  1980      0
      10          1  'pass'  1981      2
      11          1  'pass'  1982      0
      12          2  'fail'  1980      0
      13          2  'fail'  1981      0
      14          2  'fail'  1982      0
      15          2  'pass'  1980      0
      16          2  'pass'  1981      0
      17          2  'pass'  1982      0
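
For readers who do not have pyjanitor installed, the same idea (an explicit value list for person_id, observed values for status and year) can be approximated in plain pandas. This is only a rough sketch of the concept and not how pyjanitor implements complete; the helper name complete_with_zero and its parameters are hypothetical.

```python
import pandas as pd


def complete_with_zero(df, explicit, observed, value_col="count"):
    """Hypothetical helper: expand df to every combination of the key columns.

    explicit: {column: exhaustive list of values} for columns given by hand
    observed: columns whose values are taken from df itself
    """
    levels = dict(explicit)
    for col in observed:
        # Use the sorted distinct values that actually occur in the data.
        levels[col] = sorted(df[col].unique())
    names = list(levels)
    grid = pd.MultiIndex.from_product(list(levels.values()), names=names)
    # Align to the full grid; combinations missing from df get 0.
    return (
        df.set_index(names)[[value_col]]
        .reindex(grid, fill_value=0)
        .reset_index()
    )


df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})
out = complete_with_zero(df, {"person_id": [0, 1, 2]}, ["status", "year"])
```

Sorting the observed values puts fail before pass, which is the row order shown in the output above.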
      

      【Comments】:

        【Solution 4】:
        all_person_ids = [0, 1, 2]
        all_statuses = ['pass', 'fail']
        all_years = [1980, 1981, 1982]
        
        
        # Wrap the chain in parentheses so it can span several lines;
        # df is the original dataframe from the question.
        (pd.Series(all_person_ids).to_frame('person_id')
            .merge(pd.Series(all_statuses).to_frame('status'), how='cross')
            .merge(pd.Series(all_years).to_frame('year'), how='cross')
            .merge(df, on=['person_id', 'status', 'year'], how='left')
            .fillna(0))
        
            person_id status  year  count
        0           0   pass  1980    4.0
        1           0   pass  1981    0.0
        2           0   pass  1982    0.0
        3           0   fail  1980    0.0
        4           0   fail  1981    0.0
        5           0   fail  1982    1.0
        6           1   pass  1980    0.0
        7           1   pass  1981    2.0
        8           1   pass  1982    0.0
        9           1   fail  1980    0.0
        10          1   fail  1981    0.0
        11          1   fail  1982    0.0
        12          2   pass  1980    0.0
        13          2   pass  1981    0.0
        14          2   pass  1982    0.0
        15          2   fail  1980    0.0
        16          2   fail  1981    0.0
        17          2   fail  1982    0.0
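
The float counts (4.0, 0.0, ...) in the output above come from the left merge: rows without a match hold NaN, which forces the integer column to float64. A minimal sketch of that effect, and of casting back once the gaps are filled, might look like this (the variable names are illustrative, and how='cross' needs pandas 1.2 or later):

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

# Full grid of combinations built with cross merges.
grid = (
    pd.DataFrame({"person_id": [0, 1, 2]})
    .merge(pd.DataFrame({"status": ["pass", "fail"]}), how="cross")
    .merge(pd.DataFrame({"year": [1980, 1981, 1982]}), how="cross")
)

# Unmatched rows hold NaN here, so count is upcast from int64 to float64.
merged = grid.merge(df, on=["person_id", "status", "year"], how="left")

# Fill only the count column, then cast back now that no NaN remains.
filled = merged.fillna({"count": 0})
restored = filled.astype({"count": "int64"})
```

Passing a dict to fillna limits the fill to the count column, which avoids silently writing 0 into other columns that might legitimately contain missing values.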
        

        【Comments】:
