【Question title】: Fill NaNs from another dataframe with group by
【Posted】: 2019-10-01 16:59:24
【Question】:

I have 2 dataframes.

The first one looks like this:

Month DayOfWeek  Class A1  A2 ... A999
July  Monday     Bata  7   9  ... 5
July  Tuesday    Bata  3   1  ... 2
July  Sunday     Bata  4   5  ... 6
July  Monday     Adid  9   8  ... 5
July  Sunday     Adid  4   0  ... 4
Sept  Monday     Nike  7   5  ... 7
Sept  Sunday     Nike  8   3  ... 7
Sept  Satday     Adid  2   7  ... 7
Sept  Monday     Bata  8   9  ... 4
Oct   Monday     Nike  4   2  ... 5
Oct   Sunday     Bata  8   6  ... 3

My second dataframe looks like this:

Month DayOfWeek  Class A1  A2 ... A999
Jul   Monday     Bata  5   7      8
Oct   Monday     Adid  1   2      3
Sep   Monday     Bata  3   7      6
Sep   Monday     Nike  8   3      8
Jul   Monday     Adid  NaN NaN    NaN
Sep   Sunday     Nike  NaN NaN    NaN
Oct   Satday     Nike  NaN NaN    NaN
Sep   Monday     Bata  NaN NaN    NaN

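The first dataframe, df1, has no NaNs. In the second dataframe, df2, almost half of the values in columns A1 to A999 are NaN.

The number of columns varies: it could be A1 to A10, or A1 to A2567. It can be any number of columns.

I want to fill these NaNs in df2 with the mean from df1 for the same Month and DayOfWeek.

I posted another question earlier, but the situation has changed: the data is now split into 2 dataframes with an unknown number of columns.

This is what I have so far: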

Mth = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
Wk = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for m in Mth:
    for w in Wk:
        print(w,m, df[(df["Month"]==m) & (df["DayOfWeek"]==w) ].mean())

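I don't know where to go from here. How can I apply this to all columns without specifying the column names? The desired result: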

Month DayOfWeek  Class A1  A2 ... A999
Jul   Monday     Bata  5   7      8
Oct   Monday     Adid  1   2      3
Sep   Monday     Bata  3   7      6
Sep   Monday     Nike  8   3      8
Jul   Monday     Adid  NaN NaN    NaN  <--- Avg of Monday Jul in df1 for each column
Sep   Sunday     Nike  NaN NaN    NaN  <--- Avg of Sunday Sep in df1 for each column
Oct   Satday     Nike  NaN NaN    NaN  <--- Avg of Satday Oct in df1 for each column
Sep   Monday     Bata  NaN NaN    NaN  <--- Avg of Monday Sep in df1 for each column

How can I do this?
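
For reference, the per-month, per-weekday means that the nested loop in the question prints can probably be computed for every value column at once with a single groupby. The sketch below uses a small hypothetical stand-in for df1 (two A columns instead of A1..A999) and assumes that every column other than Month, DayOfWeek and Class is a numeric value column.

```python
import pandas as pd

# Hypothetical miniature of df1; A1 and A2 stand in for A1..A999.
df1 = pd.DataFrame({
    "Month": ["Jul", "Jul", "Jul", "Sep"],
    "DayOfWeek": ["Monday", "Monday", "Sunday", "Monday"],
    "Class": ["Bata", "Adid", "Bata", "Nike"],
    "A1": [7, 9, 4, 7],
    "A2": [9, 8, 5, 5],
})

key_cols = ["Month", "DayOfWeek"]

# Pick up the value columns without naming them one by one:
# everything that is not a key or the Class label (an assumption about the layout).
value_cols = [c for c in df1.columns if c not in key_cols + ["Class"]]

# One row per (Month, DayOfWeek) pair, one mean per value column.
group_means = df1.groupby(key_cols)[value_cols].mean()
print(group_means)
```

Because the value columns are derived from df1.columns, the same lines should work whether the frame has 10 or 2567 A columns.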

【Comments】:

    Tags: python dataframe


    【Solution 1】:

    You can use groupby, merge and update, as shown below.

    Generate dummy data:

    import numpy as np
    import pandas as pd

    Mth = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    Wk = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    
    def generate(nan=False):
    
        values = np.random.rand(20,20)
        if nan:
            nan_mask = np.random.choice([False,False,True], (20,20))
            values[nan_mask] = np.nan
    
        df = pd.DataFrame(values, columns = [f"A{i}" for i in range(values.shape[1])])
        df_ = pd.DataFrame()
        df_["Month"] = np.random.choice(Mth,20)
        df_["DayOfWeek"] = np.random.choice(Wk,20)
    
        df = pd.concat([df_, df], sort=False, axis=1)
    
    
        return df
    
    df1 = generate()
    df2 = generate(True)
    

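    Solution: first compute the mean for each Month/DayOfWeek combination, then merge those means onto the index of the original data, and finally use the means to update the original data.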

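    # Mean of every A column for each (Month, DayOfWeek) combination in df1
    means = df1.groupby(["Month", "DayOfWeek"]).mean().reset_index()
    # Look the means up by df2's keys so the rows line up with df2's index
    means = df2[["Month", "DayOfWeek"]].merge(means, how="left", on=["Month", "DayOfWeek"])
    
    display(df2)  # display() exists in Jupyter/IPython; use print() elsewhere
    df3 = df2.copy()
    # overwrite=False only fills the cells that are NaN in df3
    df3.update(means, overwrite=False)
    display(df3)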
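
The same three steps (group means, merge onto the keys, fill) can also be sketched with fillna instead of update. This is a minimal example on small hypothetical data, not the answerer's exact code. It assumes the Month labels already match between the two frames (in the question df1 uses "July"/"Sept" while df2 uses "Jul"/"Sep", so they would need to be normalized first) and that every non-key column is numeric.

```python
import numpy as np
import pandas as pd

keys = ["Month", "DayOfWeek"]

# Hypothetical stand-ins for df1 (complete) and df2 (with NaNs).
df1 = pd.DataFrame({
    "Month": ["Jul", "Jul", "Sep", "Sep"],
    "DayOfWeek": ["Monday", "Monday", "Sunday", "Monday"],
    "A1": [7.0, 9.0, 8.0, 7.0],
    "A2": [9.0, 8.0, 3.0, 5.0],
})
df2 = pd.DataFrame({
    "Month": ["Jul", "Sep", "Jul", "Sep"],
    "DayOfWeek": ["Monday", "Monday", "Monday", "Sunday"],
    "A1": [5.0, 3.0, np.nan, np.nan],
    "A2": [7.0, 7.0, np.nan, np.nan],
})

# Value columns are discovered, not listed by hand.
value_cols = [c for c in df2.columns if c not in keys]

# Step 1: per-group means from df1.
means = df1.groupby(keys)[value_cols].mean().reset_index()

# Step 2: look up the matching mean row for every row of df2.
aligned = df2[keys].merge(means, on=keys, how="left")
aligned.index = df2.index  # merge returns a fresh RangeIndex; restore df2's labels

# Step 3: fill only the NaN cells; existing values in df2 are kept.
filled = df2.copy()
filled[value_cols] = df2[value_cols].fillna(aligned[value_cols])
print(filled)
```

Rows of df2 whose Month/DayOfWeek pair never occurs in df1 get no match in the left merge and therefore stay NaN.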
    

    【Comments】:

    • df3.update(means, overwrite=False) raises a "Data overlaps" error.
    【Solution 2】:

    I think this might work:

      result = pd.concat([df1, df2]).groupby(['Month','DayOfWeek','Class'], as_index=False,axis=0).mean().dropna()
    

    The output looks something like this:

         Month DayOfWeek Class   A1   A2  A999
     2   July    Monday  Adid  9.0  8.0   5.0
     3   July    Monday  Bata  7.0  9.0   5.0
     4   July    Sunday  Adid  4.0  0.0   4.0
     5   July    Sunday  Bata  4.0  5.0   6.0
     6   July   Tuesday  Bata  3.0  1.0   2.0
     8    Oct    Monday  Nike  4.0  2.0   5.0
    

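    concat lets you combine your dataframes. I assume you want to group by Month, DayOfWeek and Class. The arguments "as_index=False,axis=0" keep the group keys as ordinary columns and group along the rows; it is concat that lets you mix dataframes with different sets of columns. Grouping by Month, DayOfWeek and Class creates one row for every combination present in the data, including rows like this one: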

           Month DayOfWeek Class   A1   A2  A999
      0    Jul    Monday  Adid    NaN  NaN   NaN  
    

    In this particular case the row holds no data and is of no interest for printing, so the solution is to append dropna() at the end.
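
To see what the concat, groupby and dropna chain does, here is a small sketch on hypothetical two-row frames. It leaves out axis=0, which is the default. The Month labels are deliberately kept different between the frames ("July" vs "Jul"), as they are in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniatures of df1 (complete) and df2 (one all-NaN row).
df1 = pd.DataFrame({
    "Month": ["July", "Oct"],
    "DayOfWeek": ["Monday", "Monday"],
    "Class": ["Adid", "Nike"],
    "A1": [9.0, 4.0],
    "A2": [8.0, 2.0],
})
df2 = pd.DataFrame({
    "Month": ["Jul", "Oct"],
    "DayOfWeek": ["Monday", "Monday"],
    "Class": ["Adid", "Nike"],
    "A1": [np.nan, 6.0],
    "A2": [np.nan, 4.0],
})

# Stack both frames, then average each (Month, DayOfWeek, Class) group.
combined = pd.concat([df1, df2])
grouped = combined.groupby(["Month", "DayOfWeek", "Class"], as_index=False).mean()

# Groups whose values are all NaN produce NaN means and are removed here.
result = grouped.dropna()
print(grouped)
print(result)
```

On this data the "Jul" group contains only NaNs, so dropna() removes that row instead of filling it, and the "Oct" group is averaged across both frames.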

    Hope this helps.

    【Comments】:
