【Question title】: How to replace NaN in python DataFrame with data from crosstab
【Posted】: 2018-07-28 22:49:10
【Question】:

Good morning, I'm new to pandas. I have a DataFrame named df with 4 columns: Age, Survived, Pclass, and Sex (PassengerId is the index). Some of the Age values are NaN:

             Age  Survived  Pclass     Sex
PassengerId                             
6            NaN         0       3    male
18           NaN         1       2    male
20           NaN         1       3  female
27           NaN         0       3    male
29           NaN         1       3  female

I would like to replace the NaN values in Age with data from a crosstab:

mean_val = pd.crosstab(index=df["Survived"],columns=[df['Sex'],df['Pclass']],values=df['Age'],aggfunc=np.mean)

which produces the following:

    Sex          female                             male                      
Pclass            1          2          3          1          2          3
Survived                                                                  
0         25.666667  36.000000  23.818182  44.581967  33.369048  27.255814
1         34.939024  28.080882  19.329787  36.248000  16.022000  22.274211

What I would like to do is something like this:

df['Age'] = mean_val[[df['Sex']][df['Pclass']][df['Survived']]]

where the crosstab is used to look up the value for each particular passenger. The result would look like this:

                   Age  Survived  Pclass     Sex
PassengerId                             
6            27.255814         0       3    male
18           16.022000         1       2    male
20           19.329787         1       3  female
27           27.255814         0       3    male
29           19.329787         1       3  female

Thanks in advance for your help!

【Question discussion】:

    Tags: python-3.x pandas crosstab


    【Solution 1】:

    I think you need transform, replacing the NaNs in each group with that group's mean:

    df['Age'] = (df.groupby(['Survived','Sex','Pclass'])['Age']
                   .transform(lambda x: x.fillna(x.mean())))
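
    As a minimal sketch of the same idea, the group mean can also be computed with transform('mean') and passed to fillna. The toy data below is made up for illustration, and the fallback to the overall mean for groups with no known Age is an assumption, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); the (1, 'male', 1) group has no known Age.
df = pd.DataFrame({
    'Age': [np.nan, 2.0, 4.0, np.nan, 10.0],
    'Survived': [0, 0, 0, 1, 1],
    'Pclass': [3, 3, 3, 1, 2],
    'Sex': ['male', 'male', 'male', 'male', 'female'],
})

keys = ['Survived', 'Sex', 'Pclass']

# Overall mean of the known ages, computed before any filling.
overall_mean = df['Age'].mean()

# transform('mean') returns one value per row: the mean Age of that row's group
# (NaN values are skipped when the mean is computed).
group_mean = df.groupby(keys)['Age'].transform('mean')

# Fill missing ages from the group mean.
df['Age'] = df['Age'].fillna(group_mean)

# Assumption: if a whole group is NaN, fall back to the overall mean.
df['Age'] = df['Age'].fillna(overall_mean)

print(df)
```

    Without the last step, a passenger whose group contains no known Age would simply stay NaN.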
    

    If you want to use mean_val as the input:

    df = df.join(mean_val.unstack().rename('tmp'), ['Sex','Pclass','Survived'])
    df['Age'] = df['Age'].combine_first(df['tmp'])
    df = df.drop('tmp', axis=1)
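
    A variant closer to the per-passenger lookup described in the question is sketched below. It is not from the original answer: it assumes mean_val.unstack() yields a Series indexed by (Sex, Pclass, Survived), builds one such key per row, and uses reindex to look the values up. The small DataFrame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), indexed by PassengerId as in the question.
df = pd.DataFrame({
    'Age': [np.nan, np.nan, 2.0, 3.0, 4.0, 3.0],
    'Survived': [0, 1, 0, 1, 1, 0],
    'Pclass': [3, 2, 3, 2, 3, 3],
    'Sex': ['male', 'male', 'male', 'male', 'female', 'male'],
}, index=pd.Index([6, 18, 16, 118, 120, 127], name='PassengerId'))

mean_val = pd.crosstab(index=df['Survived'], columns=[df['Sex'], df['Pclass']],
                       values=df['Age'], aggfunc='mean')

# unstack() turns the table into a Series indexed by (Sex, Pclass, Survived).
lookup = mean_val.unstack()

# Build one (Sex, Pclass, Survived) key per passenger and look each one up.
row_keys = pd.MultiIndex.from_frame(df[['Sex', 'Pclass', 'Survived']])
looked_up = lookup.reindex(row_keys).to_numpy()

# Only the missing ages are replaced; known ages are kept as they are.
df['Age'] = df['Age'].fillna(pd.Series(looked_up, index=df.index))

print(df)
```

    This avoids the temporary tmp column, at the cost of building the key index explicitly.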
    

    Example:

    import numpy as np
    import pandas as pd

    c = ['PassengerId','Age','Survived','Pclass','Sex']
    df = pd.DataFrame({'PassengerId': [6, 18, 20, 27, 29, 16, 118, 120, 127, 129], 
                       'Age': [np.nan, np.nan, np.nan, np.nan, np.nan, 
                               2.0, 3.0, 4.0, 3.0, 4.0], 
                       'Survived': [0, 1, 1, 0, 1, 0, 1, 1, 0, 1], 
                       'Pclass': [3, 2, 3, 3, 3, 3, 2, 3, 3, 3], 
                       'Sex': ['male', 'male', 'female', 'male', 'female', 
                               'male', 'male', 'female', 'male', 'female']},
                       columns=c)
    
    print (df)
       PassengerId  Age  Survived  Pclass     Sex
    0            6  NaN         0       3    male
    1           18  NaN         1       2    male
    2           20  NaN         1       3  female
    3           27  NaN         0       3    male
    4           29  NaN         1       3  female
    5           16  2.0         0       3    male
    6          118  3.0         1       2    male
    7          120  4.0         1       3  female
    8          127  3.0         0       3    male
    9          129  4.0         1       3  female
    

    mean_val = pd.crosstab(index=df["Survived"],columns=[df['Sex'],df['Pclass']],values=df['Age'],aggfunc=np.mean)
    print (mean_val)
    Sex      female male     
    Pclass        3    2    3
    Survived                 
    0           NaN  NaN  2.5
    1           4.0  3.0  NaN
    
    df = df.join(mean_val.unstack().rename('tmp'), ['Sex','Pclass','Survived'])
    df['Age'] = df['Age'].combine_first(df['tmp'])
    df = df.drop('tmp', axis=1)
    print (df)
       PassengerId  Age  Survived  Pclass     Sex
    0            6  2.5         0       3    male
    1           18  3.0         1       2    male
    2           20  4.0         1       3  female
    3           27  2.5         0       3    male
    4           29  4.0         1       3  female
    5           16  2.0         0       3    male
    6          118  3.0         1       2    male
    7          120  4.0         1       3  female
    8          127  3.0         0       3    male
    9          129  4.0         1       3  female
    

    df['Age'] = (df.groupby(['Survived','Sex','Pclass'])['Age']
                   .transform(lambda x: x.fillna(x.mean())))
    
    print (df)
       PassengerId  Age  Survived  Pclass     Sex
    0            6  2.5         0       3    male
    1           18  3.0         1       2    male
    2           20  4.0         1       3  female
    3           27  2.5         0       3    male
    4           29  4.0         1       3  female
    5           16  2.0         0       3    male
    6          118  3.0         1       2    male
    7          120  4.0         1       3  female
    8          127  3.0         0       3    male
    9          129  4.0         1       3  female
    

    【Discussion】:
