【Question title】: `unique` in DataFrame.describe() does not work [python] [pandas]
【Posted】: 2016-02-15 05:29:41
【Question】:

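Hello, this is a basic question, but I can't work it out. unique() shows the unique values in each column, but describe() shows NaN in the `unique` row for those same columns. Why? Any help is appreciated. Thanks.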

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv', header=0)

# works:
train['Pclass'].unique()
# array([3, 1, 2], dtype=int64)
train['Survived'].unique()
# array([0, 1], dtype=int64)

# does not work as expected:
train.describe(include='all')
#         PassengerId    Survived      Pclass               Name   Sex  \
# count    891.000000  891.000000  891.000000                891   891   
# unique          NaN         NaN         NaN                891     2   
# top             NaN         NaN         NaN  Mitkoff, Mr. Mito  male   
# freq            NaN         NaN         NaN                  1   577   
# mean     446.000000    0.383838    2.308642                NaN   NaN   
# std      257.353842    0.486592    0.836071                NaN   NaN   
# min        1.000000    0.000000    1.000000                NaN   NaN   
# 25%      223.500000    0.000000    2.000000                NaN   NaN   
# 50%      446.000000    0.000000    3.000000                NaN   NaN   
# 75%      668.500000    1.000000    3.000000                NaN   NaN   
# max      891.000000    1.000000    3.000000                NaN   NaN   
# 
#                Age       SibSp       Parch  Ticket        Fare        Cabin  \
# count   714.000000  891.000000  891.000000     891  891.000000          204   
# unique         NaN         NaN         NaN     681         NaN          147   
# top            NaN         NaN         NaN  347082         NaN  C23 C25 C27   
# freq           NaN         NaN         NaN       7         NaN            4   
# mean     29.699118    0.523008    0.381594     NaN   32.204208          NaN   
# std      14.526497    1.102743    0.806057     NaN   49.693429          NaN   
# min       0.420000    0.000000    0.000000     NaN    0.000000          NaN   
# 25%      20.125000    0.000000    0.000000     NaN    7.910400          NaN   
# 50%      28.000000    0.000000    0.000000     NaN   14.454200          NaN   
# 75%      38.000000    1.000000    0.000000     NaN   31.000000          NaN   
# max      80.000000    8.000000    6.000000     NaN  512.329200          NaN   
# 
#        Embarked  
# count       889  
# unique        3  
# top           S  
# freq        644  
# mean        NaN  
# std         NaN  
# min         NaN  
# 25%         NaN  
# 50%         NaN  
# 75%         NaN  
# max         NaN  

【Comments】:

    Tags: python pandas unique describe


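    【Solution 1】:

    For numeric columns, the describe method does not list the number of unique values, because that is usually not particularly meaningful for numeric data. For string columns, it does: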

    import pandas as pd
    df = pd.DataFrame({'string_column': ['a', 'a', 'b'], 'numeric': [1, 2, 1]})
    
    df['numeric'].describe()
    Out[6]: 
    count    3.000000
    mean     1.333333
    std      0.577350
    min      1.000000
    25%      1.000000
    50%      1.000000
    75%      1.500000
    max      2.000000
    Name: numeric, dtype: float64
    
    df['string_column'].describe()
    Out[7]: 
    count     3
    unique    2
    top       a
    freq      2
    Name: string_column, dtype: object
    

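    Because your DataFrame contains both kinds of column, the two sets of results are merged, and NaN is inserted wherever a column does not have that statistic.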
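If all you need is the number of distinct values per column, regardless of dtype, one option is `DataFrame.nunique()`. The sketch below reuses the small `df` from this answer and assumes a pandas version that provides `DataFrame.nunique()` (it was not available when the question was asked). It puts the merged `describe(include='all')` row next to the per-column counts.

```python
import pandas as pd

# Same toy frame as in the answer: one string column, one numeric column.
df = pd.DataFrame({'string_column': ['a', 'a', 'b'], 'numeric': [1, 2, 1]})

# Merged summary: the 'unique' row is only filled in for the string column.
summary = df.describe(include='all')
print(summary.loc['unique'])

# Distinct-value count for every column, independent of dtype.
unique_counts = df.nunique()
print(unique_counts)
```

Here `summary.loc['unique', 'numeric']` is NaN, while `unique_counts['numeric']` is 2, which matches what `df['numeric'].unique()` would report.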

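    If your numeric columns really just hold codes for different classes or categories, you may want to convert them to Categorical to get more meaningful information about them: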

    df['categorized'] = pd.Categorical(df['numeric'])
    
    df['categorized'].describe()
    Out[10]: 
    count     3
    unique    2
    top       1
    freq      2
    Name: categorized, dtype: int64
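Applied to the question's data, the same idea might look like the sketch below. The frame `train_sample` is a hypothetical stand-in with made-up rows, not the real `train.csv`, and treating `Survived` and `Pclass` as category codes is an assumption based on the values shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of train.csv (values are made up).
train_sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Pclass': [3, 1, 3, 2, 3],
    'Fare': [7.25, 71.28, 7.92, 13.00, 8.05],
})

# Columns assumed to hold category codes rather than measurements.
code_columns = ['Survived', 'Pclass']

# Convert only those columns, leaving true numeric columns such as Fare alone.
as_categories = train_sample[code_columns].astype('category')

# Categorical columns get count / unique / top / freq instead of mean / std.
category_summary = as_categories.describe(include='category')
print(category_summary)
```

With the real data, the same two lines would be run on `train[code_columns]` instead of `train_sample[code_columns]`.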
    

    【Discussion】:

    • Very helpful on numeric vs. string columns and on the use of pd.Categorical.