【Question title】: Prevent Pandas read_csv from interpreting NA as NaN but retaining NaN for empty values
【Posted】: 2020-02-19 23:41:59
【Question】:

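My question is related to this one. I have a file named "test.csv" in which "NA" appears as a value of region. I want it to be read as the string "NA", not as NaN. However, other columns in test.csv contain missing values, and I want those to stay NaN. How can I do this?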

# test.csv looks like this:

Here is what I tried:

import pandas as pd
# This reads NA as NaN
df = pd.read_csv('test.csv')
df
    region  date    expenses
0   NaN   1/1/2019  53
1   EU    1/2/2019  NaN

# This reads NA as NA, but doesn't read missing expense as NaN
df = pd.read_csv('test.csv', keep_default_na=False, na_values='_')
df
    region  date    expenses
0   NA    1/1/2019  53
1   EU    1/2/2019  

# What I want:
    region  date    expenses
0   NA    1/1/2019  53
1   EU    1/2/2019  NaN

The problem with adding keep_default_na=False is that the second value of expenses is no longer read in as NaN. As a result, pd.isnull(df['expenses'][1]) returns False.
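
For readers who want to reproduce this without the original file, here is a minimal sketch. The CSV text is an assumption reconstructed from the frames printed in the question (the real test.csv is not shown), and the names CSV_TEXT, df_default and df_keep are only illustrative.

```python
import io

import pandas as pd

# Assumed file contents, inferred from the output shown in the question.
CSV_TEXT = "region,date,expenses\nNA,1/1/2019,53\nEU,1/2/2019,\n"

# Default parsing: the string "NA" is on pandas' default missing-value list,
# so the first region is read as NaN.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
print(df_default["region"].isna().tolist())

# Defaults disabled and only "_" declared as missing: "NA" survives,
# but the empty expenses cell stays an empty string instead of NaN.
df_keep = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False, na_values="_")
print(repr(df_keep["expenses"][1]), pd.isnull(df_keep["expenses"][1]))
```

With the defaults switched off, nothing tells the parser that an empty field is missing, which is why the isnull check fails.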

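【Comments】:

  • In the linked post, null values are represented by an underscore, which is why they set na_values='_'. In your case the missing data seems to be represented by an empty string, so I would go with na_values='' (in addition to keep_default_na=False). If that solves your problem, then this is clearly a duplicate.

Tags: python pandas csv nan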


【Solution 1】:

This approach worked for me:

import pandas as pd
df = pd.read_csv('Test.csv')
co1 col2  col3  col4
a   b    c  d   e
NaN NaN NaN NaN NaN
2   3   4   5   NaN

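I copied the list of values that pandas interprets as NaN by default, then commented out "NA", which I want to be read as an ordinary string. With this approach every default missing-value marker except "NA" is still treated as NaN.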

# You can also create your own list of values that should be treated as NaN,
# then pass it to na_values and set keep_default_na=False.
na_values = ["",
             "#N/A",
             "#N/A N/A",
             "#NA",
             "-1.#IND",
             "-1.#QNAN",
             "-NaN",
             "-nan",
             "1.#IND",
             "1.#QNAN",
             "<NA>",
             "N/A",
             # "NA",  # left out on purpose so that "NA" is kept as a string
             "NULL",
             "NaN",
             "n/a",
             "nan",
             "null"]

df1 = pd.read_csv('Test.csv', na_values=na_values, keep_default_na=False)

      co1  col2  col3  col4
a     b     c     d     e
NaN  NA   NaN    NA   NaN
2     3     4     5   NaN
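
As a quick check of the approach above, the following sketch applies the same idea to inline data. The CSV text and the names NA_MARKERS and CSV_TEXT are assumptions made for illustration; the marker list is the one from this answer with "NA" left out.

```python
import io

import pandas as pd

# Markers that should still count as missing; "NA" is deliberately absent.
NA_MARKERS = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NULL", "NaN", "n/a", "nan", "null",
]

# Hypothetical data: "NA" is a real region code, the other gaps are missing values.
CSV_TEXT = (
    "region,date,expenses\n"
    "NA,1/1/2019,53\n"
    "EU,1/2/2019,\n"
    "N/A,1/3/2019,null\n"
)

df = pd.read_csv(io.StringIO(CSV_TEXT), na_values=NA_MARKERS, keep_default_na=False)
print(df)
```

Here "NA" should come through as a string, while the empty cell, "N/A" and "null" are all parsed as NaN.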

【Comments】:

    【Solution 2】:

    This works for me:

    df = pd.read_csv('file.csv', keep_default_na=False, na_values=[''])
    

    which gives:

      region      date  expenses
    0     NA  1/1/2019      53.0
    1     EU  1/2/2019       NaN
    

    But I would rather play it safe, since other columns may contain other NaN markers, and do this instead:

    df = pd.read_csv('file.csv')
    df['region'] = df['region'].fillna('NA')
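
A self-contained sketch of this fallback, using assumed inline data in place of file.csv (the third row is hypothetical and added only to show a caveat). One thing to keep in mind: fillna('NA') cannot tell a literal "NA" from a genuinely empty region cell, so both end up as the string "NA".

```python
import io

import pandas as pd

# Assumed data; the last row has a truly empty region field.
CSV_TEXT = (
    "region,date,expenses\n"
    "NA,1/1/2019,53\n"
    "EU,1/2/2019,\n"
    ",1/3/2019,12\n"
)

# Read with the default missing-value handling, then restore "NA" in one column.
df = pd.read_csv(io.StringIO(CSV_TEXT))
df["region"] = df["region"].fillna("NA")
print(df)
```

If the region column can legitimately be empty, the na_values approach is the safer choice of the two.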
    

    【Comments】:

      【Solution 3】:

      When keep_default_na=False is specified, none of the default values are treated as NaN, so you have to list them yourself:

      Use keep_default_na=False, na_values=['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NULL', 'NaN', 'n/a', 'nan', 'null']

      【Comments】:
