【Question title】: Prevent pandas from reading "NA" as NaN
【Posted】: 2017-01-01 16:58:10
【Question】:

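The .csv file I am reading contains cells with the value "NA". Pandas automatically converts these to NaN, which I do not want. I know about the keep_default_na=False parameter, but it changes the column's dtype to object, which means pd.get_dummies does not work properly.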

Is there a way to stop pandas from reading "NA" as NaN without changing the dtype?

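【Comments on the question】:

  • From the wording of your question, it sounds like you want pandas to read the string "NA" and store it in a non-object column (for example, a float or integer column). Is that right? If so, no, that is not possible.
  • @ajcr Oops, you're right. I didn't realize until now how absurd that is. Back to the drawing board, I guess.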
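
To make the point in the comments concrete, here is a small sketch. The column names and sample data are made up for illustration and are not from the original question. It suggests that a column holding the literal string "NA" next to numbers cannot stay numeric, and that pd.get_dummies should still be able to treat "NA" as an ordinary category in a string column.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: "code" mixes numbers with the literal text "NA".
txt = """code,country
1,US
NA,NA
3,DE"""

default = pd.read_csv(StringIO(txt))
literal = pd.read_csv(StringIO(txt), keep_default_na=False)

# Default parsing turns "NA" into NaN, so "code" can be a float column.
print(default["code"].dtype)

# With keep_default_na=False the column must hold the string "NA",
# so every value in it is read as text.
print(literal["code"].tolist())

# "NA" is simply one more category when encoding the string column.
dummies = pd.get_dummies(literal["country"])
print(sorted(dummies.columns))
```

If the numeric column really needs both numbers and a "not applicable" marker, one option is to keep NaN in the numeric column and carry the marker in a separate column.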

Tags: python pandas


【Solution 1】:

keep_default_na=False works for me:

from io import StringIO
import pandas as pd

txt = """col1,col2
a,b
NA,US"""

print(pd.read_csv(StringIO(txt), keep_default_na=False))

  col1 col2
0    a    b
1   NA   US

Without it:

print(pd.read_csv(StringIO(txt)))

  col1 col2
0    a    b
1  NaN   US

【Comments】:

  • You should also specify na_values if your data contains markers that must still be interpreted as missing. You can do that like this: na_values=['NULL','null', 'nan','NaN']
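
A minimal sketch of what this comment describes, reusing the sample from the answer with one extra row (the extra row is an assumption added for illustration). keep_default_na=False switches off the built-in list, and na_values names the markers that should still become NaN.

```python
from io import StringIO

import pandas as pd

# Same sample as in the answer, plus one made-up row with null markers.
txt = """col1,col2
a,b
NA,US
NULL,nan"""

df = pd.read_csv(
    StringIO(txt),
    keep_default_na=False,                    # "NA" is no longer a missing marker
    na_values=["NULL", "null", "nan", "NaN"],  # these still are
)
print(df)
```

Here "NA" in col1 should survive as a string, while "NULL" and "nan" in the last row are parsed as NaN.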

【Solution 2】:

This is what the pandas documentation says:

na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

keep_default_na : bool, default True
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
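
The quoted text mentions two options that the answers here do not demonstrate: passing a dict to na_values for per-column markers, and na_filter=False. The following is a sketch under the assumption of a made-up two-column file; the column names are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: "NA" is a real value in "name" but means missing in "region".
txt = """name,region
NA,NA
Bob,EU"""

# Per-column markers: only "region" treats "NA" as missing.
per_column = pd.read_csv(
    StringIO(txt),
    keep_default_na=False,
    na_values={"region": ["NA"]},
)
print(per_column)

# na_filter=False disables missing-value detection entirely.
raw = pd.read_csv(StringIO(txt), na_filter=False)
print(raw)
```

The dict form is handy when only some columns legitimately contain "NA" (for example a country or region code), so the default handling can be relaxed just where it causes trouble.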

【Comments】:

【Solution 3】:

This approach worked for me:

    import pandas as pd
    df = pd.read_csv('Test.csv')
    print(df)

    co1 col2  col3  col4
    a   b    c  d   e
    NaN NaN NaN NaN NaN
    2   3   4   5   NaN
    

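I copied the values that pandas interprets as NaN by default into a list, then commented out "NA", which I want to be read as a regular value. With this approach, every default marker other than "NA" is still treated as NaN.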

    na_values = ["",
                 "#N/A",
                 "#N/A N/A",
                 "#NA",
                 "-1.#IND",
                 "-1.#QNAN",
                 "-NaN",
                 "-nan",
                 "1.#IND",
                 "1.#QNAN",
                 "<NA>",
                 "N/A",
                 # "NA",  # left out on purpose so that "NA" stays a string
                 "NULL",
                 "NaN",
                 "n/a",
                 "nan",
                 "null"]
    df1 = pd.read_csv('Test.csv', na_values=na_values, keep_default_na=False)
    print(df1)
    
          co1  col2  col3  col4
    a     b     c     d     e
    NaN  NA   NaN    NA   NaN
    2     3     4     5   NaN
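
Since Test.csv is not available, here is a self-contained sketch of the same idea using made-up inline data. It builds the list by filtering "NA" out of the default markers listed in this answer; the exact default set can differ between pandas versions, so treat the list as an assumption.

```python
from io import StringIO

import pandas as pd

# Default markers as listed in the answer (may vary by pandas version).
DEFAULT_NA = ["", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
              "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA",
              "NULL", "NaN", "n/a", "nan", "null"]

# Keep every marker except the literal "NA".
na_values = [v for v in DEFAULT_NA if v != "NA"]

# Hypothetical data standing in for Test.csv.
txt = """id,state,note
1,NA,N/A
2,CA,
3,NULL,ok"""

df = pd.read_csv(StringIO(txt), na_values=na_values, keep_default_na=False)
print(df)
print(df.dtypes)
```

In this sketch "NA" in the state column is kept as text, the other markers still become NaN, and the purely numeric id column keeps an integer dtype.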
    

【Comments】:

【Solution 4】:

You can try converting the column value to str first:

    for index, row in df.iterrows():
        # 'your_column' is a placeholder for the column to check
        na_column = str(row['your_column'])
        if na_column != 'nan':
            # do something with the value
            pass
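
As a possible alternative to the row-by-row loop, the same check can be sketched with a boolean mask. The frame and the column name your_column are placeholders invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "your_column" is a placeholder name.
df = pd.DataFrame({"your_column": ["a", np.nan, "NA", "b"]})

# True for rows holding a real value; the string "NA" counts as a value.
has_value = df["your_column"].notna()

# Work only with the rows that are not missing.
present = df.loc[has_value, "your_column"].tolist()
print(present)
```

Unlike comparing str(value) with 'nan', notna() does not confuse a literal string "nan" with a missing value, and it avoids iterating over rows.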
      

【Comments】:
