Custom pandas dtype inference: answers

【Question title】: Custom pandas dtype inferring
【Posted】: 2021-04-18 07:46:15
【Question description】:

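I am working with several large .csv files. Each has many different variables, and more may be added in the future.

The problem is that the way pandas infers types by default does not suit my needs. For example, a numeric variable with no value in some rows ends up being interpreted as float64, even though it is meant to be used as an integer.

For example, I would like Int64Dtype to take priority over float64, without having to write a huge dtypes dictionary by hand.

A dirty solution would be to read the .csv, check every variable with my own algorithm to build my own dtypes dictionary, and then either re-open the .csv with that dictionary or convert each variable.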
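The two-pass workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: `csv_text`, `infer_dtypes`, and the 1000-row sample size are made up for the example, and the rule "a float64 column holding only whole numbers becomes `Int64`" is one possible heuristic. If the sample misses a fractional value that appears later in the file, the second read will fail.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for one of the large .csv files.
csv_text = "id,count,price,name\n1,10,1.5,a\n2,,2.0,b\n3,30,,c\n"


def infer_dtypes(sample: pd.DataFrame) -> dict:
    """Map float64 columns that hold only whole numbers (or NaN) to 'Int64'."""
    dtypes = {}
    for col in sample.columns:
        if sample[col].dtype == "float64":
            non_null = sample[col].dropna()
            # A column with no fractional part was most likely meant to be an integer.
            if (non_null % 1 == 0).all():
                dtypes[col] = "Int64"
    return dtypes


# First pass: read only a sample of rows to guess the types.
sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)
custom_dtypes = infer_dtypes(sample)

# Second pass: read the whole file with the inferred dtypes.
df = pd.read_csv(io.StringIO(csv_text), dtype=custom_dtypes)
print(df.dtypes)
```

Columns not listed in the dictionary keep the default pandas inference, so only the overrides have to be computed.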

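I would like to know whether there is a simple way to plug in custom inference, or even just to set a different order for the dtype checks, but I have not been able to find one.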

【Question comments】:

Tags: python pandas casting type-inference

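【Solution 1】:

Isn't the dtype parameter of pandas.read_csv what you are looking for? You can specify the type of each column by passing a dictionary keyed by column name.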
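As a small illustration of the `dtype` parameter, the sketch below reads the same data twice, once with default inference and once with an explicit mapping. The CSV content and column names are invented for the example.

```python
import io
import pandas as pd

# Hypothetical data: the 'count' column has one missing value.
csv_text = "id,count\n1,10\n2,\n3,30\n"

# Default inference: the missing value forces 'count' to float64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit mapping: 'count' is read as the nullable integer dtype.
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"count": "Int64"})

print(default_df["count"].dtype)
print(typed_df["count"].dtype)
```

Only the columns named in the dictionary are affected; the rest are still inferred.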

Another approach is trial and error when converting the float64 columns:

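for col, dtype in df.dtypes.items():
    if dtype == 'float64':
        try:
            # 'Int64' is the nullable integer dtype: NaN becomes <NA>,
            # and non-integral floats raise instead of being truncated
            df[col] = df[col].astype('Int64')
        except (TypeError, ValueError):
            pass  # leave the column as float64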

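pandas itself also offers two options:

You can do the same as above with the pd.to_numeric function, which downcasts to the smallest type that can hold the given data (a column that contains missing values stays float64):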

import pandas as pd

for col in df.select_dtypes(include='number').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
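To see how `downcast='integer'` behaves when values are missing, here is a minimal sketch with two made-up Series. It suggests that this option only helps for columns without gaps, which may matter for the use case in the question.

```python
import numpy as np
import pandas as pd

complete = pd.Series([1.0, 2.0, 3.0])      # whole numbers, no gaps
with_gap = pd.Series([1.0, np.nan, 3.0])   # whole numbers, one missing value

# Downcasting succeeds only when every value fits an integer type.
complete_result = pd.to_numeric(complete, downcast="integer")
with_gap_result = pd.to_numeric(with_gap, downcast="integer")

print(complete_result.dtype)
print(with_gap_result.dtype)
```

The first Series ends up with a small NumPy integer dtype, while the second stays float64 because NaN has no NumPy integer representation.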

Similarly, you can use convert_dtypes to convert integers and floats at the same time:

for col in df.columns:
    df[col] = df[col].convert_dtypes(convert_integer=True, convert_floating=True, convert_string=False, convert_boolean=False)
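`convert_dtypes` also exists on the DataFrame itself, so the loop is optional. The sketch below applies it to a small made-up CSV read with default inference. The data and column names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical data: 'count' holds whole numbers with a gap, 'price' holds real floats.
csv_text = "id,count,price\n1,10,1.5\n2,,2.0\n3,30,\n"

df = pd.read_csv(io.StringIO(csv_text))  # 'count' is inferred as float64 here

# Convert every column at once; strings and booleans are left alone.
converted = df.convert_dtypes(convert_string=False, convert_boolean=False)

print(df.dtypes)
print(converted.dtypes)
```

Float columns whose values are all whole numbers become `Int64`, with the missing entries represented as `<NA>`. Columns with genuine fractional values are not turned into integers.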

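For the curious:

The reason pandas sticks with a float type is the possibility to represent missing values: float64 can hold NaN, whereas NumPy integer types have no missing-value marker.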
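A short sketch of that difference, using two made-up Series built from the same values:

```python
import pandas as pd

# With default inference, the missing entry is stored as NaN, which needs a float dtype.
float_based = pd.Series([1, 2, None])

# With the nullable integer dtype, the missing entry is stored as pd.NA instead.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(float_based.dtype)
print(nullable.dtype)
print(nullable.iloc[2] is pd.NA)
```

This is why asking for `Int64` explicitly avoids the fallback to float64.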

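【Comments】:

That is what I want to avoid, because I would first have to build the dictionary. If it is done by hand, it needs updating whenever new variables are added (and there are already a lot of them); if it is done algorithmically, it requires a first read of the .csv file.
@Uretki: do the other approaches fit your needs?
Yes. I was thinking of something closer to what 정성헌 suggested, changing the parser's behavior for type guessing, but in a more natural way, through some obscure parameter I might have missed while searching, rather than by changing existing code. In the end I will stick with your solution; it still does the job.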

【Solution 2】:

If you want to change the algorithm itself, you have to go to where the method is defined.

You can find that location with code:

import pandas
import inspect
import os
os.path.dirname(inspect.getfile(pandas.read_csv))

It will probably return something like ~/~/~/~/~/lib/site-packages/pandas/io

Open parsers.py there and find this code:

from pandas import Int64Dtype  # this import has to be added

def _infer_types(self, values, na_values, try_num_bool=True):
    """
    Infer types of values, possibly casting

    Parameters
    ----------
    values : ndarray
    na_values : set
    try_num_bool : bool, default try
       try to cast values to numeric (first preference) or boolean

    Returns
    -------
    converted : ndarray
    na_count : int
    """
    na_count = 0
    if issubclass(values.dtype.type, (np.number, np.bool_)):
        mask = algorithms.isin(values, list(na_values))
        na_count = mask.sum()
        if na_count > 0:
            if is_integer_dtype(values):
                values = values.astype(Int64Dtype())  # after change
                # values = values.astype(np.float64)  # before change
            np.putmask(values, mask, np.nan)
        return values, na_count

Maybe that will solve your problem.

【Comments】: