Specifying dtype float32 with pandas.read_csv on pandas 0.10.1

【Question title】: Specifying dtype float32 with pandas.read_csv on pandas 0.10.1
【Posted】: 2013-02-19 02:18:34
【Question】:

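I'm trying to read a simple space-separated file with the pandas read_csv method. However, pandas doesn't seem to obey my dtype argument. Maybe I'm specifying it incorrectly?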

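I've distilled my somewhat complicated call to read_csv down to this simple test case. I'm actually using the converters argument in my "real" scenario, but I removed it here for simplicity.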

Below is my ipython session:

>>> cat test.out
a b
0.76398 0.81394
0.32136 0.91063
>>> import pandas
>>> import numpy
>>> x = pandas.read_csv('test.out', dtype={'a': numpy.float32}, delim_whitespace=True)
>>> x
         a        b
0  0.76398  0.81394
1  0.32136  0.91063
>>> x.a.dtype
dtype('float64')

I have also tried this with a dtype of numpy.int32 or numpy.int64. Those choices result in an exception:

AttributeError: 'NoneType' object has no attribute 'dtype'

I'm assuming the AttributeError occurs because pandas will not automatically try to convert/truncate the float values to integers?

I'm running a 32-bit version of Python on a 32-bit machine.

>>> !uname -a
Linux ubuntu 3.0.0-13-generic #22-Ubuntu SMP Wed Nov 2 13:25:36 UTC 2011 i686 i686 i386 GNU/Linux
>>> import platform
>>> platform.architecture()
('32bit', 'ELF')
>>> pandas.__version__
'0.10.1'

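【Comments】:

I think this is similar to this issue on github...
@AndyHayden I think you are right. The AttributeError problem is exactly what the github issue describes. However, in my other scenario the values are floats, and pandas still does not obey the dtype argument when I ask for float32 instead of float64, etc.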

Tags: python pandas numpy

【Solution 1】:

0.10.1 doesn't really support float32 very well.

See http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#dtype-specification

In 0.11 you can do it like this:

# don't use dtype converters explicitly for the columns you care about;
# they will be converted to float64 if possible, or to object if they cannot be
df = pd.read_csv('test.csv'.....)

#### this is optional and related to the issue you posted ####
# force anything that is not numeric to NaN
# columns is the list of columns that you are interested in
df[columns] = df[columns].convert_objects(convert_numeric=True)

# astype
df[columns] = df[columns].astype('float32')

See http://pandas.pydata.org/pandas-docs/dev/basics.html#object-conversion

It's not as efficient as doing it directly in read_csv (but that requires some low-level changes).
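For anyone reading this on a recent pandas, here is a minimal sketch of the same "read first, downcast afterwards" idea. The inline sample data, `io.StringIO`, and `sep=r"\s+"` are my assumptions for a self-contained example (newer pandas releases deprecate `delim_whitespace`), and `DataFrame.astype` with a dict stands in for the column-slice assignment above.

```python
import io

import numpy as np
import pandas as pd

# Same two-column sample as test.out in the question (inlined for the example).
data = "a b\n0.76398 0.81394\n0.32136 0.91063\n"

# Let the parser infer dtypes first; floating-point columns come back as float64.
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Then downcast only the columns of interest; other columns are left alone.
columns = ["a"]
df = df.astype({col: np.float32 for col in columns})

print(df.dtypes)
```

The cost is the one mentioned above: the column is materialized as float64 first and then copied to float32.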

I have confirmed that with 0.11-dev this DOES work (on both 32-bit and 64-bit, with the same results):

In [5]: x = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float32}, delim_whitespace=True)

In [6]: x
Out[6]: 
         a        b
0  0.76398  0.81394
1  0.32136  0.91063

In [7]: x.dtypes
Out[7]: 
a    float32
b    float64
dtype: object

In [8]: pd.__version__
Out[8]: '0.11.0.dev-385ff82'

In [9]: quit()
vagrant@precise32:~/pandas$ uname -a
Linux precise32 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686 i686 i386 GNU/Linux
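A rough Python 3 equivalent of the session above, for checking the behaviour on whatever pandas you have installed. This is a sketch under assumptions: `io.StringIO` replaces the Python 2 `StringIO.StringIO`, `sep=r"\s+"` replaces `delim_whitespace=True` (deprecated in newer pandas), and `data` is assumed to hold the contents of test.out from the question.

```python
import io

import numpy as np
import pandas as pd

# Assumed to match test.out from the question.
data = "a b\n0.76398 0.81394\n0.32136 0.91063\n"

# Ask the parser for float32 on column 'a' only; 'b' is left to inference.
x = pd.read_csv(io.StringIO(data), dtype={"a": np.float32}, sep=r"\s+")

print(x.dtypes)
```

If the dtype argument is honoured, `a` should report float32 while `b` stays at the default float64.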

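【Comments】:

Is astype or convert_objects the preferred way?
If you need a specific dtype, use astype; convert_objects is better suited to converting from object dtypes (and is not as necessary as in prior versions).
So is this considered a bug in pandas? It seems a bit deceptive that I can pass dtype and get neither what I asked for nor an error.
See my answer; I found a bug in 0.10.1.
+1 for .convert_objects(convert_numeric=True), which solved my problem of having a data frame with mixed dtypes and wanting some of the columns parsed as floats.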
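Note for later readers: `convert_objects` was deprecated and then removed in later pandas releases. The usual substitute for `convert_numeric=True` is `pd.to_numeric(..., errors="coerce")`, applied per column. A minimal sketch, with made-up sample values and a hypothetical `columns` list:

```python
import pandas as pd

# Hypothetical mixed data: column 'a' is mostly numeric text with one bad value.
df = pd.DataFrame({"a": ["0.76398", "oops", "0.32136"], "b": ["x", "y", "z"]})

# Only the columns we want parsed as numbers.
columns = ["a"]
for col in columns:
    # Anything that cannot be parsed as a number becomes NaN.
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df)
```

After this, `astype('float32')` can be applied to the same columns if a narrower dtype is needed.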

【Solution 2】:

In [22]: df.a.dtype = pd.np.float32

In [23]: df.a.dtype
Out[23]: dtype('float32')

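The above works fine for me under pandas 0.10.1.

【Comments】:

FYI, this is in place (which is implicit) and is not safe for non-float data.
@Jeff Yes, this is an in-place cast and is not safe for non-float values.
df = pd.read_csv('sample.out', converters={'a':lambda x: pd.np.float32(x)}, delim_whitespace=True) doesn't seem to work either.
I like that it is in place, so it is probably a bit better in memory and speed. However, convert_objects will set NaN if you pass convert_numeric=True, whereas this approach might raise an exception or cause other problems if the conversion cannot be done. I haven't looked into the details much, though.
That's the point of convert_numeric=True: removing "nasty" values from otherwise numeric columns.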
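To illustrate why people are wary of in-place dtype changes, here is a plain NumPy sketch of the difference between converting values (`astype`) and reinterpreting the same bytes as another dtype (`view`). This is NumPy only; whether `df.a.dtype = ...` on pandas 0.10.1 takes exactly the reinterpreting path is an assumption I have not verified.

```python
import numpy as np

# Column 'a' from the question, stored as float64 (8 bytes per value).
values = np.array([0.76398, 0.32136], dtype=np.float64)

# astype allocates a new array and rounds each value to float32.
converted = values.astype(np.float32)

# view keeps the same buffer and merely reads it as 4-byte floats,
# so two float64 values turn into four unrelated float32 values.
reinterpreted = values.view(np.float32)

print(converted.shape, reinterpreted.shape)
```

The converted array still has two elements close to the originals; the reinterpreted one has four elements that share memory with `values` and do not represent the original numbers.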