Pandas to_csv always substitutes a long numpy.ndarray with an ellipsis

【Question title】: Pandas to_csv always substitutes a long numpy.ndarray with an ellipsis
【Posted】: 2020-06-24 20:20:18
【Question description】:

While working with the to_csv() function of a DataFrame in pandas 0.14.0, I ran into an annoying problem. My DataFrame df has one column that holds long numpy arrays:

>>> df['col'][0]    
array([   0,    1,    2, ..., 9993, 9994, 9995])
>>> len(df['col'][0])
46889
>>> type(df['col'][0][0])
<class 'numpy.int64'>

If I save df with

df.to_csv('df.csv')

and open df.csv in LibreOffice, the corresponding column is displayed as:

[ 0,    1,    2, ..., 9993, 9994, 9995]

instead of listing all 46889 numbers. Is there a way to force to_csv to write out every number rather than an ellipsis?

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 4 columns):
pair          2 non-null object
ARXscore      2 non-null float64
bselect       2 non-null bool
col           2 non-null object
dtypes: bool(1), float64(1), object(2)

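【Question comments】:

What does the output of df.info() look like? Pasted output with that kind of spacing inside the array entries looks strange.
Comments do not format well here, so I edited the question to include df.info().
This is an odd way to store data. Why keep numpy arrays as objects in a column?
You are storing the arrays as strings, so the output you see is expected. If you want to write out an array, you need the actual array, not its truncated string.

Tags: python numpy pandas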

【Solution 1】:

In a sense this is a duplicate of "printing the entire numpy array", because to_csv simply asks each item in the DataFrame for its __str__, so you need to look at how the array prints:

In [11]: np.arange(10000)
Out[11]: array([   0,    1,    2, ..., 9997, 9998, 9999])

In [12]: np.arange(10000).__str__()
Out[12]: '[   0    1    2 ..., 9997 9998 9999]'

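As you can see, once the array exceeds a certain size threshold it is printed with an ellipsis. To disable that, set the threshold to NaN (this worked on the NumPy versions of the time; recent versions reject it, see Solution 2):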

np.set_printoptions(threshold='nan')

For example:

In [21]: df = pd.DataFrame([[np.arange(10000)]])

In [22]: df  # Note: pandas printing is different!!
Out[22]:
                                                   0
0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...

In [23]: s = StringIO()

In [24]: df.to_csv(s)

In [25]: s.getvalue()  # ellipsis
Out[25]: ',0\n0,"[   0    1    2 ..., 9997 9998 9999]"\n'

Once the threshold is changed, to_csv writes out the whole array:

In [26]: np.set_printoptions(threshold='nan')

In [27]: s = StringIO()

In [28]: df.to_csv(s)

In [29]: s.getvalue()  # no ellipsis (it's all there)
Out[29]: ',0\n0,"[   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14\n   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29\n   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44\n   45   46   47   48   49   50   51   52   53   54   55   56   57   58   59\n   60   61  # the whole thing is here...

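As mentioned in the comments, numpy arrays in an object column are usually not a good choice of structure for a DataFrame, because you lose much of pandas' speed/efficiency/magic.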
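If the arrays do have to live in an object column, one way to make the CSV independent of NumPy's print settings is to serialize each array explicitly before writing. The following is only a minimal sketch: the frame and the choice of JSON as the cell format are assumptions for illustration, not something the question or answer prescribes.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question: an object column of long arrays.
df = pd.DataFrame({"col": [np.arange(10000), np.arange(5)]})

# Turn each array into an explicit JSON string, so the CSV text no longer
# depends on how NumPy chooses to print the array.
out = df.copy()
out["col"] = out["col"].apply(lambda a: json.dumps(a.tolist()))

buf = io.StringIO()
out.to_csv(buf, index=False)
csv_text = buf.getvalue()

# Reading back: parse each JSON string into an array again.
back = pd.read_csv(io.StringIO(csv_text))
back["col"] = back["col"].apply(lambda s: np.array(json.loads(s)))
```

Unlike the str() output of an array, a JSON cell can be parsed back without guessing at whitespace or line breaks.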

【Comments】:

This works for the numpy print options, but my pandas df still produces the fixed ellipsis.

【Solution 2】:

np.set_printoptions(threshold='nan')

no longer works in recent versions. Use:

import sys
import numpy
numpy.set_printoptions(threshold=sys.maxsize)
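To see the effect on to_csv itself, here is a small sketch under a few assumptions: the frame is a made-up stand-in for the one in the question, `to_csv_text` is a hypothetical helper, and the `np.printoptions` context manager requires NumPy 1.15 or later. It applies the same threshold only temporarily, so the global print settings are restored afterwards.

```python
import io
import sys

import numpy as np
import pandas as pd

# Hypothetical frame: an object column holding arrays of different lengths.
df = pd.DataFrame({"col": [np.arange(10000), np.arange(5)]})


def to_csv_text(frame):
    """Write the frame to an in-memory buffer and return the CSV text."""
    buf = io.StringIO()
    frame.to_csv(buf)
    return buf.getvalue()


# With the default print threshold, the long array is summarized.
truncated = to_csv_text(df)

# Raise the threshold only for the duration of the write.
with np.printoptions(threshold=sys.maxsize):
    full = to_csv_text(df)

print("..." in truncated, "..." in full)
```

This only helps when the column still holds real arrays. If the cells were already converted to strings while truncated, the ellipsis is part of the stored text and no print option will bring the data back.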

【Comments】: