How to use pd.read_table with a StringIO file object? Answers

【Question title】: How to use pd.read_table with StringIO file object?
【Posted】: 2016-05-17 05:26:49
【Question description】:

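I looked at "read_table with stringIO and messy file", but it relies on things I can't reproduce, such as that raw object. Anyway, I want to write a table to a StringIO file object and then open that StringIO object in pandas with the read_table method, but I get EmptyDataError: No columns to parse from file. The real file I will be writing is too large to store in memory, so I want to read it in chunks; StringIO is only a test example here. By the way, I'm using Python 3.5.1.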

import numpy as np
import pandas as pd
from io import StringIO

#StringIO to write to
f = StringIO()

#Write to StringIO
dist = np.random.normal(100, 30, 10000)
for idx,s in enumerate(dist):
    f.write('{}\t{}\t{}\n'.format("label_A-%d" % idx, "label_B-%d" % idx, str(s)))

#Pandas DataFrame from it
DF = pd.read_table(f,sep="\t",header=None)
#EmptyDataError: No columns to parse from file

【Question comments】:

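You have to call f.seek(0) before reading to rewind to the beginning.
Did you know that pandas already includes functionality for reading a file in chunks (see the chunksize parameter of, e.g., read_csv)?
@BrenBarn Yes, that's what I'm going to use, but I'm building a test file to practice with, and I want to use StringIO to hold the test data.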

Tags: python pandas

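【Solution 1】:

StringIO uses a pointer to keep track of the current position in the stream. After writing all the data to the stream, call f.seek(0) to set the pointer back to the start.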

import numpy as np
import pandas as pd
from io import StringIO

#StringIO to write to
f = StringIO()

#Write to StringIO
dist = np.random.normal(100, 30, 10000)
for idx,s in enumerate(dist):
    f.write('{}\t{}\t{}\n'.format("label_A-%d" % idx, "label_B-%d" % idx, str(s)))

# rewind the stream
f.seek(0)

#Pandas DataFrame from it
DF = pd.read_table(f,sep="\t",header=None)
#No EmptyDataError this time: DF.shape is (10000, 3)
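
Since the question's end goal is chunked reading, here is a minimal sketch of how the rewound StringIO might be combined with the chunksize parameter mentioned in the comments. The names buf, chunk_shapes and total_rows, the 1000-row table and the chunk size of 250 are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd
from io import StringIO

# Build a small tab-separated test table in memory (1000 rows, assumed size).
buf = StringIO()
values = np.random.normal(100, 30, 1000)
for idx, value in enumerate(values):
    buf.write("label_A-{0}\tlabel_B-{0}\t{1}\n".format(idx, value))

# Rewind before reading; otherwise pandas starts at the end of the stream.
buf.seek(0)

# With chunksize set, read_table returns an iterator of DataFrames
# instead of a single DataFrame.
chunk_shapes = []
total_rows = 0
for chunk in pd.read_table(buf, sep="\t", header=None, chunksize=250):
    chunk_shapes.append(chunk.shape)
    total_rows += len(chunk)

print(total_rows)
print(chunk_shapes)
```

Each chunk is an ordinary DataFrame, so per-chunk processing can replace the two bookkeeping lines inside the loop.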

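【Comments】:

Wow, that's weird. Why does it need to be "rewound"?
@O.rka, I guess it's because StringIO uses the same pointer for the next read position and the next insertion position. So when you write, the pointer ends up at the end of the content (where the next line would be appended). If you then try to read, everything before the current pointer is assumed to have been read already. I agree this isn't very intuitive.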
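
The shared read/write position described in this comment can be observed directly with a small sketch like the following. The variable names and the two-line sample content are made up for illustration.

```python
from io import StringIO

f = StringIO()
f.write("a\tb\n")
f.write("c\td\n")

# After writing, the position sits at the end of the written content.
pos_after_write = f.tell()

# Reading from the end returns an empty string: nothing lies after the position.
read_at_end = f.read()

# Move the position back to the start, then read again.
f.seek(0)
read_after_seek = f.read()

# getvalue() returns the whole buffer regardless of the current position.
whole_buffer = f.getvalue()

print(repr(read_at_end))
print(repr(read_after_seek))
```

The empty first read is the same situation pandas runs into when it is handed a stream that has not been rewound, which is consistent with the EmptyDataError in the question.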