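I know this is a year late, but the issue is still open, and I'm surprised that json.JSONEncoder.iterencode() has not been mentioned.
The potential problem with iterencode in this case is that you would want to process a large data set iteratively with a generator, and the json encoder does not serialize generators.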
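To see the limitation concretely, here is a minimal sketch (Python 3 assumed; `numbers` is a hypothetical generator standing in for a large data source). Passing a bare generator to iterencode() fails with a TypeError as soon as the chunks are consumed:

```python
import json


def numbers():
    """Hypothetical generator standing in for a large data source."""
    yield from range(3)


error = None
try:
    # iterencode() is lazy, so the error only surfaces once we iterate it.
    chunks = list(json.JSONEncoder().iterencode(numbers()))
except TypeError as exc:
    error = exc
    print("Cannot serialize generator:", exc)
```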
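The way around this is to subclass the list type and override the __iter__ magic method so that it yields the output of the generator.
Here is an example of such a list subclass.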
class StreamArray(list):
    """
    Wraps a generator in a list subclass so that it is JSON serialisable
    while still retaining the iterative nature of a generator.

    I.e. it behaves like a list without having to exhaust the generator
    and keep its contents in memory.
    """

    def __init__(self, generator):
        self.generator = generator
        # Start non-zero so the encoder does not treat the list as empty.
        self._len = 1

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        """
        The JSON encoder checks whether the list is empty before iterating
        it; a list that reports itself as empty is written as [] without
        being iterated.
        """
        return self._len
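From here on it is simple to use. Get the generator handle, pass it into the StreamArray class, pass the stream array object into iterencode(), and iterate over the chunks. The chunks are JSON-formatted output that can be written directly to a file.
Example usage: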
import json

# Function that will iteratively generate a large set of data.
def large_list_generator_func():
    for i in range(5):
        chunk = {'hello_world': i}
        print('Yielding chunk:', chunk)
        yield chunk

# Write the contents to file:
with open('/tmp/streamed_write.json', 'w') as outfile:
    large_generator_handle = large_list_generator_func()
    stream_array = StreamArray(large_generator_handle)
    # Each chunk is a small piece of JSON text (bracket, key, separator, value).
    for chunk in json.JSONEncoder().iterencode(stream_array):
        print('Writing chunk:', chunk)
        outfile.write(chunk)
The output shows that the yields and the writes are interleaved: each item is written as soon as it is generated.
Yielding chunk: {'hello_world': 0}
Writing chunk: [
Writing chunk: {
Writing chunk: "hello_world"
Writing chunk: :
Writing chunk: 0
Writing chunk: }
Yielding chunk: {'hello_world': 1}
Writing chunk: ,
Writing chunk: {
Writing chunk: "hello_world"
Writing chunk: :
Writing chunk: 1
Writing chunk: }
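As a sanity check, the streamed chunks can be joined and parsed back to confirm that they form valid JSON. This is only a sketch under a few assumptions: it targets Python 3, `records` is a hypothetical stand-in for the real data source, and an in-memory io.StringIO buffer replaces the output file. The StreamArray class is repeated so the snippet runs on its own.

```python
import io
import json


class StreamArray(list):
    """List subclass that iterates over a wrapped generator."""

    def __init__(self, generator):
        super().__init__()
        self.generator = generator
        self._len = 1  # non-zero so the encoder does not treat it as empty

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        return self._len


def records(n):
    """Hypothetical data source; yields one small dict at a time."""
    for i in range(n):
        yield {"hello_world": i}


# Stream the chunks into an in-memory buffer instead of a file.
buffer = io.StringIO()
for chunk in json.JSONEncoder().iterencode(StreamArray(records(5))):
    buffer.write(chunk)

# Parse the accumulated text back to check that it is well-formed JSON.
parsed = json.loads(buffer.getvalue())
print(parsed)
```

The same loop works with a real file handle in place of the buffer; only the round-trip check needs the text to be read back.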