Writing over 1 million records to CSV in Python

【Question title】: Writing over 1 million records to CSV in Python
【Posted】: 2017-07-31 15:33:32
【Question】:

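I am using Python to extract some data into a CSV file, and the data is over 1 million records. My script apparently has a memory problem: after 5 hours of running and writing a little over 190k records, the process was killed.

Here is my terminal output: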

(.venv)[cv1@mdecv01 maidea]$ python common_scripts/script_tests/ben-test-extract.py BEN
Generating CSV file. Please wait ...
Preparing to write file: BEN-data-20170731.csv
Killed
(.venv)[cv1@mdecv01 maidea]$

Is there a way I can extract this data with proper memory management?

Here is my script

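【Comments】:

Can you operate on Beneficiary.objects.all() at all? Try printing it or something similar. Otherwise, if the memory problem occurs inside the for loop, try using a generator, i.e. yield.
Perhaps post your code (or a shortened version of it) in the question.
Also include your database settings.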

Tags: python django csv

【Solution 1】:

You are not taking advantage of select_related or prefetch_related. Without them, every access to a related field (ForeignKey, ManyToManyField) executes a separate database query.

for beneficiary in Beneficiary.objects.all():
    if beneficiary.is_active:
        household = beneficiary.household
        if len(beneficiary.enrolments) > 0 and len(beneficiary.interventions) > 1:

It should look like this:

for beneficiary in Beneficiary.objects.select_related(
    'household'
).prefetch_related(
    'enrolments',
    'interventions'
):
    if beneficiary.is_active:
        household = beneficiary.household
        if len(beneficiary.enrolments.all()) > 0 and len(beneficiary.interventions.all()) > 1:
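
The cost this answer describes can be illustrated without Django. The following is a minimal sketch using the standard-library sqlite3 module, with made-up tables and column names (household, beneficiary) that are assumptions for illustration only. It compares one lookup per row, which is roughly what happens when a related field is accessed lazily, with a single JOIN, which is roughly what select_related asks the database to do.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE household (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE beneficiary (
            id INTEGER PRIMARY KEY,
            name TEXT,
            household_id INTEGER REFERENCES household(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO household VALUES (?, ?)",
        [(i, f"household-{i}") for i in range(1, 4)],
    )
    conn.executemany(
        "INSERT INTO beneficiary VALUES (?, ?, ?)",
        [(i, f"person-{i}", (i % 3) + 1) for i in range(1, 7)],
    )
    conn.commit()
    return conn


def fetch_lazily(conn):
    """One query for the beneficiaries, then one more query per row."""
    queries = 1
    rows = conn.execute(
        "SELECT name, household_id FROM beneficiary ORDER BY id"
    ).fetchall()
    result = []
    for name, household_id in rows:
        household = conn.execute(
            "SELECT name FROM household WHERE id = ?", (household_id,)
        ).fetchone()
        queries += 1
        result.append((name, household[0]))
    return queries, result


def fetch_joined(conn):
    """A single query that joins both tables up front."""
    rows = conn.execute(
        """
        SELECT b.name, h.name
        FROM beneficiary AS b
        JOIN household AS h ON h.id = b.household_id
        ORDER BY b.id
        """
    ).fetchall()
    return 1, rows


if __name__ == "__main__":
    conn = build_demo_db()
    print(fetch_lazily(conn)[0], "queries without a join")
    print(fetch_joined(conn)[0], "query with a join")
```

With the six demo rows, the lazy version issues seven queries and the joined version issues one, while both return the same data. Over a million rows the same pattern means a million extra round trips.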

【Comments】:

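【Solution 2】:

Filter in the queryset instead of pulling all the data, e.g. .filter(is_active=True), and filter by count as well, e.g. annotate(interventions_count=Count('interventions')).filter(interventions_count__gte=1)
Pull the data in iterations using offset and limit (e.g. a slice such as [0:100]) instead of all at once, which keeps memory consumption lower
Use select_related and prefetch_related to preload the related tables you need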
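
The offset-and-limit idea above can be sketched in plain Python. This is only an illustration under stated assumptions: fetch_page is a hypothetical callable that returns one page of rows, and iter_in_batches and write_csv are names invented here. In Django, fetch_page might wrap something like list(queryset.order_by('pk')[offset:offset + limit]), since slicing a queryset is translated into LIMIT and OFFSET; a stable ordering is needed so that pages do not overlap.

```python
import csv
import io


def iter_in_batches(fetch_page, batch_size=1000):
    """Yield rows one at a time while fetching them page by page.

    fetch_page(offset, limit) is a hypothetical callable that returns
    a list of rows; an empty list signals that the data is exhausted.
    """
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            break
        # Only the current page is held in memory at any time.
        yield from page
        offset += batch_size


def write_csv(rows, out, header):
    """Write the header and then each row as it arrives; return the row count."""
    writer = csv.writer(out)
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in data source; a real script would query the database here.
    data = [(i, f"name-{i}") for i in range(2500)]

    def fetch_page(offset, limit):
        return data[offset:offset + limit]

    buffer = io.StringIO()
    written = write_csv(iter_in_batches(fetch_page), buffer, ["id", "name"])
    print(written, "rows written")
```

Because the generator hands rows to the writer as they are fetched, memory use is bounded by the batch size rather than by the total number of records. Large offsets can become slow on some databases, so paging on the primary key is a common alternative.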

【Comments】: