How do I parse an extremely large file in Python? [duplicate]

【Question title】: How do I parse an extremely large file in Python? [duplicate]
【Posted】: 2018-12-01 16:38:33
【Question】:

I have a log file of about 10GB, "internet.log". When I parse it in Python, I get a "MemoryError" exception. The log file looks like this:

Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: query[A] fd-geoycpi-uno.gycpi.b.yahoodns.net from 192.168.1.33
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:23 dnsmasq[1979]: query[A] armdl.adobe.com from 192.168.1.24

I am currently using this approach to parse the log file:

def parse():
    Date = []
    IPAddress = []
    DomainsVisited = []
    with open("internet.log", "r") as file:
        content = file.readlines()
        for items in content:
            if 'query[A]' in items:
                getDate(Date, items)
                getIPAddress(IPAddress, items)
                getDomainsVisited(DomainsVisited, items)
    finalResult = [[i, j, k] for i, j, k in zip(Date, IPAddress, DomainsVisited)]
    return display(finalResult)

If I parse a log file of about 10MB, the output is displayed, but when I try to parse the 10GB log file, I get the error. How can I solve this? Thanks.

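【Comments】:

Well, you are reading the whole file into memory with file.readlines(). Writing for items in file: will read one line at a time.
The rest of your code doesn't look right. E.g. for every item you are clobbering Date rather than appending to a list.
@PeterWood Sorry, I'll change it.
@PeterWood for items in file doesn't work either. I get this message in the Python console: "Process finished with exit code 247"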

Tags: python python-3.x parsing

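【Solution 1】:

You are reading the whole file into memory with readlines.

You can write for items in file to read one line at a time.

Tidying up your code a little, with better variable names and a list comprehension to build the result: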

with open("internet.log") as log:
    finalResults = [[getDate(line), getIPAddress(line), getDomainsVisited(line)]
                    for line in log
                    if 'query[A]' in line]

I would extract the per-line result into a function:

def parse_log_line(line):
    return [getDate(line),
            getIPAddress(line),
            getDomainsVisited(line)]

Then your code would be:

with open("internet.log") as log:
    finalResults = [parse_log_line(line)
                    for line in log
                    if 'query[A]' in line]
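
Note that the list comprehension above still holds every parsed row in memory at once, which may matter for a 10GB log. A minimal sketch of a generator-based variant follows. The question does not show getDate, getIPAddress, or getDomainsVisited, so the field splitting in parse_query_line is an assumption based on the sample log lines, and both function names here are hypothetical.

```python
def parse_query_line(line):
    """Split a dnsmasq 'query[A]' line into [date, ip, domain].

    Assumed layout (taken from the sample in the question):
    'Jun 15 16:26:21 dnsmasq[1979]: query[A] <domain> from <ip>'
    """
    parts = line.split()
    date = " ".join(parts[:3])   # e.g. 'Jun 15 16:26:21'
    domain = parts[5]            # token right after 'query[A]'
    ip = parts[7]                # token right after 'from'
    return [date, ip, domain]


def iter_queries(lines):
    """Yield one parsed row at a time instead of building a full list."""
    for line in lines:
        if "query[A]" in line:
            yield parse_query_line(line)


# Small in-memory demonstration; an open file object can be passed the same way.
sample_lines = [
    "Jun 15 16:26:21 dnsmasq[1979]: cached example.net is 74.6.160.107\n",
    "Jun 15 16:26:23 dnsmasq[1979]: query[A] armdl.adobe.com from 192.168.1.24\n",
]
for row in iter_queries(sample_lines):
    print(row)
```

With a real log, iter_queries(log) inside the with open("internet.log") as log: block would yield rows lazily, so memory use depends on what the consumer keeps rather than on the file size.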

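【Comments】:

【Solution 2】:

You should not use file.readlines(). It reads the entire file into memory at once, which will most likely fill it up immediately. Instead, iterate over the file: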

with open("internet.log", "r") as file:
    for items in file:
        pass  # handle one line at a time here

(Of course, depending on what you do with the data, this may still break as you work through the file.)
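
One way to keep memory bounded regardless of input size is to write each row out as soon as it is parsed rather than collecting rows in lists. The following is a sketch under stated assumptions, not the asker's actual code: export_queries and the CSV column names are hypothetical, and the field positions assume the line layout shown in the question's sample.

```python
import csv


def export_queries(log_path, csv_path):
    """Stream 'query[A]' rows from log_path into csv_path; return the row count."""
    count = 0
    with open(log_path, "r") as log, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "ip", "domain"])
        for line in log:                  # only one line is held at a time
            if "query[A]" not in line:
                continue
            parts = line.split()
            if len(parts) < 8:            # skip malformed or truncated lines
                continue
            # Assumed layout:
            # 'Mon DD HH:MM:SS dnsmasq[pid]: query[A] <domain> from <ip>'
            writer.writerow([" ".join(parts[:3]), parts[7], parts[5]])
            count += 1
    return count
```

Calling export_queries("internet.log", "queries.csv") would then produce a file that can be inspected or processed in further passes, without the full result ever living in memory.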

【Comments】:

Didn't work :( I get this message in the Python console: "Process finished with exit code 247"