Answers: Read the last N lines of a CSV file in Python with numpy / pandas

【Question title】: Read the last N lines of a CSV file in Python with numpy / pandas
【Posted】: 2016-12-06 21:27:28
【Question】:

Is there a fast way to read the last N lines of a CSV file in Python, using numpy or pandas?

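I cannot use skip_header in numpy or skiprows in pandas, because the files vary in length and I always need the last N lines.
I know I can use pure Python to read the file line by line starting from the last line, but that would be very slow. I can do that if I have to, but a more efficient way with numpy or pandas (which essentially run in C) would be much appreciated.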

Tags: python csv pandas numpy

【Solution 1】:

With a small 10-line test file I tried two approaches: parse the whole file and select the last N rows, versus load all the lines but parse only the last N:

In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 µs per loop

In [1026]: %%timeit 
      ...: with open('stack38704949.txt','rb') as f:
      ...:      lines = f.readlines()
      ...: np.genfromtxt(lines[-5:],delimiter=',')

1000 loops, best of 3: 378 µs per loop
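
To reproduce the comparison outside IPython, here is a minimal self-contained sketch. The sample file, its contents, and the two helper names are made up for illustration; only `np.genfromtxt` and the two strategies come from the timings above. It opens the file in text mode rather than `'rb'`, which should not matter for plain ASCII data.

```python
import os
import tempfile

import numpy as np


def last_rows_parse_all(path, n):
    """Parse the whole file, then keep the last n rows."""
    return np.genfromtxt(path, delimiter=',')[-n:]


def last_rows_parse_tail(path, n):
    """Read all lines, but hand only the last n to the parser."""
    with open(path, 'r') as f:
        lines = f.readlines()
    return np.genfromtxt(lines[-n:], delimiter=',')


# Hypothetical 10-line test file: rows "0,0,0" through "9,18,27".
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i},{2 * i},{3 * i}\n")
    sample_path = tmp.name

a = last_rows_parse_all(sample_path, 5)
b = last_rows_parse_tail(sample_path, 5)
os.remove(sample_path)
```

Both calls should give the same 5x3 array; the second one simply avoids converting rows that are thrown away afterwards.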

This was marked as a duplicate of "Efficiently Read last 'n' rows of CSV into DataFrame". The accepted answer there used

from collections import deque

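and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes its input from anything that yields lines, so a list of lines works fine.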

In [1031]: %%timeit 
      ...: with open('stack38704949.txt','rb') as f:
      ...:      lines = deque(f,5)
      ...: np.genfromtxt(lines,delimiter=',') 

1000 loops, best of 3: 382 µs per loop

Basically the same as readlines plus a slice.

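deque may have an advantage when the file is very large and hanging on to all of its lines gets costly. I don't think it saves any file-reading time, though: the lines still have to be read one by one.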
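
A sketch of the deque variant as a reusable function, to make the memory point concrete. The function name `tail_genfromtxt` and the 1000-row sample file are hypothetical; the technique is the `deque(f, N)` call timed above.

```python
import os
import tempfile
from collections import deque

import numpy as np


def tail_genfromtxt(path, n, delimiter=','):
    """Parse only the last n lines of path, holding at most n lines in memory."""
    with open(path, 'r') as f:
        # A bounded deque drops the oldest line each time a new one arrives,
        # so the whole file is still read, but never stored all at once.
        last_lines = deque(f, maxlen=n)
    return np.genfromtxt(last_lines, delimiter=delimiter)


# Hypothetical sample: 1000 rows of "i,i*i".
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    for i in range(1000):
        tmp.write(f"{i},{i * i}\n")
    big_path = tmp.name

tail = tail_genfromtxt(big_path, 3)
os.remove(big_path)
```

Compared with `readlines()[-n:]`, the peak memory here is bounded by n lines rather than by the file size; the read time should be about the same.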

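The approach of computing row_count and then using skip_header times slower; it has to read the file twice, because skip_header still has to read the lines it skips.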

In [1046]: %%timeit 
      ...: with open('stack38704949.txt',"r") as f:
      ...:     reader = csv.reader(f,delimiter = ",")
      ...:     data = list(reader)
      ...:     row_count = len(data)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')

The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 µs per loop

To count the lines we don't need csv.reader, though it doesn't seem to cost much extra time.

In [1048]: %%timeit 
      ...: with open('stack38704949.txt',"r") as f:
      ...:    lines=f.readlines()
      ...:    row_count = len(lines)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')

1000 loops, best of 3: 736 µs per loop
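
For completeness, a sketch of the count-then-skip approach that does not keep the lines around while counting. The helper name `last_rows_skip_header` and the sample file are hypothetical; `skip_header` is the genfromtxt parameter used in the timings above. It still reads the file twice, so I would not expect it to beat the slice or deque versions.

```python
import os
import tempfile

import numpy as np


def last_rows_skip_header(path, n, delimiter=','):
    """Count lines in a first pass, then let genfromtxt skip all but the last n."""
    with open(path, 'r') as f:
        row_count = sum(1 for _ in f)  # counts lines without storing them
    skip = max(row_count - n, 0)       # guard for files shorter than n lines
    return np.genfromtxt(path, skip_header=skip, delimiter=delimiter)


# Hypothetical 10-line test file: rows "0,0,0" through "9,18,27".
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i},{2 * i},{3 * i}\n")
    count_path = tmp.name

last5 = last_rows_skip_header(count_path, 5)
everything = last_rows_skip_header(count_path, 20)
os.remove(count_path)
```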

【Solution 2】:

Option 1

You can read the whole file with numpy.genfromtxt, get it as a numpy array, and take the last N rows:

import numpy as np

a = np.genfromtxt('filename', delimiter=',')  # parses every row
lastN = a[-N:]                                # keep only the last N

Option 2

You can do something similar with ordinary file reading:

with open('filename') as f:
    lastN = list(f)[-N:]

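But this time you get the last N lines as a list of strings.

Option 3 - without reading the whole file into memory

We use a list of at most N items to hold the last N lines seen so far at each iteration: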

lines = []
N = 10
with open('csv01.txt') as f:
    for line in f:
        lines.append(line)
        if len(lines) > N:
            lines.pop(0)

A real csv needs a small change:

import csv
...
with ...
    for line in csv.reader(f):
    ...
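
Spelled out, the csv variant of Option 3 might look like the sketch below. The function name `last_csv_rows` and the sample data are hypothetical, chosen so that a quoted field contains a comma, which is where `csv.reader` differs from splitting raw lines.

```python
import csv
import os
import tempfile


def last_csv_rows(path, n):
    """Return the last n parsed rows, keeping at most n rows in memory."""
    rows = []
    with open(path, newline='') as f:
        for row in csv.reader(f):
            rows.append(row)
            if len(rows) > n:
                rows.pop(0)  # drop the oldest row once we hold more than n
    return rows


# Hypothetical sample: 20 rows whose second field contains a comma.
with tempfile.NamedTemporaryFile('w', newline='', suffix='.csv',
                                 delete=False) as tmp:
    writer = csv.writer(tmp)
    for i in range(20):
        writer.writerow([i, f"item {i}, extra"])
    csv_path = tmp.name

last3 = last_csv_rows(csv_path, 3)
os.remove(csv_path)
```

Each element of the result is a list of strings, so numeric columns still need converting. For large N, `collections.deque(csv.reader(f), maxlen=n)` would avoid the linear cost of `pop(0)`.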

【Solution 3】:

Use the skiprows parameter of pandas read_csv(). The harder part is finding the number of rows in the csv. Here is a possible solution:

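import csv
import pandas as pd

with open('filename', "r") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)

# Note: with the default header handling, the first line left after skipping
# is taken as the column header; pass header=None if the file has no header row.
df = pd.read_csv('filename', skiprows=row_count - N)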
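
If the file has a single header line that should be kept, one possible variation is to pass a list-like to `skiprows` so that line 0 survives. This is a sketch under that assumption, not part of the answer above; the name `read_last_rows` and the sample file are hypothetical, and it counts physical lines, so it assumes no quoted field spans several lines.

```python
import os
import tempfile

import pandas as pd


def read_last_rows(path, n):
    """Read the header plus the last n data rows of a CSV with one header line."""
    with open(path, 'r') as f:
        row_count = sum(1 for _ in f)  # physical lines, header included
    # Line 0 is the header; skip lines 1 .. row_count - n - 1.
    return pd.read_csv(path, skiprows=range(1, row_count - n))


# Hypothetical sample: header "x,y" followed by 100 data rows.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("x,y\n")
    for i in range(100):
        tmp.write(f"{i},{2 * i}\n")
    pd_path = tmp.name

df_tail = read_last_rows(pd_path, 4)
os.remove(pd_path)
```

The resulting frame keeps the column names `x` and `y` and gets a fresh 0-based index rather than the original row positions.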
