How to cut the 2nd and 3rd columns out of a text file? Python answers

【Question title】: How to cut 2nd and 3rd column out of a textfile? python
【Posted】: 2014-06-06 23:39:30
【Question description】:

I have a tab-delimited file that contains lines like this:

foo bar bar <tab>x y z<tab>a foo foo
...

Imagine 1,000,000 lines, with up to 200 words per line. Each word is 5-6 characters long on average.

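To get the 2nd and 3rd columns, I can do this:

with open('test.txt','r') as infile:
  # rstrip('\n') keeps the line break out of the 3rd column when it is the last one
  column23 = [i.rstrip('\n').split('\t')[1:3] for i in infile]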

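Or I could use unix, as in "How can i get 2nd and third column in tab delim file in bash?":

import os
# cut selects fields 2-3; each output line is then split into its two columns
column23 = [i.rstrip('\n').split('\t') for i in os.popen('cut -f 2-3 test.txt')]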

Which is faster? Is there any other way to extract the 2nd and 3rd columns?

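【Comments】:

Avoid building a list; it uses a lot of memory.
Why split in the last example? I think cut would be faster, but you should run a benchmark with smaller test data.
Do you have a test file we can use to see which solution is fastest?
You can use the timeit module to time your code.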
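
The comments above suggest avoiding a full list and timing the candidates with timeit. A minimal sketch of both ideas follows. The helper names (iter_columns, list_columns, make_sample) and the 1,000-line sample file are hypothetical, made up for illustration, and the timings it prints depend entirely on the machine and on the real data.

```python
import os
import tempfile
import timeit


def iter_columns(path):
    """Yield [col2, col3] per line, so only one line is in memory at a time."""
    with open(path, 'r') as infile:
        for line in infile:
            yield line.rstrip('\n').split('\t')[1:3]


def list_columns(path):
    """Build the whole result as a list, as in the question."""
    with open(path, 'r') as infile:
        return [line.rstrip('\n').split('\t')[1:3] for line in infile]


def make_sample(n_lines=1000):
    """Write a small, made-up test file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w') as out:
        for _ in range(n_lines):
            out.write('foo bar bar\tx y z\ta foo foo\n')
    return path


sample_path = make_sample()
try:
    # Consume the generator fully so both variants do the same parsing work.
    lazy_time = timeit.timeit(
        lambda: sum(1 for _ in iter_columns(sample_path)), number=20)
    list_time = timeit.timeit(
        lambda: list_columns(sample_path), number=20)
    first_row = list_columns(sample_path)[0]
    row_count = sum(1 for _ in iter_columns(sample_path))
finally:
    os.remove(sample_path)

print('generator: %.4f s, list: %.4f s' % (lazy_time, list_time))
```

The generator version mainly saves memory rather than time; whether it is also faster is something only a run against a representative file can show.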

Tags: python bash cut csv

【Solution 1】:

Use neither. Unless it proves to be too slow, use the csv module; it is much more readable.

import csv
# newline='' is what the csv docs recommend for files opened in Python 3
with open('test.txt', 'r', newline='') as infile:
    column23 = [cols[1:3] for cols in csv.reader(infile, delimiter="\t")]
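
One caveat when using csv on free text: by default csv.reader treats a double quote at the start of a field as the beginning of a quoted field, which can swallow tabs and line breaks. The sketch below, on a made-up in-memory sample, shows how passing quoting=csv.QUOTE_NONE keeps quotes as ordinary characters. Whether the real file contains such quotes is an assumption to check.

```python
import csv
import io

# Made-up sample line: the 2nd column starts with a stray double quote.
sample = 'foo\t"x y\tz\n'

# Default dialect: the quote opens a quoted field that absorbs the next tab.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quotes are plain characters, so tabs always separate columns.
plain_rows = list(
    csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

column23 = [cols[1:3] for cols in plain_rows]
print(column23)
```

If the file is known to contain no double quotes, the default settings behave the same as QUOTE_NONE for this task.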

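【Discussion】:

【Solution 2】:

If each line can have hundreds of tab-separated entries and you only want the second and third, you don't need to split all of them; you can use the maxsplit argument to speed things up: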

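with open('test.txt','r') as infile:
    # maxsplit=3 stops after the 3rd tab; rstrip('\n') covers lines with exactly 3 columns
    column23 = [i.rstrip('\n').split('\t', 3)[1:3] for i in infile]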
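
To see why the limit is 3 rather than 2, here is a small sketch on a made-up six-column line. With maxsplit=3 the unsplit remainder lands in the fourth piece, so the slice [1:3] is unaffected; with maxsplit=2 the remainder would end up inside the third column.

```python
# Made-up sample line with six tab-separated columns.
line = 'c1\tc2\tc3\tc4\tc5\tc6\n'

full = line.split('\t')        # splits at every tab
limited = line.split('\t', 3)  # at most 3 splits, so at most 4 pieces
too_few = line.split('\t', 2)  # at most 2 splits, so at most 3 pieces

print(limited)       # the 4th piece holds the unsplit remainder of the line
print(limited[1:3])  # columns 2 and 3, same as full[1:3]
print(too_few[1:3])  # column 3 is polluted with the rest of the line
```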

Who knows, maybe a clever regular expression would be even faster:

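import re
# Skip the 1st column, then capture the 2nd and 3rd; MULTILINE lets ^ match at every line start.
regex = re.compile(r"^[^\t\n]*\t([^\t\n]*)\t([^\t\n]*)", re.MULTILINE)
with open('test.txt','r') as infile:
    # Note: read() loads the whole file into memory; findall returns a list of (col2, col3) tuples.
    column23 = regex.findall(infile.read())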
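
A minimal sketch of what this pattern returns, run on a made-up in-memory sample instead of test.txt. It also shows a hypothetical line-by-line variant (columns_from_lines, not part of the original answer) that avoids reading the whole file at once. Note that lines with fewer than three columns are silently skipped by both variants.

```python
import re

COLUMNS_2_3 = re.compile(r"^[^\t\n]*\t([^\t\n]*)\t([^\t\n]*)", re.MULTILINE)

# Made-up sample: a 3-column line, a 4-column line, and a 1-column line.
sample = "foo bar bar\tx y z\ta foo foo\nc1\tc2\tc3\tc4\nonly one column\n"

# Whole-text variant, as in the answer: one tuple per matching line.
pairs = COLUMNS_2_3.findall(sample)


def columns_from_lines(lines):
    """Yield (col2, col3) for each line that has at least three columns."""
    for line in lines:
        match = COLUMNS_2_3.match(line)
        if match:
            yield match.groups()


# Line-by-line variant; a file object could be passed instead of this list.
streamed = list(columns_from_lines(sample.splitlines(True)))

print(pairs)
```

Whether either variant beats split with maxsplit is not something this sketch settles; it would need to be timed on real data.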

【Discussion】: