Efficient way of reading integers from a file: answers

【Question title】: Efficient way of reading integers from file
【Posted】: 2015-07-31 09:09:29
【Question】:

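I want to read all the integers in a file into a list. The numbers are separated by spaces (one or more) or end-of-line characters (one or more). What is the most efficient and/or elegant way to do this? I have two solutions, but I don't know whether they are any good.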

Checking for digits:

for line in open("foo.txt", "r"):
    for i in line.strip().split(' '):
        if i.isdigit():
            my_list.append(int(i))

Handling exceptions:

for line in open("foo.txt", "r"):
    for i in line:
        try:
            my_list.append(int(i))
        except ValueError:
            pass

Sample data:

1   2     3
 4 56
    789         
9          91 56   

 10 
11

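【Comments】:

I would probably do it like this: with open('foo.txt') as f: my_list = [int(i) for i in f if i.isdigit()]
@user3100115, that won't work because of the trailing newline.
I prefer #2 over #1: int() validates the string you give it anyway, so validating it yourself before calling int() is just a waste of time.
...except that #2 doesn't actually do the same thing as #1. Look at it again: it iterates over every character in each line and tries to add it to the list.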
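
To make that difference concrete, here is a small sketch that runs both loops over the same in-memory lines instead of a file. The `lines` list and the two function names are illustrative assumptions, not part of the question.

```python
def tokens_with_isdigit(lines):
    """Approach #1: split each line into tokens and keep the all-digit ones."""
    result = []
    for line in lines:
        for token in line.strip().split(' '):
            if token.isdigit():
                result.append(int(token))
    return result


def chars_with_try(lines):
    """Approach #2 as posted: iterates over single characters, not tokens."""
    result = []
    for line in lines:
        for ch in line:
            try:
                result.append(int(ch))
            except ValueError:
                pass
    return result


lines = [" 4 56\n", "    789\n"]
print(tokens_with_isdigit(lines))  # [4, 56, 789]
print(chars_with_try(lines))       # [4, 5, 6, 7, 8, 9]
```

Multi-digit numbers such as 56 and 789 are broken into single digits by the second loop, so the two approaches only agree when every number has one digit.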

Tags: python

【Solution 1】:

An efficient way would be your first method, with a small change so that the file is opened with a with statement. Example:

with open("foo.txt", "r") as f:
    for line in f:
        for i in line.split():
            if i.isdigit():
                my_list.append(int(i))

Timing tests comparing it with the other methods:

Functions:

import csv  # needed by func3()

def func1():
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    return my_list

def func1_1():
    return [int(i) for line in open("foo.txt", "r") for i in line.strip().split(' ') if i.isdigit()]

def func1_3():
    my_list = []
    with open("foo.txt", "r") as f:
        for line in f:
            for i in line.split():
                if i.isdigit():
                    my_list.append(int(i))
    return my_list

def func2():            
    my_list = []            
    for line in open("foo.txt", "r"):
        for i in line.split():
            try:
                my_list.append(int(i))
            except ValueError:
                pass
    return my_list

def func3():
    my_list = []
    with open("foo.txt","r") as f:
        cf = csv.reader(f, delimiter=' ')
        for row in cf:
            my_list.extend([int(i) for i in row if i.isdigit()])
    return my_list

Timing results:

In [25]: timeit func1()
The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 204 µs per loop

In [26]: timeit func1_1()
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 207 µs per loop

In [27]: timeit func1_3()
The slowest run took 5.46 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 191 µs per loop

In [28]: timeit func2()
The slowest run took 4.09 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 µs per loop

In [34]: timeit func3()
The slowest run took 4.38 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 202 µs per loop

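Among the methods that store the data in a list, I believe func1_3() above is the fastest (as the timeit results show).

That said, if you are really dealing with very large files, you would be better off using a generator instead of storing the complete list in memory.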
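
A minimal sketch of the generator idea, assuming the caller only needs to consume the numbers once (here, to sum them). The name `iter_ints` is hypothetical, and `io.StringIO` stands in for an open file so the example is self-contained.

```python
import io


def iter_ints(lines):
    """Lazily yield every all-digit token found in an iterable of lines."""
    for line in lines:
        for token in line.split():
            if token.isdigit():
                yield int(token)


# io.StringIO stands in for an open file here so the sketch is self-contained.
sample = io.StringIO("1   2     3\n 4 56\n\n 10 \n11\n")
total = sum(iter_ints(sample))
print(total)  # 87
```

Because the values are produced one at a time, memory use stays roughly constant regardless of file size; passing a real file object from `open()` instead of `sample` works the same way.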

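Update: as was said in the comments, func2() is faster than func1_3() (although on my system it was never faster than func1_3(), even with integers only). I updated foo.txt to contain things other than numbers and ran the timing tests again: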

foo.txt

1 2 10 11
asd dd
 dds asda
22 44 32 11   23
dd dsa dds
21 12
12
33
45
dds
asdas
dasdasd dasd das d asda sda

Tests:

In [13]: %timeit func1_3()
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 210 µs per loop

In [14]: %timeit func2()
1000 loops, best of 3: 279 µs per loop

In [15]: %timeit func1_3()
1000 loops, best of 3: 213 µs per loop

In [16]: %timeit func2()
1000 loops, best of 3: 273 µs per loop

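【Comments】:

Your func2() iterates over individual digits rather than tokens as func1() does. Profiling on my system shows that a correct func2() (which I'll call func2_1()) is actually faster than both func1() and func1_3().
What do you mean by a correct func2()? It is not my func2(); please check the OP's question, it is his code.
Anyway, I have updated the post with the correct func2(), but it is still slower on my system.
That is because the data actually contains nothing but numbers. If there were something other than numbers, func2() would perform much worse than it does now, because raising and catching exceptions is expensive.
@BlacklightShining I have updated the post with timing results for when foo.txt contains letters and words.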
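
The point about exception cost can be checked with a rough sketch like the following. The token list and function names are assumptions made up for illustration, and the absolute timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit

tokens = ["12", "asd", "7", "dds", "33", "asda"] * 1000  # half are non-numeric


def with_isdigit(tokens):
    """Filter with str.isdigit() before converting."""
    return [int(t) for t in tokens if t.isdigit()]


def with_try(tokens):
    """Attempt the conversion and ignore tokens that int() rejects."""
    result = []
    for t in tokens:
        try:
            result.append(int(t))
        except ValueError:
            pass
    return result


# Both strategies agree on this input; only the running time differs.
assert with_isdigit(tokens) == with_try(tokens)

print("isdigit:   ", timeit.timeit(lambda: with_isdigit(tokens), number=20))
print("try/except:", timeit.timeit(lambda: with_try(tokens), number=20))
```

Changing the share of non-numeric tokens in `tokens` is an easy way to see how much the result depends on the data.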

【Solution 2】:

If you can read the whole file as a single string (i.e. it is not too large), this is easy.

fileStr = open('foo.txt').read().split() 
integers = [int(x) for x in fileStr if x.isdigit()]

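read() turns the file into one long string, and split() breaks it on whitespace (i.e. spaces and newlines) into a list of strings. You can then combine that with a list comprehension that converts the strings to integers if they consist of digits.

As Bakuriu pointed out, if the file is guaranteed to contain only whitespace and digits, you don't have to check isdigit(). In that case, list(map(int, open('foo.txt').read().split())) is enough. That approach raises an error if anything is not a valid integer, whereas the other one skips anything that isn't a recognizable number.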
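
One consequence of filtering with isdigit() is worth spelling out: it also skips signed integers, which int() would accept. The sketch below uses a made-up token string and a hypothetical helper `try_int` to show the difference.

```python
tokens = "12 -5 +7 abc 3.5 42".split()

# isdigit() only accepts strings made up entirely of digit characters,
# so signed numbers are skipped along with the non-numeric tokens.
filtered = [int(t) for t in tokens if t.isdigit()]
print(filtered)  # [12, 42]


def try_int(token):
    """Return int(token), or None when int() rejects the token."""
    try:
        return int(token)
    except ValueError:
        return None


# int() itself accepts an optional sign, so this variant keeps -5 and +7.
converted = [n for n in map(try_int, tokens) if n is not None]
print(converted)  # [12, -5, 7, 42]
```

Which behavior is right depends on whether negative numbers can appear in the file; the question's sample data contains only unsigned integers.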

【Comments】:

【Solution 3】:

Thanks, everyone. I mixed some of the solutions you posted. This seems good to me:

with open("foo.txt","r") as f:
    my_list = [int(i)  for line in f for i in line.split() if i.isdigit()]

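【Comments】:

It is more concise than using try-except (which is not supported in comprehensions... _yet_), but by calling str.isdigit() you are duplicating work.

【Solution 4】:

Why not use the yield keyword? The code would be...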

def readInt():
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                yield int(i)

Then you can read the values with

    my_list = []
    for num in readInt():
        my_list.append(num)

【Comments】:

【Solution 5】:

my_list = []
with open('foo.txt') as f:
    for line in f:
        for s in line.split():
            try:
                my_list.append(int(s))
            except ValueError:
                pass

【Comments】:

【Solution 6】:

You can do this with a list comprehension:

my_list = [int(i)  for j in open("1.txt","r") for i in j.strip().split(" ") if i.isdigit()]

Or using with open():

with open("1.txt","r") as f:
    my_list = [int(i)  for j in f for i in j.strip().split(" ") if i.isdigit()]

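How it works:

1. First, you iterate over the lines.

2. Then you iterate over the words and check whether they are digits; if they are, they are added to the list.

Edit:

You need to apply strip() to each line because every line (except the last) ends with a newline character ("\n"); if you try "number\n".isdigit() you will get False.

For example: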

>>> "1\n".isdigit()
False

Edit 2:

Input:

1
qw 2
23 we 32

The file data as read:

a=open("1.txt","r")

repr(a.read())
"'1\\nqw 2\\n23 we 32'"

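You can see the "\n" newline characters, which affect the processing.

When I run the comprehension without strip(), it does not treat 1 and 2 as digits, because those tokens still contain the newline character: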

my_list = [int(i)  for j in open("1.txt","r") for i in j.split(" ") if i.isdigit()]
my_list
[23, 32]

The output clearly shows that 1 and 2 are missing. Using strip() avoids this.
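
The underlying reason is the difference between split(" ") and split() with no argument. A short sketch on in-memory strings (the sample strings are chosen for illustration):

```python
line = "qw 2\n"

print(line.split(" "))          # ['qw', '2\n']
print(line.strip().split(" "))  # ['qw', '2']
print(line.split())             # ['qw', '2']

# With several spaces in a row, split(" ") also produces empty strings,
# which isdigit() then filters out; split() never produces them.
print("1   2".split(" "))  # ['1', '', '', '2']
print("1   2".split())     # ['1', '2']
```

So strip() is only required when splitting on a literal space; calling split() without arguments treats any run of whitespace, including newlines, as a separator.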

【Comments】:

I changed it slightly, and this seems fine to me: ` with open("foo.txt","r") as f: my_list = [int(i) for line in f for i in line.split() if i.isdigit()] `
Let us continue this discussion in chat.

【Solution 7】:

Try this:

with open('file.txt') as f:
    nums = []
    for l in f:
        l = l.strip()
        nums.extend([int(i) for i in l.split() if i.isdigit() and l])

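l.strip() removes any trailing newline ('\n') first, since '6\n'.isdigit() returns False.

list.extend comes in handy here.

The and l at the end makes sure that nothing is added for empty lines.

str.split splits on whitespace by default, and the with block automatically closes the file once the code inside it has run. I also used list comprehensions.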

【Comments】:

【Solution 8】:

This is the fastest way I have found:

import re
regex = re.compile(r"\D+")

with open("foo.txt", "r") as f:
    my_list = list(map(int, regex.split(f.read())))

The result may depend on the size of the file, though.
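
One caveat with splitting on non-digits: if the text starts or ends with a non-digit (for example a trailing newline), the split yields empty strings, and int('') raises ValueError. The sketch below shows this on an in-memory string and uses re.findall as one possible alternative; it is a swapped-in variant, not the answer's original code, and its speed has not been measured here.

```python
import re

text = " 10 \n11\n"  # leading whitespace and a trailing newline

# Splitting on runs of non-digits leaves empty strings at both ends here,
# and int('') raises ValueError.
print(re.split(r"\D+", text))  # ['', '10', '11', '']

# Matching the digit runs directly avoids the empty strings.
numbers = [int(s) for s in re.findall(r"\d+", text)]
print(numbers)  # [10, 11]
```

Like the original, this extracts every run of digits, so a mixed token such as "a1b" contributes 1, unlike the isdigit()-based answers, which skip it.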

【Comments】: