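How to get the biggest number in a file? (Answers)

【Question title】: How to get the biggest number in a file?
【Posted】: 2015-08-16 00:27:58
【Question description】:

I want to get the maximum number in a file, where numbers are integers that can occur anywhere in the file.

I thought about doing the following: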

grep -o '[-0-9]*' myfile | sort -rn | head -1

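This uses grep to get all the integers from the file, printing one per line. Then sort -rn orders them numerically from largest to smallest, and head -1 prints the first one.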

But then I thought that sort -r might cause some overhead, so I went for:

grep -o '[-0-9]*' myfile | sort -n | tail -1
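
As a hedged aside, the pattern '[-0-9]*' also accepts runs such as 4-5 or a bare -, because it matches any mix of digits and hyphens. Below is a minimal sketch of a stricter variant, assuming a grep that supports -o, -h and -E (GNU and BSD grep do). max_by_sort is a hypothetical helper name, not something from the question:

```shell
# Hypothetical helper (not from the original post): largest integer via grep | sort | tail.
max_by_sort() {
    # -E enables "?" and "+"; -e stops the leading "-" of the pattern from being
    # parsed as an option; -h suppresses file name prefixes with several files.
    grep -ohE -e '-?[0-9]+' "$@" | sort -n | tail -n 1
}
```

Usage: max_by_sort myfile, or pipe text into it on standard input.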

To see which is fastest, I created a huge file with some random data, like this:

$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla 
and 3624 is another number
but this is not enough for -23 234245
$ for i in {1..50000}; do cat a >> myfile ; done

So that the file contains 150K lines.

Now I compare the performance under GNU bash version 4.2. The sys time is noticeably smaller for sort -rn:

$ time grep -o '[-0-9]*' myfile | sort -n | tail -1
42342234

real    0m1.823s
user    0m1.865s
sys 0m0.045s

$ cp myfile myfile2    #to prevent using cached info
$ time grep -o '[-0-9]*' myfile2 | sort -rn | head -1
42342234

real    0m1.864s
user    0m1.926s
sys 0m0.027s

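So I have two questions here:

Which is best, sort -n | tail -1 or sort -rn | head -1?
Is there a faster way to get the biggest integer in a given file?

Testing the solutions

So I ran all the commands and compared the time each one takes to find the value. To make things more reliable, I created a bigger file, 10 times bigger than the one I mentioned in the question: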

$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla 
and 3624 is another number
but this is not enough for -23 234245
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=500000;i++) print s}' > myfile
$ wc myfile 
1500000 13000000 62000000 myfile

Benchmarking them, I see that hek2mgl's solution is the fastest:

$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' myfile
42342234

real    0m3.979s
user    0m3.970s
sys 0m0.007s
$ time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' myfile 
42342234

real    0m2.203s
user    0m2.196s
sys 0m0.006s
$ time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
42342234

real    0m0.926s
user    0m0.848s
sys 0m0.077s
$ time tr ' ' '\n' < myfile | sort -rn | head -1
42342234

real    0m11.089s
user    0m11.049s
sys 0m0.086s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $max' myfile


real    0m6.166s
user    0m6.146s
sys 0m0.011s

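【Question comments】:

Another option that might be fast: awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile.
I checked this and it is also fast, @TomFenech. I don't know what is best here: adding it to hek2mgl's answer, leaving it in the comments, or posting a new answer with it. Thanks anyway!
@EdMorton The numbers are always surrounded by spaces or by the beginning/end of a line. Does that clarify my statement? Since you mention this case, the grep could be something like grep -oE '\b-?[0-9]+'. I guess some corner cases would still appear, though.

Tags: performance bash sorting

【Solution 1】:

I'm sure that a C implementation optimized with assembler would be the fastest. I could also think of a program that splits the file into multiple chunks, maps each chunk onto a single processor core, and then takes the maximum of the remaining nproc numbers.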
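
The chunk-and-merge idea can be approximated with standard tools rather than a custom C program. This is only a rough sketch, assuming GNU split (for -n l/N, which splits on line boundaries) and an xargs that supports -0 and -P. max_parallel and its jobs argument are hypothetical names. Whether it beats a single awk pass depends on the machine and the file, and it has not been benchmarked here:

```shell
# Hypothetical helper: max_parallel FILE [JOBS]
# Assumes GNU split (-n l/N) and an xargs that supports -0 and -P.
max_parallel() {
    local file="$1" jobs="${2:-4}" tmp
    tmp=$(mktemp -d) || return 1
    # Split on line boundaries into $jobs chunks so no integer is cut in half.
    split -n "l/$jobs" "$file" "$tmp/chunk."
    # Run one grep|sort|tail pipeline per chunk, up to $jobs at a time;
    # each prints its local maximum, and the final sort picks the overall one.
    printf '%s\0' "$tmp"/chunk.* |
        xargs -0 -n 1 -P "$jobs" sh -c \
            'grep -ohE -e "-?[0-9]+" "$1" | sort -n | tail -n 1' sh |
        sort -n | tail -n 1
    rm -rf "$tmp"
}
```

Usage: max_parallel myfile "$(nproc)". The temporary chunks double the disk traffic, so for files that fit comfortably in the page cache a single-pass tool may well be faster.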

Using just the existing command-line tools, have you tried awk?

time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile

Compared to the perl command in the accepted answer, it looks like it can do the job in about 50% of the time:

time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
cp myfile myfile2

time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2

Gives me:

42342234

real    0m0.360s
user    0m0.340s
sys 0m0.020s
42342234

real    0m0.193s   <-- Good job awk! You are the winner.
user    0m0.185s
sys 0m0.008s

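【Discussion】:

This is a good answer and I am surprised: I didn't expect awk to be faster than standard commands like sort and tail/head (probably another stage of the pipeline is what takes most of the time).
I would say it is the sort command that takes the most time, or at least about as much as grep. It simply performs more iterations than awk.
I just ran time grep -o '[-0-9]*' myfile &>/dev/null and it takes real 0m1.534s, user 0m1.530s, sys 0m0.001s!!
@fedorqui Why are you surprised? What awk does is O(N), while sort is O(N log(N)).
@lcd047 But sort does not operate on the same number of lines. Regardless, you are right: the difference here is the number of iterations.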

【Solution 2】:

In awk you can say:

awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file

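Explanation

In my experience, awk is the fastest text-processing language for most tasks, and the only things I have seen with comparable speed (on Linux systems) are programs written in C/C++.

The code above uses as few functions and commands as possible, which helps speed up execution.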

for(i=1;i<=NF;i++) - Loops through fields on the line. Using the default FS/RS and looping
                     this way is usually faster than using custom ones as awk is optimised 
                     to use the default

if(int($i))        - Checks if the field is not equal to zero and as strings are set to zero 
                     by int, does not execute the next block if the field is a string. I 
                     believe this is the quickest way to perform this check

{a[$i]=$i}         - Sets an array variable with the number as key and value. This means 
                     there will only be as many array variables as there are numbers in 
                     the file and will hopefully be quicker than a comparison of every 
                     number 

END{x=asort(a)     - At the end of the file, use asort on the array and store the
                     size of the array in x.

print a[x]         - Print the last element in the array.
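
The store-once idea can also be sketched without asort, which is a gawk extension. The version below is a variant under stated assumptions, not the answer's code: it keeps each integer-looking field as an array key and then scans only the distinct keys in END. It also swaps int($i) for a regex test, since int($i) is 0 for a field that is literally 0 and non-zero for a field such as 12abc. max_distinct is a hypothetical name:

```shell
# Hypothetical helper: largest integer among whitespace-separated fields,
# storing every distinct value only once.
max_distinct() {
    awk '
        # The array key deduplicates repeated numbers.
        { for (i = 1; i <= NF; i++) if ($i ~ /^-?[0-9]+$/) seen[$i] = 1 }
        END {
            # Scan the distinct values instead of sorting them.
            for (k in seen) if (n++ == 0 || k + 0 > max + 0) max = k
            if (n) print max
        }
    ' "$@"
}
```

Usage: max_distinct file. On input with few distinct numbers the END loop is tiny, but the per-field regex test still runs for every field.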

Benchmarks

Mine:

time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file

Takes

real    0m0.434s
user    0m0.357s
sys     0m0.008s

hek2mgl's:

awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file

Takes

real    0m1.256s
user    0m1.134s
sys     0m0.019s

For those wondering why it is faster: it is due to using the default FS and RS, which awk is optimised for.

Changing

awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'

to

awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'

gives the time

real    0m0.574s
user    0m0.497s
sys     0m0.011s

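This is still a little slower than my command.

I believe the small difference that remains is due to asort() only working on about 6 numbers, as each one is saved only once in the array.

In comparison, the other command performs a comparison on every single number in the file, which is more expensive.

If all the numbers in the file were unique, I think the two would run at about the same speed.

Tom Fenech's: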

 time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile

 real    0m0.716s
 user    0m0.612s
 sys     0m0.013s

A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
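
One way to avoid the blank result is to track explicitly whether any value has been seen, instead of relying on the initially empty max. A minimal sketch, assuming an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves it unspecified). max_int is a hypothetical name:

```shell
# Hypothetical helper: single-pass maximum that also works when every
# number is negative or when the maximum is 0.
max_int() {
    awk -v RS='[^-0-9]+' '
        # Accept only records that are a whole integer; this skips "", "-" and "4-5".
        /^-?[0-9]+$/ && (!seen || $0 + 0 > max + 0) { max = $0; seen = 1 }
        # Print nothing at all when the input holds no integer.
        END { if (seen) print max }
    ' "$@"
}
```

Usage: max_int myfile. The extra regex test costs some time per record, so this trades a little speed for correctness on all-negative input.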

Glenn Jackman's:

time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file

real    0m1.492s
user    0m1.258s
sys     0m0.022s

and

time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file

real    0m0.790s
user    0m0.686s
sys     0m0.034s

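The benefit of perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' is that it is the only answer here that works if 0 appears in the file as the largest number, and it also works if all the numbers are negative.

Notes

All times are the average of 3 tests

【Discussion】:

You forgot to add "benchmarked on a Pentium 2" :D Can you show the result of wc -l file?
No need to delete it. If you document how your code works and update the benchmarks, this is a perfectly valid answer.
I can confirm that it is faster than my attempt. I'd really like to know why atm... still not enough time to investigate...
Thanks for the great explanation and the attention to detail. Have a big +1. I will wait a few hours before accepting an answer, so that more people can go through the post and suggest other solutions. I would also like to compare your approach with Tom Fenech's to see which one performs better.
You can save a cycle or two: go from if(int($i)){a[$i]=$i} to a[0+$i]=0+$i -- adding zero to the value forces it to be treated as a number.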

【Solution 3】:

I'm surprised by awk's speed here. perl is usually pretty fast, but:

$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand

$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767

real    0m0.890s
user    0m0.887s
sys 0m0.003s

$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $m' rand 
32767

real    0m1.110s
user    0m1.107s
sys 0m0.002s

I think I've found a winner: with perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:

$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767

real    0m0.565s
user    0m0.539s
sys 0m0.025s

Takes a little more "sys" time, but less real time.

Also works with a file that contains only negative numbers:

$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42

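【Discussion】:

This will never print a negative number if the file contains non-numbers.
>| is a redirection operator that overrides bash's noclobber shell option. It is not needed here.
Found an awk solution that is almost 50% faster than the perl attempt. stackoverflow.com/a/30592354/171318 :)
What does the -E flag do? It seems to enable say, but I can't find any documentation covering it in the list of flags?

【Solution 4】:

I suspect this will be the fastest: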

$ tr ' ' '\n' < file | sort -rn | head -1
42342234

Third-run timing:

$ time tr ' ' '\n' < file | sort -rn | head -1
42342234
real    0m0.078s
user    0m0.000s
sys     0m0.076s

By the way, don't write shell loops to manipulate text, even if it is only to create sample input files:

$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile

real    0m0.109s
user    0m0.031s
sys     0m0.061s

$ wc -l myfile
150000 myfile

Compared to the shell loop suggested in the question:

$ time for i in {1..50000}; do cat a >> myfile2 ; done

real    26m38.771s
user    1m44.765s
sys     17m9.837s

$ wc -l myfile2
150000 myfile2

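If we want something that more robustly handles input files containing digits inside strings that are not integers, we need something like this: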

$ cat b
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
73 starts a line
avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015

$ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
 42342234
 3624
 123
73
 -23

$ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
real    0m0.109s
user    0m0.000s
sys     0m0.076s

$ wc -l myfileB
250000 myfileB

$ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
42342234
real    0m2.480s
user    0m2.509s
sys     0m0.108s

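Note that this input file has more lines than the original one, and with this input the robust grep solution above is actually faster than the original one I posted at the start of this answer: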

$ time tr ' ' '\n' < myfileB | sort -rn | head -1
42342234
real    0m4.836s
user    0m4.445s
sys     0m0.277s
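
A possible alternative for the stricter case, offered as a sketch rather than a benchmarked solution: the regex '(^| )-?[0-9]+( |$)' consumes the space after each match, so the second of two adjacent numbers (234245 after -23 in the sample) is missing from the grep output. Testing whole whitespace-separated fields in awk sidesteps that. max_strict is a hypothetical name, and the sketch assumes numbers are delimited by blanks or line boundaries, as in the sample file b:

```shell
# Hypothetical helper: largest field that is entirely an optionally signed integer.
max_strict() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                # Whole-field match rejects 3.14, 4-5, $15, 2:30 and 05/12/2015.
                if ($i ~ /^-?[0-9]+$/ && (!seen || $i + 0 > max + 0)) {
                    max = $i
                    seen = 1
                }
        }
        END { if (seen) print max }
    ' "$@"
}
```

Usage: max_strict myfileB. It prints nothing when no field qualifies, and it handles files whose maximum is 0 or negative.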

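【Discussion】:

Interesting. Based on your edit I am also checking which is fastest, sort -rn | head -1 or sort -n | tail -1.
Yes, it doesn't look like it matters whether you use the tail or the head approach. I thought head might save something, since it can stop reading after the first line.
If there is a word like 100n and 99 is the biggest number, sort -n will show 100n as the largest (at least on a mac, which is all I have to test on atm).
I was considering the basic case here: just integers (no floats) that appear "alone" (no "100n" or similar). Of course, having all negatives is possible. More conditions would lead to a more general but slower solution, but that is not the case here. Maybe in the future I will ask something related that takes these things into account. Also, good point about creating myfile. For me the for i in ... loop was not that slow (a minute or so), but even so, it is good to keep in mind what you always say about using awk.
I finally accepted Glenn Jackman's answer, since his Perl script was the fastest for me. For me the tr approach was three times slower than awk. Thanks a lot for investigating all along!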