Fastest way to sum file sizes by owner in a directory: answers

【Question title】: Fastest way to sum the file sizes by owner in a directory
【Posted】: 2019-08-12 11:57:13
【Question】:

I use the command below, through an alias, to print the total size of all files in a directory, summed per owner:

ls -l $dir | awk ' NF>3 { file[$3]+=$5 } \
END { for( i in file) { ss=file[i]; \
if(ss >=1024*1024*1024 ) {size=ss/1024/1024/1024; unit="G"} else \ 
if(ss>=1024*1024) {size=ss/1024/1024; unit="M"} else {size=ss/1024; unit="K"}; \
format="%.2f%s"; res=sprintf(format,size,unit); \
printf "%-8s %12d\t%s\n",res,file[i],i }}' | sort -k2 -nr

However, it does not always seem to be fast.

Is it possible to get the same output some other way, but faster?

【Comments on the question】:

why not parse ls
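You don't need to escape the newlines inside the string.
Check superuser.com/a/597173
When it is slow, how fast is ls -l $dir on its own? On some file systems, listing a large directory is very, very slow.
I have about 308,530 files in one such directory..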

Tags: linux shell perl

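【Solution 1】:

Parsing the output of ls is a bad idea.

How about using find instead?

Start in directory ${dir}
- limit to that directory level (-maxdepth 1)
- limit to files (-type f)
- print a line with the user name and the file size in bytes (-printf "%u %s\n")
Run the results through a perl filter
- split each line (-a)
- add the size (field 1) to a hash under the key (field 0)
- at the end (END {...}) print the hash contents, sorted by key, i.e. by user name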

$ find ${dir} -maxdepth 1 -type f -printf "%u %s\n" | \
     perl -ane '$s{$F[0]} += $F[1]; END { print "$_ $s{$_}\n" foreach (sort keys %s); }'
stefanb 263305714
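
If the layout from the question (size with unit, bytes, owner, largest first) is wanted, the same find output can be fed to awk instead. This is only a sketch under some assumptions: GNU find with -printf is available, sum_by_owner is a made-up function name, and %.0f is used instead of %d because some older awk builds clamp %d at 32 bits.

```shell
# Sketch: per-owner totals for regular files directly inside a directory,
# printed as "size-with-unit  bytes  owner", largest total first.
# Assumes GNU find (-printf); sum_by_owner is a hypothetical name.
sum_by_owner() {
    find "${1:-.}" -maxdepth 1 -type f -printf '%u %s\n' |
    awk '
        { total[$1] += $2 }                      # accumulate bytes per owner
        END {
            for (owner in total) {
                bytes = total[owner]
                if      (bytes >= 1024*1024*1024) { size = bytes/1024/1024/1024; unit = "G" }
                else if (bytes >= 1024*1024)      { size = bytes/1024/1024;      unit = "M" }
                else                              { size = bytes/1024;           unit = "K" }
                # %.0f rather than %d: an assumption to stay safe with older awks
                printf "%-8s %12.0f\t%s\n", sprintf("%.2f%s", size, unit), bytes, owner
            }
        }' |
    sort -k2,2nr                                 # sort on the byte column, descending
}
```

It would be called as sum_by_owner "$dir"; with no argument it falls back to the current directory.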

Solution using Perl:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV) {
    opendir(my $dh, $dir);

    # files in this directory
    while (my $entry = readdir($dh)) {
        my $file = File::Spec->catfile($dir, $entry);

        # only files
        if (-f $file) {
            my($uid, $size) = (stat($file))[4, 7];
            $users{$uid} += $size
        }
    }

    closedir($dh);
}

print "$_ $users{$_}\n" foreach (sort keys %users);

exit 0;

Test run:

$ perl dummy.pl .
1000 263618544

Interesting difference: the Perl solution finds 3 more files in my test directory than the find solution. I'll have to think about why that is...
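
One possible explanation, which is an assumption and not something the answer confirms: Perl's -f follows symbolic links, while find -type f (without -L) does not, so symlinks that point at regular files would be counted only by the Perl script. The sketch below reproduces that difference in the shell; count_find_files and count_stat_files are made-up helper names.

```shell
# Sketch: compare what "find -type f" and a stat-based -f test treat as a file.
# count_find_files and count_stat_files are hypothetical helper names.
count_find_files() {
    # find does not follow symlinks by default, so a link has type "l", not "f"
    find "$1" -maxdepth 1 -type f | wc -l
}

count_stat_files() {
    # [ -f ] follows symlinks, as Perl's -f file test does
    n=0
    for entry in "$1"/* "$1"/.*; do
        if [ -f "$entry" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

In a directory holding one regular file and one symlink to it, the first function counts one entry and the second counts two. If symlinks do turn out to be the cause, GNU find's -xtype f would be one way to make the find side match.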

【Comments】:

It should print for all owners.. not just the current user.. the files are owned by different users
Updated accordingly

【Solution 2】:

Not sure why the question is tagged perl when it uses awk.

Here is a simple perl version:

#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dir\n");

map {
    if ( ! m/^[.][.]?$/o ) {
        ($s,$u) = (stat)[7,4];
        $h{$u} += $s;
    }
} glob ".* *";

map {
    $s = $h{$_};
    $u = !( $s      >>10) ? ""
       : !(($s>>=10)>>10) ? "k"
       : !(($s>>=10)>>10) ? "M"
       : !(($s>>=10)>>10) ? "G"
       :   ($s>>=10)      ? "T"
       :                    undef
       ;
    printf "%-8s %12d\t%s\n", $s.$u, $h{$_}, getpwuid($_)//$_;
} keys %h;

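glob gets our list of files
m// discards . and ..
stat the size and uid
accumulate the sizes in %h
compute the unit by bit shifting (>>10 is integer division by 1024)
map the uid to a user name (// provides a fallback)
print the results (unsorted)
Note: unlike some of the other answers, this code does not recurse into subdirectories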
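
The shift-based unit selection can be tried on its own. The following is a rough shell rendering of the same idea, not the answer's Perl; human_size is a made-up name, and like the Perl version it truncates instead of rounding.

```shell
# Sketch: pick a unit by repeatedly shifting right by 10 bits
# (each shift is an integer division by 1024).
# human_size is a hypothetical name; the result is truncated, not rounded.
human_size() {
    size=$1
    for unit in "" k M G; do
        # stop as soon as one more shift would leave nothing
        if [ $((size >> 10)) -eq 0 ]; then
            printf '%s%s\n' "$size" "$unit"
            return
        fi
        size=$((size >> 10))
    done
    # anything that survived four shifts is reported in T
    printf '%sT\n' "$size"
}
```

For example, human_size 1536 gives 1k and human_size 5242880 gives 5M.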

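To exclude symlinks, subdirectories, etc., change the if to use the appropriate -X tests (e.g. (-f $_), (!-d $_ and !-l $_), etc.). See the perl docs on the _ filehandle optimisation for caching stat results.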

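【Comments】:

I don't see m/// in the script. I guess you mean !/^[.][.]?$/o?
Yes. // is a shortcut for m//. The m is only needed if you want to use different delimiters (e.g. m[], m<>, etc.). The three slashes were a typo.
Please either use m// in the script or use the code from the script in the explanation. As it stands, this is very confusing for people who don't know Perl well.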

【Solution 3】:

Another perl one, which shows the total sizes sorted by user:

#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d) {
  my $filename = File::Spec->catfile($dir, $file);
  my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
  $users{$uid} += $size if S_ISREG($mode);
}
closedir $d;

my @sizes = sort { $a->[0] cmp $b->[0] }
  map { [ getpwuid($_) // $_, $users{$_} ] } keys %users;
local $, = "\t";
say @$_ for @sizes;

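【Comments】:

@stack0114106 It limits the size tracking to regular files, skipping directories, fifos, sockets, devices, etc. It is the same as -f $file in the other answer, just a different way of checking.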

【Solution 4】:

Did I see some awk in the OP? Here is one in GNU awk using the filefuncs extension:

$ cat bar.awk
@load "filefuncs"
BEGIN {
    FS=":"                                     # passwd field sep
    passwd="/etc/passwd"                       # get usernames from passwd
    while ((getline < passwd)>0)
        users[$3]=$1
    close(passwd)                              # close passwd

    if(path=="")                               # set path with -v path=...
        path="."                               # default path is cwd
    pathlist[1]=path                           # path from the command line
                                               # you could have several paths
    fts(pathlist,FTS_PHYSICAL,filedata)        # dont mind links (vs. FTS_LOGICAL)
    for(p in filedata)                         # p for paths
        for(f in filedata[p])                  # f for files
            if(filedata[p][f]["stat"]["type"]=="file")      # mind files only
                size[filedata[p][f]["stat"]["uid"]]+=filedata[p][f]["stat"]["size"]
    for(i in size)
        print (users[i]?users[i]:i),size[i]    # print username if found else uid
    exit
}

Sample output:

$ ls -l
total 3623
drwxr-xr-x 2 james james  3690496 Mar 21 21:32 100kfiles/
-rw-r--r-- 1 root  root         4 Mar 21 18:52 bar
-rw-r--r-- 1 james james      424 Mar 21 21:33 bar.awk
-rw-r--r-- 1 james james      546 Mar 21 21:19 bar.awk~
-rw-r--r-- 1 james james      315 Mar 21 19:14 foo.awk
-rw-r--r-- 1 james james      125 Mar 21 18:53 foo.awk~
$ awk -v path=. -f bar.awk
root 4
james 1410

Another one:

$ time awk -v path=100kfiles -f bar.awk
root 4
james 342439926

real    0m1.289s
user    0m0.852s
sys     0m0.440s

Yet another test, with a million empty files:

$ time awk -v path=../million_files -f bar.awk

real    0m5.057s
user    0m4.000s
sys     0m1.056s

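【Comments】:

Looks like my awk doesn't have filefuncs: awk: foo.awk:1: ^ invalid char '@' in expression
Time to upgrade to a modern version of GNU awk.
This is on Enterprise Linux - RHEL 6.10.. I see gawk points to /bin/gawk and the version is GNU Awk 3.1.7.. does it support @loadfiles?.. or is there some other location that has another awk??..
A wild guess: the extensions came with GNU awk 4. But I see you mention 300k files; this solution can't handle that many.
ok.. nice to know about loadfiles anyway... I did run it in my cygwin and it works.. so ++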

【Solution 5】:

Get the listing, add up the sizes, and sort by owner (with Perl)

perl -wE'
    chdir (shift // "."); 
    for (glob ".* *") { 
        next if not -f;
        ($owner_id, $size) = (stat)[4,7]
            or do { warn "Trouble stat for: $_"; next };
        $rept{$owner_id} += $size 
    } 
    say (getpwuid($_)//$_, " => $rept{$_} bytes") for sort keys %rept
'

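I haven't benchmarked it, and it would be worth trying it against an approach that iterates over the directory instead of glob-ing it (though I found glob faster in a related problem).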

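I expect better runtime than with ls, which slows down dramatically as the list of files in a single directory gets long. This is due to the system, so Perl will be affected too, but as far as I recall it handles it far better. However, I have seen a dramatic slowdown only once the entries get to half a million or so, not a few thousand, so I am not sure why it runs slowly on your system.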

If this needs to recurse into the directories it finds, use File::Find. For example

perl -MFile::Find -wE'
    $dir = shift // "."; 
    find( sub { 
        return if not -f;
        ($owner_id, $size) = (stat)[4,7] 
            or do { warn "Trouble stat for: $_"; return }; 
        $rept{$owner_id} += $size 
    }, $dir ); 
    say (getpwuid($_)//$_, " => $rept{$_} bytes") for keys %rept
'

This scans a 2.4 Gb directory, mostly small files spread over a hierarchy of subdirectories, in a little over 2 seconds. du -sh took around 5 seconds (on the first run).

It is reasonable to combine these two into one script

use warnings;
use strict;
use feature 'say';    
use File::Find;
use Getopt::Long;

my %rept;    
sub get_sizes {
    return if not -f; 
    my ($owner_id, $size) = (stat)[4,7] 
        or do { warn "Trouble stat for: $_"; return };
    $rept{$owner_id} += $size 
}

my ($dir, $recurse) = ('.', '');
GetOptions('recursive|r!' => \$recurse, 'directory|d=s' => \$dir)
    or die "Usage: $0 [--recursive] [--directory dirname]\n";

($recurse) 
    ? find( { wanted => \&get_sizes }, $dir )
    : find( { wanted => \&get_sizes, 
              preprocess => sub { return grep { -f } @_ } }, $dir );

say (getpwuid($_)//$_, " => $rept{$_} bytes") for keys %rept;

When run non-recursively (the default), I find it performs about the same as the one-dir-only code above.

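Note that the File::Find::Rule interface has many conveniences but was slower in some important use cases, which clearly matters here. (That analysis should be redone, since it is a few years old.)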

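【Comments】:

Besides getpwuid possibly returning nothing (and thereby merging different uids), if you call it inside the find sub it is called once per file, compared with once per uid if you call it in the say.
@jhnc Yes to both: (1) I just added error handling, (2) I wasn't concerned about the extra system call during processing (getting the list from the system is the slow part) and wanted to collect names, but yes, it is faster that way (and it is generally better to keep the value returned by stat)
@stack0114106 Ah! So this must be about users that have been removed (or similar), so getpwuid returned nothing (undef). A reminder to always include all the necessary tests! (I still don't understand why the debugging print failed with the warning "uninitialized value $owner_id")
@zdim I created a folder with 200k files and ran your code with getpwuid in the for, and then with it moved to the say. The first took 2.456s/1.063s/1.369s, the second 0.862s/0.347s/0.515s. Those extra calls add up! (at least on an SSD...) :-)
By the way, I think your find version has a typo in the regex - it should be /^\.\.?$/ or similar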
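
The point made in these comments (sum by numeric uid first, resolve names once per uid afterwards) can also be sketched as a shell pipeline. This is an illustration only, assuming GNU find's -printf '%U' and a getent command; sum_by_uid_then_name is a made-up name.

```shell
# Sketch: aggregate by numeric uid first, then resolve each uid to a name once.
# Assumes GNU find (-printf '%U') and getent; sum_by_uid_then_name is hypothetical.
sum_by_uid_then_name() {
    find "${1:-.}" -maxdepth 1 -type f -printf '%U %s\n' |
    awk '{ total[$1] += $2 } END { for (uid in total) printf "%s %.0f\n", uid, total[uid] }' |
    while read -r uid bytes; do
        # one lookup per distinct uid; fall back to the number if there is no passwd entry
        name=$(getent passwd "$uid" | cut -d: -f1)
        printf '%s\t%s\n' "${name:-$uid}" "$bytes"
    done
}
```

Keying the totals on the uid also avoids merging different owners when a name lookup returns nothing, since the fallback is the uid itself.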

【Solution 6】:

Using datamash (and Stefan Becker's find code):

find ${dir} -maxdepth 1 -type f -printf "%u\t%s\n" | datamash -sg 1 sum 2

【Comments】:

@agc.. the answer looks very simple.. is datamash available in RHEL 6.1?
@stack0114106, not sure -- RPM files exist, but whether they work on RHEL 6.1 is unclear without a 6.1 box to test on.