Use Perl to count occurrences of all words in a file or in all files in a directory (answers)

【Question title】: Use Perl to count occurrences of all words in a file or in all files in a directory
【Posted】: 2014-09-26 00:00:44
【Question】:

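So I am trying to write a Perl script that takes 3 arguments.

The first argument is the input file or directory.
- If it is a file, it counts the number of occurrences of every word
- If it is a directory, it goes through each directory recursively and counts the occurrences of every word in the files within those directories
The second argument is a number: how many of the words with the highest number of occurrences to display.
- This prints to the console only the number for each word
The words are printed to an output file, which is the third argument on the command line.

It seems to be working as far as recursively searching through directories, finding all the occurrences of the words in the files and printing them to the console.

How can I print these to an output file? Also, how can I take the second argument, the number (say 5), and have the script print to the console the counts for that many of the most frequent words while printing the words to the output file?

Here is what I have so far: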

#!/usr/bin/perl -w

use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;

my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";
if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);
if ( -d $input ) {

    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}

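【Question comments】:

Sigh. This has a few decent answers, but it is probably a duplicate of stackoverflow.com/q/12823971/2019415 (and possibly others). That duplicate has no accepted answer, though, so if the OP here chooses to accept one of the answers below .... :-)
Sidebar: a "golfed" one-liner in perl6 (from Carl Masak++)! perl6-m -e '.say for (bag slurp.words).pairs.sort(*.value).reverse[^10]' then feed it a file, or a list of them from find . -type f -name "*.txt"
A one-liner in perl5 for posterity: perl -lnE '@ar = split/\s+/; $w{$_}++ for @ar}{ say "$_ $w{$_}" for (sort { $w{$b} <=> $w{$a} } keys %w)[0..10]' ... a modified version of a solution provided by @go|dfish in #perl-help
If you want Unicode input and output, you may need to run the one-liner above with perl -C26 -lnE. See perlunicode for more information.
I figured it out, thanks. I will post my code in a minute.

Tags: regex perl perlscript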

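【Solution 1】:

I would suggest restructuring your program/script. What you have posted is hard to follow. A few comments would help readers see what is happening. I will go through how I would arrange things, with some code snippets that will hopefully help explain each item. I will cover the three items you outlined in your question.

Since the first argument can be a file or a directory, I would use -f and -d to determine what the input is. I would use a list/array to hold the files to be processed. If the input is just a file, I would push it onto the processing list. Otherwise, I would call a routine that returns the list of files to process (similar to your search subroutine). Something like: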

# List of files to process
my @fileList = ();
# if input is only a file
if ( -f $ARGV[0] )
{
  push @fileList,$ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] ) 
{
   @fileList = search($ARGV[0]);
}

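So in your search subroutine you need a list/array onto which you push the items that are files, and then you return that array from the subroutine (after you have processed the list of files from the glob call). When you hit a directory, you call search with the path (just as you do now) and push the returned elements onto your current array, e.g.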

# If it is a file, save it to the list to be returned
if ( -f $filename ) 
{
  push @returnValue,$filename;
}
# else if a directory, get the files from the directory and 
# add them to the list to be returned
elsif ( -d $filename )
{
  push @returnValue, search($filename);
}
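
Putting the two fragments above together, the whole routine might look roughly like the sketch below. The names search and @returnValue come from the snippets above; files_to_process is a hypothetical wrapper added only for illustration. Because it keeps the question's glob("$path/*"), it skips names that start with a dot and will misbehave on paths that contain spaces.

```perl
use strict;
use warnings;

# Return every regular file below $path, recursing into subdirectories.
# glob() is used as in the question, so names starting with a dot are skipped.
sub search {
    my ($path) = @_;
    my @returnValue;
    foreach my $filename ( glob("$path/*") ) {
        if ( -f $filename ) {
            push @returnValue, $filename;
        }
        elsif ( -d $filename ) {
            push @returnValue, search($filename);
        }
    }
    return @returnValue;
}

# Hypothetical wrapper: turn the first argument into a list of files.
sub files_to_process {
    my ($input) = @_;
    return ($input)       if -f $input;
    return search($input) if -d $input;
    die "ERROR: $input is neither a file nor a directory\n";
}

# Only runs when an argument is supplied on the command line.
if (@ARGV) {
    print "$_\n" for files_to_process( $ARGV[0] );
}
```

Returning the list instead of counting inside the recursion keeps directory traversal separate from word counting, so each part can be tested on its own.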

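Once you have the list of files, loop over it and process each file (open it, read lines until the end of the file, process each line for words, then close it). The foreach loop you use to process each line works fine. However, if your words carry periods, commas or other punctuation, you may want to remove those before counting the word in the hash.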
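
A minimal sketch of that processing loop is shown below, assuming the list of files has already been built. count_words is a hypothetical name. It reuses the question's /\w+/g match, which already leaves out periods and commas because \w only matches letters, digits and underscores; note that it also splits a word such as don't into don and t.

```perl
use strict;
use warnings;

# Count the words in a list of files. Returns a reference to a hash that
# maps each word (case-sensitive, as in the question) to its count.
sub count_words {
    my (@files) = @_;
    my %count;
    foreach my $file (@files) {
        my $fh;
        unless ( open( $fh, '<', $file ) ) {
            warn "Skipping $file: $!\n";
            next;
        }
        while ( my $line = <$fh> ) {
            # /g in list context returns every match on the line.
            $count{$_}++ for $line =~ /\w+/g;
        }
        close $fh;
    }
    return \%count;
}

# Only runs when file names are supplied on the command line.
if (@ARGV) {
    my $count = count_words(@ARGV);
    printf "%-20s %s\n", $_, $count->{$_} for sort keys %$count;
}
```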

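In the next part, you asked how to determine the words with the highest counts. For that, you want to create another hash keyed by count, where each value is a list/array of the words that occur that many times. Something like: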

# Hash with key being a count and value a list of the words with that count
my %totals = ();
# Temporary variable to store occurrences (count) of the word
my $wordTotal;
# $w is each word in the count hash
foreach my $w ( keys %count ) 
{
  # Get the count for the word
  $wordTotal = $count{$w};
  # The value in %totals is an array reference, so dereference it with @{ }
  # and push the word onto that array
  push @{ $totals{$wordTotal} }, $w;  # the key into %totals is the value from the count
                                      # hash, whose keys are the words ($w)
}

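To get the words with the highest counts, take the keys of the totals hash, sort them numerically and reverse the sorted list so that the highest counts come first. Since each value is an array of words, we have to count every item we output in order to stop after the top N.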

# Number of items output so far
my $current = 0;
# Sort the keys of %totals numerically and reverse the list so the highest
# counts come first, then go through the list
foreach my $t ( reverse sort { $a <=> $b } keys %totals )
{
   # Since each value of the totals hash is an array of words,
   # loop through that array and print out the count for each word
   foreach my $w ( sort @{ $totals{$t} } )
   {
     # Print the count for this word
     print "$t\n";
     # Increment the number output
     $current++;
     # if this is the number to be printed, we are done 
     last if ( $current == $ARGV[1] );
   }
   # if this is the number to be printed, we are done 
   last if ( $current == $ARGV[1] );
}

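As for the third part, printing to a file, it is not clear from your question what "them" is (words, counts or both; limited to the top words or all words). I will leave it as an exercise for you to open the file, print the information to it and close it.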
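
For what it is worth, one possible reading of that exercise is sketched below. It assumes "them" means every word with its count, most frequent first, and that the console should only show the N highest counts. write_report and the demo data are hypothetical, not something from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write every word and its count to $output, most
# frequent first, and return the $n highest counts for the console.
sub write_report {
    my ( $count, $n, $output ) = @_;

    # Sort by descending count; ties are broken alphabetically.
    my @words = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;

    open my $out, '>', $output or die "Cannot write $output: $!\n";
    printf {$out} "%-20s %d\n", $_, $count->{$_} for @words;
    close $out or die "Cannot close $output: $!\n";

    # Never ask for more entries than there are words.
    $n = @words if $n > @words;
    return map { $count->{$_} } @words[ 0 .. $n - 1 ];
}

# Demo on made-up data, written to a temporary file that is removed at exit.
my %demo = ( the => 3, cat => 1, dog => 2 );
my ( undef, $demo_file ) = tempfile( UNLINK => 1 );
print "$_\n" for write_report( \%demo, 2, $demo_file );    # prints 3, then 2
```

In the real script the third command-line argument would take the place of $demo_file and the second argument the place of the literal 2.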

【Comments】:

【Solution 2】:

This will sum up the occurrences of words in a directory or file given on the command line:

#!/usr/bin/env perl
# wordcounter.pl
use strict;
use warnings;
use IO::All -utf8; 
binmode STDOUT, 'encoding(utf8)'; # you may not need this

my @allwords;
my %count;  
die "Usage: wordcounter.pl <directory|filename> number  \n" unless ~~@ARGV == 2 ;

if (-d $ARGV[0] ) {
  push @allwords, $_->slurp for io($ARGV[0])->all_files; 
}
elsif (-f $ARGV[0]) {
  @allwords = io($ARGV[0])->slurp ;
}

while (my $line = shift @allwords) { 
    foreach ( split /\s+/, $line) {
        $count{$_}++
    }
}

my $count_to_show;

for my $word (sort { $count{$b} <=> $count{$a} } keys %count) { 
 printf "%-30s %s\n", $word, $count{$word};
 last if ++$count_to_show == $ARGV[1];  
}

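By modifying the sort and/or io calls you can sort { } by number of occurrences or alphabetically by word, for a single file or for all files in a directory. Those options would be easy to add as parameters. You can also filter or change how words are defined for inclusion in the %count hash by changing foreach ( split /\s+/, $line), for example to include a match/filter such as foreach ( grep { length($_) <= 5 } split /\s+/, $line) in order to count only words of five letters or fewer.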
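
To make those variations concrete, here is a small sketch on made-up data (the %count contents are hypothetical). It shows a sort by count, a sort by word, and a length filter written with the numeric <= operator.

```perl
use strict;
use warnings;

# Made-up counts, standing in for the %count hash built by the script.
my %count = ( the => 3, a => 3, perl => 2, interpreter => 1 );

# By descending count; ties fall back to alphabetical order.
my @by_count = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

# Alphabetically by word.
my @by_word = sort keys %count;

# Only words of five letters or fewer, keeping the by-count order.
my @short = grep { length($_) <= 5 } @by_count;

print "by count: @by_count\n";    # by count: a the perl interpreter
print "by word:  @by_word\n";     # by word:  a interpreter perl the
print "short:    @short\n";       # short:    a the perl
```

The explicit tie-break matters because the order of keys %count is not stable between runs, so words with equal counts would otherwise appear in a different order each time.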

Example run in the current directory:

   ./wordcounter ./ 10    
    the                            116
    SV                             87
    i                              66
    my_perl                        58
    of                             54
    use                            54
    int                            49
    PerlInterpreter                47
    sv                             47
    Inline                         47
    return                         46

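Notes

You should probably add a test for file mimetypes, readability, etc.
Be careful with Unicode
To write to a file, just add > filename.txt to the end of the command line ;-)
IO::All is not the standard CORE IO package, I am only advertising and promoting it here ;-) (you could swap it out)
If you want to add a sort_by option (-n --numeric, -a --alphabetic, etc.), Sort::Maker might be one way to make that manageable.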
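
As an illustration of the first and fourth notes, the sketch below swaps IO::All for the core File::Find module and keeps only readable files that pass Perl's -T text heuristic. text_files is a hypothetical name, and -T is only a guess based on the first block of the file, not a real mimetype check.

```perl
use strict;
use warnings;
use File::Find;

# Collect readable text files below $dir using the core File::Find module.
sub text_files {
    my ($dir) = @_;
    my @files;
    find(
        {
            no_chdir => 1,    # $_ holds the full path inside wanted
            wanted   => sub {
                # -f regular file, -r readable, -T looks like text
                push @files, $_ if -f $_ && -r $_ && -T $_;
            },
        },
        $dir
    );
    return @files;
}

# Only runs when a directory is supplied on the command line.
if (@ARGV) {
    print "$_\n" for text_files( $ARGV[0] );
}
```

Unlike a glob-based search, find also visits files whose names start with a dot.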

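EDIT: I neglected to add the options as the OP requested.

【Comments】:

【Solution 3】:

I figured it out. Below is my solution. I am not sure if it is the best way to do it, but it works.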

    use strict;
    use warnings;

    # Check if there are three arguments in the commandline
    if (@ARGV < 3) {
       die "ERROR: There must be three arguments!\n";
    }
    # Input file or directory, number of words to show, output file
    my ($input, $num, $output) = @ARGV;
    my %count;

    # Check if the INPUT is a file
    if (-f $input) {
       print("This is a file!\n");
       # Open the file
       open my $fh,'<', $input or die "ERROR: Could not open file!";
       # Go through each line
       while (my $line = <$fh>) {
          chomp $line;
          # Count the occurrences of each word (/g returns every match)
          foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
             $count{$str}++;
          }
       }
       # Close the file
       close($fh);
    }

    # Check if the INPUT is a directory
    if (-d $input) {
       # Call subroutine to search directory recursively
       search_dir($input);
    }
    my $high_count = 0;
    # Open the output file
    open my $fileh,'>', $output or die "ERROR: Could not open file!\n";
    # Sort the words by descending count and print them
    foreach my $str (sort {$count{$b} <=> $count{$a}} keys %count) {
       $high_count++;
       # Only the top $num words go to the console
       if ($high_count <= $num) {
          printf "%-31s %s\n", $str, $count{$str};
       }
       # Every word goes to the output file
       printf $fileh "%-31s %s\n", $str, $count{$str};
    }
    close($fileh);
    exit;

    # Subroutine to search through each directory recursively
    sub search_dir {
       my $path = shift;
       my @dirs = glob("$path/*");
       # Loop through filenames
       foreach my $filename (@dirs) {
          # Check if it is a file
          if (-f $filename) {
             # Open the file
             open(FILE, $filename) or die "ERROR: Can't open file";
             # Go through each line
             while (my $line = <FILE>) {
                chomp $line;
                # Count the occurrences of each word
                foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
                   $count{$str}++;
                }
             }
             # Close the file
             close(FILE);
          }
          elsif (-d $filename) {
             search_dir($filename);
          }
       }
    }

【Comments】: