Help with Perl code to parse a file

【Question title】: help with perl code to parse a file
【Posted】: 2011-07-02 06:44:26
【Question description】:

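I am new to Perl and have a question about its syntax. I was given the code below, which parses a file containing specific information. What is the if (/DID/) part of the get_number subroutine doing? Is it using a regular expression? I am not sure, because a regular expression match usually looks like $_ =~ /some expression/. Finally, is the while loop inside get_number necessary?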

#!/usr/bin/env perl

use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;

# store the name of all the OCR file names in an array
my @file_list=qw{
   blah.txt
};

# set the scalar index to zero
my $file_index=0;

# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
    or die "Can't open output file\n";

while($file_index < 1){
    # open the OCR file and store it in the filehandle IN_FILE
    open(IN_FILE, '<', "$file_list[$file_index]")
        or die "Can't read source file!\n";

    print "Processing file $file_list[$file_index]\n";
    while(<IN_FILE>){
            my $citing_pat=get_number();
            get_country($citing_pat);
    }
    $file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;

get_number is defined as follows.

sub get_number {
    while(<IN_FILE>){
        if(/DID/){
            my @fields=split / /;
            chomp($fields[3]);
            if($fields[3] !~ /\D/){
                return $fields[3];
            }
        }
    }
}

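【Question discussion】:

Tags: regex perl parsing file-io screen-scraping

【Solution 1】:

Perl has a variable, $_, that serves as the default dumping ground for a lot of things.

In get_number, while(<IN_FILE>){ reads a line into $_, and the next line checks whether $_ matches the regular expression DID.

chomp; with no argument is also common, and it operates on $_ as well.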
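As a minimal sketch of those implicit uses of $_, the snippet below reads from an in-memory filehandle instead of a real file. The sample data and the @matched array are assumptions made for illustration, not part of the asker's program.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for IN_FILE.
my $data = "foo 1\nDID 2\nbar 3\n";
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";

my @matched;
while (<$fh>) {          # reads one line into $_
    chomp;               # same as chomp($_)
    if (/DID/) {         # same as if ($_ =~ /DID/)
        push @matched, $_;
    }
}
close $fh;

print "$_\n" for @matched;   # prints "DID 2"
```

Writing $_ =~ /DID/ explicitly behaves the same way; the short form just leaves the default variable unstated.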

【Discussion】:

I found a more thorough answer here: What is the significance of an underscore in Perl ($_, @_)?

【Solution 2】:

In this case, if (/DID/) searches the $_ variable by default, so you are correct. However, it is a rather loose regex, IMO.
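To illustrate what "loose" means here: /DID/ matches those three letters anywhere in the line. The sketch below compares it with a tighter pattern that assumes DID is the first field of the line. That assumption, and the sample lines, are hypothetical, since the input file was not shown.

```perl
use strict;
use warnings;

# Hypothetical lines, not taken from the asker's file.
my @lines = ("DID 1 2 345", "CANDIDATE 1 2 345", "DIDACTIC 1 2 345");

my @loose = grep { /DID/ } @lines;       # substring match anywhere in the line
my @tight = grep { /^DID\s/ } @lines;    # "DID" as the first field only

print scalar(@loose), " loose, ", scalar(@tight), " tight\n";   # prints "3 loose, 1 tight"
```

Which anchoring is right depends on where DID actually appears in the real data.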

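The while loop in the sub may be necessary, depending on what your input looks like. You should be aware that having two while loops causes some lines to be skipped entirely.

The while loop in the main program reads a line and does nothing with it. Basically, this means that the first line of the file, and every line directly after a matching line (i.e. a line that contains "DID" and whose 4th field is a number), is discarded.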
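The skipping can be reproduced with a small sketch. Here every line of the hypothetical input matches, so it is easy to see which lines are lost. The sub next_number is a stand-in with the same shape as get_number, except that the filehandle is passed in as an argument.

```perl
use strict;
use warnings;

# Hypothetical input: six lines that all match, numbered 1 to 6.
my $data = join '', map { "x DID y $_\n" } 1 .. 6;
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";

# Same shape as get_number in the question, with the handle passed in.
sub next_number {
    my ($in) = @_;
    while (<$in>) {
        next unless /DID/;
        my $field = (split)[3];
        return $field if defined $field && $field =~ /^\d+$/;
    }
    return;
}

my @seen;
while (<$fh>) {                   # outer loop reads a line and drops it
    my $n = next_number($fh);     # inner loop reads the following line(s)
    push @seen, $n if defined $n;
}
close $fh;

print "@seen\n";   # prints "2 4 6"
```

Lines 1, 3 and 5 are consumed by the outer loop and never examined, which is the loss described above.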

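To answer this properly, we would need to see the input file.

There are a number of problems with this code, and if it works as intended, that is probably down to luck.

Below is a cleaned-up version of the code. I kept the modules, since I do not know whether they are used elsewhere. I also kept the output file, since it may be used somewhere you did not show. This code does not call get_country with an undefined value, and it does nothing if no suitable number is found.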

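use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;

my @file_list = qw{ blah.txt };

open(my $out_file, '>', 'outputfile.txt') or die "Can't open output file: $!";

for my $file (@file_list) {
    open(my $in_file, '<', $file) or die "Can't read source file: $!";
    print "Processing file $file\n";
    # get_number returns nothing at end of file, which ends the loop;
    # the defined() test keeps a field of "0" from ending it early
    while (defined(my $citing_pat = get_number($in_file))) {
        get_country($citing_pat);
    }
    close $in_file;
}
close $out_file;

# Return the 4th whitespace-separated field of the next line that
# contains "DID", provided that field is all digits.
sub get_number {
    my $fh = shift;
    while (<$fh>) {
        if (/DID/) {
            my $field = (split)[3];
            if (defined $field && $field =~ /^\d+$/) {
                return $field;
            }
        }
    }
    return;
}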
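One difference in the cleaned-up version is that it calls split with no arguments where the original used split / /. A rough sketch of how the two behave on a hypothetical line with a doubled space (the sample line is an assumption, not data from the asker's file):

```perl
use strict;
use warnings;

# Hypothetical line with a doubled space and a trailing newline.
my $line = "a DID  b 42\n";

my @by_space = split / /, $line;   # ("a", "DID", "", "b", "42\n")
my @by_ws    = split ' ', $line;   # ("a", "DID", "b", "42")

print scalar(@by_space), " vs ", scalar(@by_ws), "\n";   # prints "5 vs 4"
```

A bare split is equivalent to split ' ', $_, so it skips runs of whitespace and drops the trailing newline. That is why the cleaned-up code needs no chomp before testing the field.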

【Discussion】: