Searching whether a group of strings from one file is present in another file

【Question title】: Search whether a group of strings in a file is present in another file or not
【Posted】: 2013-11-26 04:34:21
【Question】:

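I am writing a Perl script that opens a file containing many strings (one string per line), checks whether each of those strings is present in another file (the search file), and prints every occurrence. I have written the following code to look for one particular string. How can I improve it so that it takes the list of strings from a file?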

open(DATA, "<filetosearch.txt") or die "Couldn't open file filetosearch.txt for reading: $!";
my $find = "word or string to find";
#open FILE, "<signatures.txt";
my @lines = <DATA>;
print "Lines that matched $find\n";
for (@lines) {
    if ($_ =~ /$find/) {
        print "$_\n";
    }
}

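【Comments on the question】:

Can the strings from both files fit in memory at the same time?
You are opening filetosearch.txt for writing, not for reading.
Yes, the files are about 500 lines each and fit in memory. OK, I have corrected filetosearch.txt to be opened read-only as: open(DATA, "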
grep -F -f signatures.txt filetosearch.txt
grep -C3 -F -x -f file1 file2

Tags: perl

【Solution 1】:

I would try something like this:

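use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'filetosearch.txt'
    or die "Cannot tie filetosearch.txt: $!";
tie my @patterns, 'Tie::File', 'patterns.txt'
    or die "Cannot tie patterns.txt: $!";

my @result;
foreach my $pattern (@patterns)
{
    # Escape a copy: $pattern aliases the tied element, so assigning to it
    # would write the escaped text back into patterns.txt.
    my $quoted = quotemeta $pattern;
    push @result, grep { /$quoted/ } @lines;
}

# Tie::File strips the line ending from each record by default.
print "$_\n" for @result;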

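I use Tie::File because it is convenient (not especially in this case, but in others). Other people (perhaps many others?) will disagree, but that does not matter here.
grep is a core function and it is very good at what it does (in my experience).

【Comments】:

+1 for suggesting Tie::File in this case. It can be very slow for large files, but the OP is not dealing with that. Consider /\Q$pattern\E/, since the strings may contain metacharacters.
@Kenosis Thanks, Kenosis, I have incorporated your suggestion. I personally prefer quotemeta because I think it improves readability.
You're welcome! Your readability point is well taken. You can certainly do $pattern = quotemeta $pattern; before grepping.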
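
The two escaping styles discussed in these comments can be compared side by side. The following is only a minimal sketch: the sample pattern and the two sample lines are made up for illustration and do not come from the original thread.

```perl
use strict;
use warnings;

# Hypothetical sample pattern containing regex metacharacters.
my $pattern = 'a.b*c';

# quotemeta backslash-escapes the '.' and the '*'.
my $escaped = quotemeta $pattern;

# Both forms build a regex that matches the pattern literally.
my $re_quotemeta = qr/$escaped/;
my $re_inline    = qr/\Q$pattern\E/;

for my $line ('xx a.b*c yy', 'aXbbbc') {
    my $m1 = $line =~ $re_quotemeta ? 1 : 0;
    my $m2 = $line =~ $re_inline    ? 1 : 0;
    print "$line: quotemeta=$m1 \\Q..\\E=$m2\n";
}
```

Without escaping, /a.b*c/ would also match the second sample line, because '.' matches any character and 'b*' matches a run of b's. With either escaping style only the first line matches.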

【Solution 2】:

OK, this will be faster.

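# Return 1 if any line contains $find as a literal string, 0 otherwise.
sub testmatch
{
  my ($find, $linesref)= @_ ;

  # \Q...\E keeps metacharacters in $find from being treated as regex syntax
  for ( @$linesref ) { if ( $_ =~ /\Q$find\E/ ) { return 1 ; } }
  return 0 ;
}

{
  open(my $data, '<', 'filetosearch.txt') or die "Cannot open filetosearch.txt: $!" ;
  my @lines = <$data> ;

  open(my $src, '<', 'tests.txt') or die "Cannot open tests.txt: $!" ;
  while (my $find = <$src>)
  {
    chomp $find ;   # drop the newline so it is not part of the search string
    if ( testmatch( $find, \@lines )) { print "a match\n" }
  }
}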

If you are matching whole lines against whole lines, you can store each line as a hash key and simply test for existence:

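{
  open(my $data, '<', 'filetosearch.txt') or die "Cannot open filetosearch.txt: $!" ;
  chomp(my @all = <$data>) ;   # chomp so a final line without a newline still compares equal
  my %lines ;
  @lines{@all}= () ;           # every whole line becomes a hash key

  open(my $src, '<', 'tests.txt') or die "Cannot open tests.txt: $!" ;
  while (my $line = <$src>)
  {
     chomp $line ;
     # exists is a plain hash lookup; the smartmatch operator (~~) has been
     # marked experimental since Perl 5.18
     if (exists $lines{$line}) { print "a match\n" }
  }
}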

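【Comments】:

Never fail to use strict; use warnings;.
Actually, I just use 5.012;
Thanks a lot, Woolstar. But I forgot to mention that I want to print the 3 lines before the matched pattern. How can I do that?
@woolstar: That enables strict but not warnings.
@woolstar: Using use <version> to mean anything other than "this code uses features that were not available before Perl <version>" is very bad practice. You are expecting people to carry around in their heads a catalog of the additional features implied by every such use (use 5.10 does not include use strict), and it would be reasonable for someone to remove the declaration if it is obvious that nothing in the code depends on said version.

【Solution 3】:

Maybe something like this does the job: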

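use strict;
use warnings;

open my $fh1, '<', 'filetosearch.txt' or die "Cannot open filetosearch.txt: $!";
my @arrFileToSearch = <$fh1>;
close $fh1;

open my $fh2, '<', 'signatures.txt' or die "Cannot open signatures.txt: $!";
chomp(my @arrSignatures = <$fh2>);
close $fh2;

# Escape special characters once, up front. Doing this inside the search loop
# would escape the same (aliased) array element again on every pass.
@arrSignatures = map { quotemeta($_) } @arrSignatures;

for my $i (0 .. $#arrFileToSearch) {
    foreach my $signature (@arrSignatures) {
        if ($arrFileToSearch[$i] =~ /$signature/) {
            # A negative index would wrap around to the end of the array.
            print $arrFileToSearch[$i-3] if $i >= 3;   # or any other index that you want
        }
    }
}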
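
The loop above prints only the single line three positions before a match. If the goal is to print the three preceding lines together with the matching line, as requested in the comments on Solution 2, something along these lines might work. This is a sketch under assumptions: lines_with_context is a hypothetical helper name, the sample data is made up, and index is used for literal substring matching instead of a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: for every line in @$lines that contains any of the
# literal strings in @$strings, return up to $before preceding lines followed
# by the matching line itself.
sub lines_with_context {
    my ($lines, $strings, $before) = @_;
    my @out;
    for my $i (0 .. $#$lines) {
        next unless grep { index($lines->[$i], $_) >= 0 } @$strings;
        my $start = $i - $before;
        $start = 0 if $start < 0;            # clamp at the top of the file
        push @out, @{$lines}[$start .. $i];  # context lines plus the match
    }
    return @out;
}

my @lines   = map { "line $_" } 1 .. 6;      # made-up sample data
my @strings = ('line 2', 'line 5');
print "$_\n" for lines_with_context(\@lines, \@strings, 3);
```

When two matches are close together their context windows overlap, so some lines are returned more than once. Whether that is acceptable depends on what the output is used for.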

【Comments】:

【Solution 4】:

Here is another option:

use strict;
use warnings;

my $searchFile = pop;
my @strings    = map { chomp; "\Q$_\E" } <>;
my $regex      = '(?:' . ( join '|', @strings ) . ')';

push @ARGV, $searchFile;

while (<>) {
    print if /$regex/;
}

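Usage: perl script.pl strings.txt searchFile.txt [>outFile.txt]

The last, optional parameter redirects the output to a file.

First, the name of the search file is (implicitly) popped off @ARGV and saved for later. Then the file of strings is read (<>), and map is used to chomp each line and escape metacharacters (\Q and \E, in case the strings contain regex characters such as '.' or '*'). The resulting strings are stored in an array. The array elements are joined with the regex alternation character (|), which effectively forms an OR of all the strings to be matched against each line of the search file. Next, the name of the search file is pushed onto @ARGV so that its lines can be read. Finally, each line of the search file is printed if one of the strings is found in it. These lines are not chomped, so they keep their newline when printed.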
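
The regex-building step described above can be isolated into a small function, which makes it easy to try without any input files. This is a sketch only: build_regex is a hypothetical name, the sample strings are made up, and skipping empty strings is an added assumption (an empty alternative would match every line).

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of literal strings into one compiled regex
# that matches any line containing at least one of them.
sub build_regex {
    my @strings = grep { length } @_;    # skip empty strings
    return unless @strings;              # nothing to search for
    my $alternation = join '|', map { quotemeta($_) } @strings;
    return qr/(?:$alternation)/;
}

my $regex = build_regex('foo.bar', 'a|b');   # made-up sample strings
for my $line ('see foo.bar here', 'fooXbar', 'just a', 'a|b') {
    print "$line\n" if $line =~ $regex;
}
```

Because every string is escaped before joining, the '.' in the first sample and the '|' in the second are matched literally. Only the '|' characters added by join act as alternation.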

Hope this helps!

【Comments】: