Parallelise grep - use file rows as input for grep: answers

【Question title】: Parallelise grep - use file rows as input for grep
【Posted】: 2013-07-30 17:22:00
【Question】:

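I have File1 and File2 as shown below. I found similar questions, but none that is quite the same.

Each row of File1 should be used as input for grep, and the first column of the matching rows of File2 should be extracted. In the toy example below, if column 2 of File2 equals a or b, the value in column 1 is written to File_ab.

So far I have been using a double loop, and the estimated run time is 4 days. I was hoping for something like: cat File1 | xargs -P 12 -exec grep "$1\|$2" File2 > File_$1$2.txt but I could not get the syntax right. I am trying to run 12 greps in parallel, each with an OR condition.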

File1
a b
c d

File2
1 a
2 b
3 c
1 d
4 a
5 e
6 d

The desired output is 2 files, File_ab and File_cd:

File_ab
1
2
4
File_cd
1
3
6

Note: my File1 has 25K rows and File2 has 10 million rows.

【Comments】:

Tags: parallel-processing grep xargs hpc

【Solution 1】:

Using perl:

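#!/usr/bin/perl

use FileCache;   # lets the script write to more files than can be open at once

# Read File1. Every token on a row maps to that row's output file,
# e.g. the row "a b" gives a => File_ab and b => File_ab.
@a=`cat File1`;
chomp(@a);
for $a (@a) {
    @parts = split/ +/,$a;
    push @re, @parts;
    for $p (@parts) {
        $file{$p} = "File_".join "",@parts;
    }
}

# One alternation that matches any token from File1.
$re = join("|",@re);

# Single pass over File2 (named on the command line): capture the leading
# number and the matched token, then write the number to that token's file.
# The $file{$2} check skips matches that have no output file assigned.
while(<>) {
    if(/(\d+).*($re)/o and $file{$2}) {
        $fh = cacheout $file{$2};
        print $fh $1,"\n";
    }
}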

Then:

chmod 755 myscript
./myscript File2
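
For readers who do not know perl, the single-pass idea can be sketched in Python. This is a minimal illustration under stated assumptions, not the answer's own code: it assumes whitespace-separated columns, an exact match on column 2 of File2 (a dictionary lookup rather than a regex alternation), and that the extracted values fit in memory. The names build_lookup, split_file2 and out_dir are hypothetical.

```python
import os
from collections import defaultdict


def build_lookup(file1_path):
    """Map each token in File1 to the output file name of its row."""
    lookup = {}
    with open(file1_path) as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue  # skip blank rows
            name = "File_" + "".join(parts)
            for token in parts:
                # A token that appears on several rows keeps the last row seen.
                lookup[token] = name
    return lookup


def split_file2(file1_path, file2_path, out_dir="."):
    """Read File2 once and return {output file name: [column-1 values]}."""
    lookup = build_lookup(file1_path)
    groups = defaultdict(list)
    with open(file2_path) as fh:
        for line in fh:
            cols = line.split()
            # Assumption: column 2 must equal a File1 token exactly.
            if len(cols) >= 2 and cols[1] in lookup:
                groups[lookup[cols[1]]].append(cols[0])
    # Write each group in one go, so only one output file is open at a time.
    for name, values in groups.items():
        with open(os.path.join(out_dir, name), "w") as out:
            out.write("\n".join(values) + "\n")
    return dict(groups)
```

Calling split_file2("File1", "File2") would write File_ab and File_cd to the current directory. Values are written in the order they occur in File2. Because File2 is read only once, however many rows File1 has, the work grows with the size of File2 rather than with the product of the two file sizes.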

【Discussion】:

I'm not familiar with perl. Could you briefly explain the logic?
Error: Can't create : No such file or directory at ./myscript line 17. Line 17 is $fh = cacheout $file{$2};
Edited the answer. Try it now.