Search for a few hundred filenames in a few hundred log files

【Question title】: Search for a few hundred filenames in a few hundred log files
【Posted】: 2013-09-25 00:15:45
【Question】:

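I'd like to efficiently search for about 200 filenames in a few hundred log files.

I can easily do this with grep's -f option by putting the needles in a file.

However, there are a few problems:

I'm interested in doing this efficiently, as in How to use grep efficiently?
I want to see all matches for each search term (i.e., filename) separately, across all log files. grep -f reports matches as it finds needles in each file.
I want to know when a filename doesn't match at all.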

2.7 i7 MBP w/ 16 GB RAM

Using grep -ron -f needle * gives me:

access_log-2013-01-01:88298:google
access_log-2013-01-01:88304:google
access_log-2013-01-01:88320:test
access_log-2013-01-01:88336:google
access_log-2013-01-02:396244:test
access_log-2013-01-02:396256:google
access_log-2013-01-02:396262:google

where needle contains:

google
test

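The problem here is that the whole directory is searched for any match from needle, and the process is single-threaded, so it takes a long time. There is also no explicit indication of which needles were not found at all.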

【Comments】:

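Do the filenames contain spaces? Also, are the filenames sometimes attached to other text, or are they always delimited by whitespace, start of line, or end of line?
What does the output of this script look like?
@Desidero The filenames do not contain spaces. The filenames can be attached to other text. Think /foor/bar/baz/needle.txt
@michael Not sure I follow.
@kayaker243, suppose you had a solution to this problem: what would the output look like? Give us an example of input and output.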

Tags: multithreading bash grep

【Solution 1】:

How about combining grep and find in a bash script?

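# Word-splits needles.txt, so needles must not contain whitespace.
for needle in $(cat needles.txt); do
    echo "$needle"
    # find exits non-zero if any grep invocation exits non-zero (no match or error).
    matches=$(find . -type f -exec grep -nH -e "$needle" {} +)
    if [[ 0 == $? ]] ; then
        if [[ -z "$matches" ]] ; then
            echo "No matches found"
        else
            echo "$matches"
        fi
    else
        echo "Search failed / no matches"
    fi
    echo
done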

needles.txt contains the list of your target filenames.

To read the needles file line by line (so that needles can contain spaces), use this version:

# IFS= and -r keep leading whitespace and backslashes in each needle intact.
while IFS= read -r needle ; do
    echo "$needle"
    matches=$(find . -type f -exec grep -nH -e "$needle" {} +)
    if [[ 0 == $? ]] ; then
        if [[ -z "$matches" ]] ; then
            echo "No matches found"
        else
            echo "$matches"
        fi
    else
        echo "Search failed / no matches"
    fi
    echo
done < needles.txt

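If you combine this with xargs, the exit code $? is no longer zero even on success, presumably because grep exits with status 1 for every file that lacks the needle and xargs then reports a non-zero status itself. Testing only the captured output is probably less safe, but it works for me: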

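# IFS= and -r keep leading whitespace and backslashes in each needle intact.
while IFS= read -r needle ; do
    echo "$needle"
    # -n1 runs one grep per file; -P2 runs two of them in parallel.
    matches=$(find . -type f -print0 | xargs -0 -n1 -P2 grep -nH -e "$needle")
    if [[ -z "$matches" ]] ; then
        echo "No matches found"
    else
        echo "$matches"
    fi
    echo
done < needles.txt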
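
For needles that contain spaces, such as GET /term/, the loop above can be wrapped in a function that matches each needle literally. This is only a sketch under assumptions that are not part of the answer: the helper name search_needles and its arguments are hypothetical, grep's -F option is swapped in so that needles are treated as fixed strings rather than regular expressions, and log file names are assumed not to contain colons.

```bash
# search_needles NEEDLE_FILE LOG_DIR [JOBS]   (hypothetical helper name)
# The body runs in a subshell, so its variables do not leak to the caller.
search_needles() (
    needle_file=$1
    log_dir=$2
    jobs=${3:-4}
    # IFS= and -r keep spaces and backslashes in each needle intact.
    while IFS= read -r needle; do
        [ -z "$needle" ] && continue    # skip blank lines
        printf '%s\n' "$needle"
        # -F makes grep treat the needle as a literal string, not a regex.
        # xargs exits non-zero whenever some file lacks the needle, so the
        # status is ignored and the captured output is tested instead.
        matches=$(find "$log_dir" -type f -print0 |
            xargs -0 -n1 -P"$jobs" grep -nHF -e "$needle") || true
        if [ -z "$matches" ]; then
            echo "No matches found"
        else
            # Parallel greps finish in arbitrary order; a stable sort on the
            # file name groups lines per file and keeps their line order.
            printf '%s\n' "$matches" | sort -s -t: -k1,1
        fi
        echo
    done < "$needle_file"
)
```

A call might look like search_needles needles.txt ./logs 8. Keeping needles.txt outside the searched directory avoids it matching itself. With very large outputs, lines written by parallel grep processes may interleave, so treat the result as approximate in that case.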

【Comments】:

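Thanks! I modified it slightly to use xargs to spread grep across 8 processes: matches=$(find . -type f -print0 | xargs -0 -n1 -P8 grep -nH -E $needle). That seems to work. However, it turns out I do need to match whitespace: the term I'm searching for is actually GET /term/. Including a backslash before the term in needles.txt fails and seems to exit execution. Quoting $needle seems to prevent $needle from being evaluated. Any suggestions?
@kayaker243 Can you adapt the version in the edit to your needs? Thanks for pointing out the parallelization with xargs, that was new to me.
No, my bash skills aren't up to the challenge of handling spaces in this case :(
Does the second version above not work for needles containing spaces? Just put GET /term/ on a single line in needles.txt.
Yes, putting the needle with spaces on its own line causes the script to die.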

【Solution 2】:

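To determine which needles no longer match anything, you can post-process the output from grep:

1. Use awk or a similar tool to extract the matched strings into a separate file.
2. Run sort --uniq filename -o temp1.
3. Concatenate the needles file onto temp1.
4. Run sort temp1 -o temp2.
5. Run uniq -u temp2 > temp3.

temp3 will contain the needles that are no longer used.

There may be a more concise way to do this. Steps 1 and 2 produce the list of unique needles that were found in the files.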

Suppose your needles file contains:

google
foo
bar

and grep finds foo and bar in multiple files, but does not find google. Step 1 will create a file like:

foo
bar
bar
foo
foo
bar
foo

sort --uniq will produce:

bar
foo

Concatenating the needles file gives:

bar
foo
google
foo
bar

Sorting gives:

bar
bar
foo
foo
google

and the final uniq -u command outputs a single line:

google
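
The walkthrough above can be chained into a single pipeline without temporary files. This is a rough sketch, not the answer's own script: the helper name unmatched_needles is hypothetical, grep -o is used in place of awk to extract the matched strings, and the needles are assumed to be literal strings with no blank lines. If one needle is a substring of another, grep -o may report only the longer one, so the shorter needle could be listed as unmatched by mistake.

```bash
# unmatched_needles NEEDLE_FILE LOG_FILE...   (hypothetical helper name)
# The body runs in a subshell, so its variables do not leak to the caller.
unmatched_needles() (
    needle_file=$1
    shift
    {
        # Unique needles that occur somewhere in the logs:
        # -o prints only the matched text, -h omits file names,
        # -F matches the needles literally.
        grep -ohF -f "$needle_file" "$@" | sort -u
        # Every needle once (duplicates in the needles file are collapsed).
        sort -u "$needle_file"
    } | sort | uniq -u    # needles that appear only once were never found
)
```

For example, unmatched_needles needles access_log-* prints one unmatched needle per line, and prints nothing when every needle was found.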

【Comments】: