Find XML files not containing a specific comment from the shell

【Question title】: Find XML files not containing a specific comment from the shell
【Posted】: 2016-12-21 20:31:03
【Question】:

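I want to search (with awk/grep/sed) through several XML files (pom.xml files), skipping some folders. The first condition is that the files must contain the <module> tag. Among those files, I want to print the ones that do NOT contain the following exact sequence (it is auto-generated code, so this would help me detect whether someone has modified it):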

  <!--
         | Start of user code (user defined modules)
         |-->
        <!-- 
         | End of user code
         |-->

This is where I'm stuck:

        fileArray=($(find . -type f -not -path "./folder1/*" -not -path "*/folder2/*" -not -path "./folder3/*" -name "pom.xml" \
                    | xargs awk -v RS='^$' 'match($0,/\<module>[^\n]+/,a){print a[0]}'))

Any suggestions?

--- UPDATE:

  #!/bin/sh

###########################################################
# Checks for "user code" <modules> defined in pom files.
###########################################################

function check()
{
              # http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html

        OLDIFS=$IFS
        IFS=$'\n'

        # Read all pom files into an array
        # - Search for user code modules: look for the <module> tag in the pom files and, for files that contain modules,
        #   check whether the autogenerated section has been modified. The expected text sequence is read from the foo.txt file.
        #
        # - Exclude model folder as the codegen poms therein require such a repository



        fileArray=($(find . -type f -not -path "./folder1/*" -not -path "*/folder2/*" -not -path "./folder3/*" -name "pom.xml" \
                         | xargs `awk -v RS='^$' 'NR==FNR{str=$0;next} /<module>/ && !index($0,str){print FILENAME}' sequence {} +`))


        IFS=$OLDIFS

        # get length of an array
        numberOfFiles=${#fileArray[@]}

        # read all filenames
        for (( i=0; i<${numberOfFiles}; i++ ));
        do
          echo "ERROR:Found user code modules (file:line:occurrence): ${fileArray[$i]}"
        done


    if [ "$numberOfFiles" != "0" ]; then
        echo "SUMMARY:Found $numberOfFiles pom.xml file(s) containing user code modules."
        exit 1
    fi
}

check

---- UPDATE (latest console output)

    :~/temp> bash script.sh
awk: cmd. line:1: fatal: cannot open file `{}' for reading (No such file or directory)
ERROR:Found user code modules (file:line:occurrence): ./test_folder/test4/pom.xml ./test_folder/test1/pom.xml ./test_folder/test2/pom.xml ./test_folder/test3/pom.xml
SUMMARY:Found 1 pom.xml file(s) containing user code modules.
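
One way to read the console output above: the backticks in the script run awk first, as a command substitution, before xargs starts. That awk call fails on the literal {} argument and prints nothing, so xargs is left without a command, and xargs falls back to echo when no command is given. The sketch below reproduces that effect. The three paths are hypothetical, and the awk 'BEGIN { exit 1 }' call is only a stand-in for the failing awk.

```shell
#!/bin/sh
# Stand-in for the failing awk in the script: prints nothing, exits non-zero.
# The three paths are made up for this demo.
joined=$(printf '%s\n' ./t1/pom.xml ./t2/pom.xml ./t3/pom.xml |
         xargs $(awk 'BEGIN { exit 1 }'))

# xargs had no command, so it ran echo and joined everything on one line.
printf '%s\n' "$joined"

# Splitting on newlines only (IFS=$'\n' in the script) therefore yields one item.
lines=$(printf '%s\n' "$joined" | wc -l | tr -d ' ')
printf 'items: %s\n' "$lines"
```

That would explain why all the paths appear in a single ERROR line and why the summary counts 1 file.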

【Comments】:

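I'd suggest using an XML/HTML parser (xmllint, xmlstarlet, ...).
Start with a script for a single file (without awk/find).
The worst data-loss incident I've personally witnessed was caused by backup-maintenance code that assumed the (critical billing) log files followed a particular naming convention. A buffer overflow dumped garbage into a file name, the garbage included a * surrounded by whitespace, and the script deleted every log file in the directory. If you only write code to handle the cases you think can happen, you're writing bugs into the places you think can't happen.
...and if you're careless whenever you think a script's correctness doesn't matter, do you really think you'll suddenly be able to follow good habits and practices on the one day a year when you're doing something that really matters, without having built the habit of robust practice the rest of the time?
The naming convention in that data-loss case was also very strict: [0-9a-f]{24} is about as strict as they come. By the way, the fact that you're searching for a comment rather than semantic data is crucial, since it helps insulate you from the "just use XMLStarlet / xmllint" answers, so I've edited it into the title.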

Tags: xml bash awk grep

【Solution 1】:

Store that text in a file named foo, then run:

find ... -exec awk -v RS='^$' 'NR==FNR{str=$0;next} /<module>/ && !index($0,str){print FILENAME}' foo {} +

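Use whatever find options suit you to get the list of XML files. Whether you use -exec or pipe to xargs is up to you; I'm really only addressing the awk part, since that seems to be where you're stuck.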

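The above uses GNU awk for the multi-character RS, which makes awk read each file as a single record. It searches each XML file for the entire contents of foo as a literal string (not a regular expression) and prints the name of any file that contains <module> but does not contain that string.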

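If that doesn't do what you want, edit your question to show a more complete sample input and expected output, including the text you want to search for, in context, in the input files.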
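
To make the behaviour concrete, here is a small self-contained sketch. It is not the answer's exact command: it swaps the GNU-awk-only RS='^$' trick for a portable loop that accumulates each file into a buffer, so it should also run under other POSIX awks. The demo directory, the three sample pom.xml files, and the helper name flush_file are hypothetical; the file foo plays the role described in the answer.

```shell
#!/bin/sh
# Sketch only: the demo tree and sample poms below are made up for illustration.
demo=$(mktemp -d) || exit 1
mkdir -p "$demo/ok" "$demo/modified" "$demo/nomodule"

# foo holds the exact auto-generated sequence to look for.
cat > "$demo/foo" <<'EOF'
<!--
 | Start of user code (user defined modules)
 |-->
<!--
 | End of user code
 |-->
EOF

# ok: contains <module> and the untouched sequence.
{ echo '<modules>'; echo '<module>core</module>'
  cat "$demo/foo"; echo '</modules>'; } > "$demo/ok/pom.xml"

# modified: someone added a module between the two comments.
cat > "$demo/modified/pom.xml" <<'EOF'
<modules>
<!--
 | Start of user code (user defined modules)
 |-->
<module>extra</module>
<!--
 | End of user code
 |-->
</modules>
EOF

# nomodule: no <module> tag at all, so it is never reported.
echo '<project/>' > "$demo/nomodule/pom.xml"

# Portable stand-in for the RS='^$' approach: slurp each file line by line.
# The first file argument (foo) becomes the needle; later files are checked.
result=$(cd "$demo" && find . -type f -name pom.xml -exec awk '
    function flush_file() {
        if (seen == 1) {
            needle = buf
        } else if (seen > 1 && index(buf, "<module>") && !index(buf, needle)) {
            print name
        }
    }
    FNR == 1 { flush_file(); seen++; buf = ""; name = FILENAME }
    { buf = buf $0 "\n" }
    END { flush_file() }
' foo {} + | sort)

printf '%s\n' "$result"
rm -rf "$demo"
```

Only the pom whose generated section was altered is reported. Note that this variant assumes foo is not empty; an empty foo would cause the first pom to be taken as the needle.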

【Comments】:

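Does this have to process only one XML file per invocation, or can we make it -exec ... {} + to minimize the number of awk invocations?
Good point. I'll update it to use {} +, since the awk script doesn't care how many XML files it is called with as arguments. I was dodging the find details :-).
Thank you very much for the detailed answer. I'll check it tomorrow at 10 am (UTC/GMT) and let you know. It looks promising though :)
awk: fatal: cannot open file 'foo' for reading (No such file or directory) -:(( Working on it...
No, that's not the problem. You're mixing up the -exec awk '...' foo {} + syntax and the | xargs awk '...' foo syntax. I know what a "pom" is. You have to create the file named foo; see the first sentence of my answer. Try the awk script on a few files without find to make sure it works for you, then add find.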
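
The two invocation styles mentioned in the comment above can be compared side by side. In this sketch, which assumes a find and xargs that support the common but non-POSIX -print0 and -0 options, the awk program only prints the name of every file it receives after foo, so the output shows which arguments awk actually got. The demo directory and file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: hypothetical demo tree, including a directory name with a space.
demo=$(mktemp -d) || exit 1
mkdir -p "$demo/a" "$demo/b dir"
echo 'expected sequence' > "$demo/foo"
echo '<module>x</module>' > "$demo/a/pom.xml"
echo '<module>y</module>' > "$demo/b dir/pom.xml"

# Skip the first file (foo), then print each remaining file name once.
prog='NR == FNR { next } FNR == 1 { print FILENAME }'

# Style 1: find substitutes the found paths for {} itself.
with_exec=$(cd "$demo" &&
    find . -type f -name pom.xml -exec awk "$prog" foo {} + | sort)

# Style 2: xargs appends the paths after foo; no {} and no backticks.
with_xargs=$(cd "$demo" &&
    find . -type f -name pom.xml -print0 | xargs -0 awk "$prog" foo | sort)

printf 'exec:\n%s\nxargs:\n%s\n' "$with_exec" "$with_xargs"
rm -rf "$demo"
```

Both styles should list the same two files. The {} placeholder belongs to find's -exec; plain xargs (without -I) simply appends its input to the end of the command.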

【Solution 2】:

You can use xmllint to search for the node with XPath:

xmllint --xpath '//module' */pom.xml

Its return code tells you whether the node was found (0) or not (!= 0).

【Comments】:

Thanks Diego. How can I detect that sequence with xmllint?