Exact string search in XML files? Answers

【Question title】: Exact string search in XML files?
【Posted】: 2016-12-24 18:24:57
【Question description】:

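I need to search some XML files (all with the same name, pom.xml, including those in subfolders) for the following text sequence, so that I get an alert if someone has written any text, or even just whitespace, inside it: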

     <!--
     | Startsection
     |-->         
    <!-- 
     | Endsection
     |-->

I am running the following Python script, but it still does not match exactly: I get an alert even when only part of the text is present:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
tag="<module>"

for root, dirs, files in os.walk("."):

    if "pom.xml" in files:
        p=join(root, "pom.xml") 
        print("Checking",p)
        with open(p) as f:
            s=f.read()
        if tag in s and comment.search(s):
            print("Matched",p)

Update #3

I would like to print the content of the <module> tag if it exists between |--> and <!--

inside the searched sequence:

 <!--
 | Startsection
 |-->         
 <!-- 
 | Endsection
 |-->

For example, after Matched and the file name, also print "example.test1" in the case below:

     <!--
     | Startsection
     |-->         
       <module>example.test1</module>
     <!-- 
     | Endsection
     |-->

Update #4

The following should be used:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("/home/temp/test_folder/"):
 for skipped in ("test1", "test2", ".repotest"):
    if skipped in dirs: dirs.remove(skipped)

 if "pom.xml" in files:
    p=join(root, "pom.xml") 
    print("Checking",p)
    with open(p) as f:
       s=f.read()
       if tag in s and comment.search(s):
          print("The following files are corrupted ",p)

Update #5

import re
import os
import xml.etree.ElementTree as etree 
from bs4 import BeautifulSoup 
from bs4 import Comment

from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("myfolder"):
 for skipped in ("model", "doc"):
    if skipped in dirs: dirs.remove(skipped)

 if "pom.xml" in files:
    p=join(root, "pom.xml") 
    print("Checking",p)
    with open(p) as f:
       s=f.read()
       if tag in s and comment.search(s):
          print("ERROR: The following file are corrupted",p)



bs = BeautifulSoup(open(p), "html.parser")
# Extract all comments
comments=bs.find_all(string=lambda text:isinstance(text,Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        modules = [m for m in c.findNextSiblings(name='module')]
        for mod in modules:
            print(mod.text)

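【Question discussion】:

Please don't parse XML with regular expressions. It's a terrible idea that makes experienced programmers cry. Try BeautifulSoup or its underlying library lxml.
I am thinking of storing the exact sequence in an external file. How can I implement that? Could you help me with this? Thanks!
@AdamSmith, ...the difficulty here is that they want to find comments, so it doesn't actually appear in the DOM tree.
By the way, when creating a new question that is closely related to an older one (in this case, a Python rather than shell instance of stackoverflow.com/questions/38958403/…), it is considered good form to include a link and to describe explicitly how they differ.
@CharlesDuffy comments can be parsed in XPath and XSLT using the comment() function.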

Tags: python xml

【Solution 1】:

The characters "|", "(" and ")" must be escaped; also add re.MULTILINE to the regular expression.

comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)

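Edit: you can also put a newline in the regular expression: \n

Any amount of whitespace (including none) is: \s*

You can find more information about Python regular expressions here: https://docs.python.org/2/library/re.html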
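To illustrate why the escaping matters, here is a minimal sketch that compares the unescaped pattern from the question with the escaped one. The names `loose`, `strict`, `intact` and `edited` are made up for this illustration. An unescaped "|" is the alternation operator, so the original pattern matches as soon as any single fragment such as "<!--" appears. Note also that re.MULTILINE only changes how ^ and $ behave, so it makes no difference for this particular pattern and is left out here.

```python
import re

# Unescaped: "|" means "or", so this is a loose list of alternatives.
loose = re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
# Escaped: "\|" is a literal pipe, so the whole sequence must appear in order.
strict = re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->")

intact = """
    <!--
     | Startsection
     |-->
    <!--
     | Endsection
     |-->
"""
# Simulate someone writing a <module> line between the two comments.
edited = intact.replace(
    "|-->\n    <!--",
    "|-->\n    <module>example.test1</module>\n    <!--",
    1,
)

print(bool(loose.search(intact)), bool(loose.search(edited)))    # True True
print(bool(strict.search(intact)), bool(strict.search(edited)))  # True False
```

The loose pattern cannot tell the two inputs apart, while the escaped pattern stops matching once anything other than whitespace sits between the two comments.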

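【Discussion】:

Thank you very much! This is a good solution, but can it be made stricter? For example, what if we type an ENTER between lines 3 and 4? If possible, I would like to cover that case too.
Any hints on how to do what the previous comment asks, please?
Is it possible to detect an ENTER between lines 3 and 4 of this input? I can only detect extra or missing characters; I would also like to detect a space or a TAB. Thanks! :))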
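One possible way to get the stricter check asked about in these comments is to drop the regular expression and compare against the exact text, so that every space, tab and line break counts. The sketch below is only an illustration under assumptions: `EXPECTED` and `section_is_untouched` are hypothetical names, and the reference text would have to reproduce the real file's indentation character for character (it could also be read from an external file).

```python
# Hypothetical reference text; it must mirror the real file exactly,
# and could equally be loaded from an external file.
EXPECTED = "<!--\n | Startsection\n |-->\n<!--\n | Endsection\n |-->"

def section_is_untouched(content, expected=EXPECTED):
    """Return True only if `expected` occurs character for character."""
    # Normalise Windows line endings only; spaces, tabs and blank lines
    # are deliberately left significant.
    normalised = content.replace("\r\n", "\n")
    return expected in normalised

# Unchanged section embedded in a larger document.
print(section_is_untouched("<project>\n" + EXPECTED + "\n</project>"))
# An extra ENTER between the two comments.
print(section_is_untouched(EXPECTED.replace("|-->\n<!--", "|-->\n\n<!--")))
# A stray TAB at the end of the first comment.
print(section_is_untouched(EXPECTED.replace("|-->\n<!--", "|-->\t\n<!--")))
```

With this approach a file would be reported as modified whenever the function returns False, which is the opposite of the "match means corrupted" logic in the regex version.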

【Solution 2】:

Don't parse XML files with regular expressions. The best Stack Overflow answer ever explains why.

You can use BeautifulSoup to help with this task.

See how simple it is to extract something from your code:

from bs4 import BeautifulSoup

content = """
    <!--
     | Start of user code (user defined modules)
     |-->

    <!--
     | End of user code
     |-->
"""

bs = BeautifulSoup(content, "html.parser")
print(''.join(bs.contents))

Of course, you can use your XML file instead of the literal I am using:

bs = BeautifulSoup(open("pom.xml"), "html.parser")

A small example using the expected input:

from bs4 import BeautifulSoup
from bs4 import Comment

with open("pom.xml") as f:
    bs = BeautifulSoup(f, "html.parser")
# Extract all comments
comments = bs.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        # Collect the <module> tags that follow this comment at the same level
        modules = c.find_next_siblings(name='module')
        for mod in modules:
            print(mod.text)

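But if your code is always inside a module tag, I don't see why you should care about the comments before and after it; you can just look for the code in the module tag directly.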
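If only the modules written between the two marker comments matter, a standard-library alternative can be sketched as follows. This swaps in xml.etree.ElementTree for BeautifulSoup and is not the answer's own method; it assumes Python 3.8 or later (for `insert_comments=True`), and the function name `modules_between_markers` and the sample document are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def modules_between_markers(xml_text, start="Startsection", end="Endsection"):
    """Return the text of <module> elements found between the marker comments."""
    # insert_comments=True (Python 3.8+) keeps comments as nodes in the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    found = []
    for parent in root.iter():
        inside = False
        for child in parent:
            if child.tag is ET.Comment:
                text = child.text or ""
                if start in text:
                    inside = True
                elif end in text:
                    inside = False
            elif inside and isinstance(child.tag, str):
                # Drop a "{namespace}" prefix, which real pom.xml files carry.
                if child.tag.rsplit("}", 1)[-1] == "module":
                    found.append((child.text or "").strip())
    return found

sample = """<project>
  <modules>
    <module>outside.before</module>
    <!--
     | Startsection
     |-->
    <module>example.test1</module>
    <!--
     | Endsection
     |-->
  </modules>
</project>"""

print(modules_between_markers(sample))  # ['example.test1']
```

An empty result means nothing was written between the markers, while any returned names can be printed next to the file path, as requested in Update #3.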

【Discussion】:

For the cases we print because they match, is it possible to also print what is written between |--> and