How do I find a specific tag in an XML file in Python?

【Question title】: How do I find a specific tag in an XML file in Python?
【Posted】: 2021-12-28 18:09:45
【Question description】:

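I have an XML file in which I am trying to find a specific tag, but the tag sits at different levels of the hierarchy. I want to find the "MotionVector" tag and then compute the average motion vector value for a specific frame type (P, B, or I frame). Part of the XML file is shown below: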

<Picture id="1" poc="1">
    <GOPNr>0</GOPNr>
    <SubPicture structure="0">
        <Slice num="0">
            <Type>0</Type>
            <TypeString>SLICE_TYPE_P</TypeString>
            <NAL>
                <Num>5</Num>
                <Type>1</Type>
                <TypeString>NALU_TYPE_SLICE</TypeString>
                <Length>47048</Length>
            </NAL>
            <MacroBlock num="0">
                <MotionVector list="0">
                    <RefIdx>0</RefIdx>
                    <Difference>
                        <X>184</X>
                        <Y>149</Y>
                    </Difference>
                    <Absolute>
                        <X>184</X>
                        <Y>149</Y>
                    </Absolute>
                </MotionVector>
                <MotionVector list="0">
                    <RefIdx>0</RefIdx>
                    <Difference>
                        <X>10</X>
                        <Y>0</Y>
                    </Difference>
                    <Absolute>
                        <X>194</X>
                        <Y>149</Y>
                    </Absolute>
                </MotionVector>
                <Position>
                    <X>0</X>
                    <Y>0</Y>
                </Position>
                <QP_Y>21</QP_Y>
                <Type>1</Type>
                <TypeString>P_L0_L0_16x8</TypeString>
                <PredModeString>BLOCK_TYPE_P</PredModeString>
                <SkipFlag>0</SkipFlag>
            </MacroBlock>
            <MacroBlock num="1">
                <SubMacroBlock num="0">
                    <Type>0</Type>
                    <TypeString>P_L0_8x8</TypeString>
                    <MotionVector list="0">
                        <RefIdx>0</RefIdx>
                        <Difference>
                            <X>8</X>
                            <Y>-1</Y>
                        </Difference>
                        <Absolute>
                            <X>192</X>
                            <Y>148</Y>
                        </Absolute>
                    </MotionVector>
                </SubMacroBlock>
            </MacroBlock>
         </Slice>
        </SubPicture>
</Picture>

As you can see, the tag path to the X and Y values is Picture/SubPicture/Slice/MacroBlock/MotionVector/Absolute/X, but sometimes it is Picture/SubPicture/Slice/MacroBlock/SubMacroBlock/MotionVector/Absolute/X. So when I use this code

abs_x_tag = [node.text for node in root.findall('Picture/SubPicture/Slice/MacroBlock/SubMacroBlock/MotionVector/Absolute/X')]

to extract all X values, it does not return all of them. I also have to compute the motion vectors per frame type, based on this tag:

<TypeString>SLICE_TYPE_P</TypeString>

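Given these constraints, I do not know how to extract the X and Y values separately for each frame type. I can extract X and Y values with the code above, but I do not know how to look them up by frame type. Can you help me with this? Thanks.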

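【Comments】:

Could you attach a complete XML node and an example of the desired output?
Unfortunately, I cannot attach my file. The XML file is about 26 MB, and the image attachment option does not accept it. How can I attach my XML file?
Extract a VALID subset of the file and post it.

Tags: python python-3.x xml xml-parsing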

【Solution 1】:

Here is an example of how you can parse this XML with BeautifulSoup.

Install BeautifulSoup and lxml:

pip install beautifulsoup4 lxml

Code:

from bs4 import BeautifulSoup


XML = """
<Picture id="1" poc="1">
        <GOPNr>0</GOPNr>
        <SubPicture structure="0">
            <Slice num="0">
                <Type>0</Type>
                <TypeString>SLICE_TYPE_P</TypeString>
                <NAL>
                    <Num>5</Num>
                    <Type>1</Type>
                    <TypeString>NALU_TYPE_SLICE</TypeString>
                    <Length>47048</Length>
                </NAL>
                <MacroBlock num="0">
                    <MotionVector list="0">
                        <RefIdx>0</RefIdx>
                        <Difference>
                            <X>184</X>
                            <Y>149</Y>
                        </Difference>
                        <Absolute>
                            <X>184</X>
                            <Y>149</Y>
                        </Absolute>
                    </MotionVector>
                </MacroBlock>
            </Slice>
        </SubPicture>
</Picture>
"""

soup = BeautifulSoup(XML, 'xml')

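slices = soup.find_all('Slice')
for slice_node in slices:  # named slice_node to avoid shadowing the built-in `slice`
    # The first TypeString in document order is the slice's own;
    # the one nested inside <NAL> comes later.
    slice_type = slice_node.find('TypeString').text
    print(f"Type: {slice_type}")
    # find_all searches all descendants, so MotionVector elements nested
    # inside SubMacroBlock are found as well.
    vectors = slice_node.find_all('MotionVector')
    for vector in vectors:
        print("Vector:")
        difference = vector.find('Difference')
        difference_x = difference.find('X').text
        difference_y = difference.find('Y').text

        absolute = vector.find('Absolute')
        absolute_x = absolute.find('X').text
        absolute_y = absolute.find('Y').text

        # At this point you know the slice type and the x, y values (as strings)

        print(f"Difference: {difference_x}, {difference_y}")
        print(f"Absolute: {absolute_x}, {absolute_y}")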

Output:

Type: SLICE_TYPE_P
Vector:
Difference: 184, 149
Absolute: 184, 149

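【Comments】:

What is XML? I passed the path of my XML file instead of 'XML', but it produced this error: soup = BeautifulSoup('E:/UGCVIDEOS/9_9.xml', 'xml') D:\software\Anaconda3\envs\py37\lib\site-packages\bs4\__init__.py:350: MarkupResemblesLocatorWarning: "E:/UGCVIDEOS/9_9.xml" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup. MarkupResemblesLocatorWarning
@david You should pass the file contents to BeautifulSoup: text = open('E:/UGCVIDEOS/9_9.xml').read(), and then soup = BeautifulSoup(text, 'xml')
It works, but I have to analyze about 2000 XML files and it is very time-consuming. Do you have any suggestions to speed it up?
@david You can split the XML files across several folders and run a separate Python process on each, so that all CPU cores are used. Test your solution on a few files first, then leave the processes running overnight :) Also, be sure to log progress, e.g. "Processed 5 of 500 files..."
@Eugenij Thanks. I am new to Python, so I am not familiar with splitting the XML files or the other things you suggested. Could you help me with this or point me to a tutorial?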
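The suggestion to spread the files over several processes can be sketched with the standard library alone, without splitting the files into folders by hand. This is a minimal sketch, not a tuned solution: the folder name `xml_files`, the helper names `count_motion_vectors` and `process_all`, and the per-file work (counting MotionVector elements) are assumptions made for illustration. It uses `xml.etree.ElementTree` rather than BeautifulSoup only because it ships with Python; no timing was measured on the asker's data.

```python
import glob
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor, as_completed


def count_motion_vectors(path):
    """Parse one XML file and return (path, number of MotionVector elements)."""
    root = ET.parse(path).getroot()
    # iter() walks all descendants, so vectors under SubMacroBlock count too.
    return path, sum(1 for _ in root.iter("MotionVector"))


def process_all(paths, max_workers=None):
    """Run count_motion_vectors on every path, one task per file."""
    results = {}
    # max_workers=None lets the pool pick a size based on the CPU count.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(count_motion_vectors, p) for p in paths]
        for done, future in enumerate(as_completed(futures), start=1):
            path, count = future.result()
            results[path] = count
            print(f"Processed {done} of {len(paths)} files")
    return results


if __name__ == "__main__":
    # "xml_files" is a hypothetical folder name; point it at your own data.
    xml_paths = glob.glob("xml_files/*.xml")
    if xml_paths:
        print(process_all(xml_paths))
    else:
        print("No XML files found")
```

The `if __name__ == "__main__":` guard matters on Windows, where worker processes import the script again when they start. Replace the body of `count_motion_vectors` with whatever per-file analysis you need; it only has to return a value that can be pickled.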

【Solution 2】:

We can do this in a simple way; see the output below:

import xml.etree.ElementTree as ET

SampleXML = """
<Picture id="1" poc="1">
        <GOPNr>0</GOPNr>
        <SubPicture structure="0">
            <Slice num="0">
                <Type>0</Type>
                <TypeString>SLICE_TYPE_P</TypeString>
                <NAL>
                    <Num>5</Num>
                    <Type>1</Type>
                    <TypeString>NALU_TYPE_SLICE</TypeString>
                    <Length>47048</Length>
                </NAL>
                <MacroBlock num="0">
                    <MotionVector list="0">
                        <RefIdx>0</RefIdx>
                        <Difference>
                            <X>184</X>
                            <Y>149</Y>
                        </Difference>
                        <Absolute>
                            <X>184</X>
                            <Y>149</Y>
                        </Absolute>
                    </MotionVector>
                </MacroBlock>
            </Slice>
        </SubPicture>
</Picture>
"""
# To read from a file instead, uncomment the two lines below and replace
# <InputXML> with the absolute path of the XML file.
# tree = ET.parse(r"<InputXML>")
# root = tree.getroot()
root = ET.fromstring(SampleXML)
TypeString = root.findall("./SubPicture/Slice/TypeString")
print("TypeString: ", TypeString[0].text)
# ".//" searches at any depth, so this matches MotionVector both directly under
# MacroBlock and under SubMacroBlock. Chaining two findall() calls with `or`
# would return only the first non-empty list.
abs_x_tag = root.findall(".//MotionVector/Absolute/X")
print("abs_x_tag: ", abs_x_tag[0].text)

Output:

TypeString:  SLICE_TYPE_P

abs_x_tag:  184
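The question also asks for an average per frame type, which the snippet above does not compute. A possible sketch, assuming each Slice carries its own TypeString child as in the posted fragment: group the Absolute vectors by slice type and average them. The wrapper element `Sequence`, the trimmed sample data, and the function name `mean_vectors_by_slice_type` are made up for illustration, since the real root element of the file was never posted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed sample; the <Sequence> wrapper is an assumed root element.
SAMPLE = """
<Sequence>
  <Picture id="1" poc="1">
    <SubPicture structure="0">
      <Slice num="0">
        <TypeString>SLICE_TYPE_P</TypeString>
        <MacroBlock num="0">
          <MotionVector list="0">
            <Absolute><X>184</X><Y>149</Y></Absolute>
          </MotionVector>
        </MacroBlock>
        <MacroBlock num="1">
          <SubMacroBlock num="0">
            <MotionVector list="0">
              <Absolute><X>192</X><Y>148</Y></Absolute>
            </MotionVector>
          </SubMacroBlock>
        </MacroBlock>
      </Slice>
    </SubPicture>
  </Picture>
</Sequence>
"""


def mean_vectors_by_slice_type(root):
    """Return {slice type: (mean X, mean Y)} over all Absolute vectors."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # type -> [sum_x, sum_y, count]
    for slice_node in root.iter("Slice"):
        # findtext with a bare tag looks at direct children only, so this is
        # the slice's own TypeString, not the one inside <NAL>.
        slice_type = slice_node.findtext("TypeString")
        # ".//" matches MotionVector at any depth below the slice.
        for absolute in slice_node.findall(".//MotionVector/Absolute"):
            acc = sums[slice_type]
            acc[0] += float(absolute.findtext("X"))
            acc[1] += float(absolute.findtext("Y"))
            acc[2] += 1
    return {t: (sx / n, sy / n) for t, (sx, sy, n) in sums.items()}


root = ET.fromstring(SAMPLE)
print(mean_vectors_by_slice_type(root))
# → {'SLICE_TYPE_P': (188.0, 148.5)}
```

Swap `Absolute` for `Difference` in the path to average the difference vectors instead. For a file on disk, build `root` with `ET.parse(path).getroot()`.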

【Comments】: