【Posted】: 2021-07-28 20:04:19
【Question】:
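I have the following XML. I am trying to extract the text inside <Unicode> for each <TextRegion> element. Between each opening <TextRegion> tag and its closing </TextRegion> tag, there may be tags of the form <Unicode> Sample Text </Unicode>.
I want to extract this text and store it in separate lists. Could someone help me? I tried a few things with ElementTree, but I am completely lost.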
<Page imageFilename="0000223278.tif" imageHeight="1000" imageWidth="762">
<TextRegion id="Page1_TopMargin">
<Property key="Margin" value="Top"/>
<Coords points="0,0 762,0 762,14 0,14"/>
</TextRegion>
<TextRegion id="Page1_LeftMargin">
<Property key="Margin" value="Left"/>
<Coords points="0,14 100,14 100,701 0,701"/>
</TextRegion>
<TextRegion id="Page1_RightMargin">
<Property key="Margin" value="Right"/>
<Coords points="677,14 762,14 762,701 677,701"/>
</TextRegion>
<TextRegion id="Page1_BottomMargin">
<Property key="Margin" value="Bottom"/>
<Coords points="0,701 762,701 762,1000 0,1000"/>
</TextRegion>
<TextRegion id="Page1_PrintSpace">
<Property key="Margin" value=""/>
<Coords points="100,14 677,14 677,701 100,701"/>
</TextRegion>
<TextRegion id="Page1_Block1">
<Property key="language" value="en-US"/>
<Coords points="247,26 277,26 277,51 247,51"/>
<TextLine id="Page1_Block1_l1">
<Coords points="247,26 275,26 275,49 247,49"/>
<Word id="Page1_Block1_l1_w1">
<Coords points="247,26 275,26 275,49 247,49"/>
<TextEquiv conf="0.1650000066">
<Unicode>r></Unicode>
</TextEquiv>
</Word>
</TextLine>
</TextRegion>
<ImageRegion id="Page1_Block2">
<Coords points="476,14 501,14 501,59 476,59"/>
</ImageRegion>
<TextRegion id="Page1_Block3">
<Property key="ComposedBlock" value="Page1_Block4 Page1_Block5"/>
<Coords points="100,73 476,73 476,123 100,123"/>
</TextRegion>
<ImageRegion id="Page1_Block4">
<Coords points="100,73 148,73 148,113 100,113"/>
</ImageRegion>
<TextRegion id="Page1_Block5">
<Property key="language" value="en-US"/>
<Coords points="155,75 476,75 476,123 155,123"/>
<TextLine id="Page1_Block5_l1">
<Coords points="158,77 471,77 471,93 158,93"/>
<Word id="Page1_Block5_l1_w1">
<Coords points="158,77 171,77 171,90 158,90"/>
<TextEquiv conf="0.4300000072">
<Unicode>B</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l1_w2">
<Coords points="175,77 210,77 210,91 175,91"/>
<TextEquiv conf="0.6600000262">
<Unicode>AT</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l1_w3">
<Coords points="214,77 262,77 262,92 214,92"/>
<TextEquiv conf="0.55400002">
<Unicode>(U.K.</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l1_w4">
<Coords points="267,82 301,82 301,91 267,91"/>
<TextEquiv conf="0.4833333194">
<Unicode>and</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l1_w5">
<Coords points="307,77 404,77 404,93 307,93"/>
<TextEquiv conf="0.4828571379">
<Unicode>EXPORT)</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l1_w6">
<Coords points="408,83 471,83 471,92 408,92"/>
<TextEquiv conf="0.3557142913">
<Unicode>limited</Unicode>
</TextEquiv>
</Word>
</TextLine>
<TextLine id="Page1_Block5_l2">
<Coords points="158,110 471,110 471,121 158,121"/>
<Word id="Page1_Block5_l2_w1">
<Coords points="158,110 201,110 201,120 158,120"/>
<TextEquiv conf="0.5533333421">
<Unicode>Export</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l2_w2">
<Coords points="205,110 242,110 242,119 205,119"/>
<TextEquiv conf="0.3759999871">
<Unicode>House</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l2_w3">
<Coords points="250,110 297,110 297,120 250,120"/>
<TextEquiv conf="0.2683333457">
<Unicode>Woking</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l2_w4">
<Coords points="305,110 347,110 347,121 305,121"/>
<TextEquiv conf="0.6050000191">
<Unicode>Surrey</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l2_w5">
<Coords points="351,110 412,110 412,120 351,120"/>
<TextEquiv conf="0.4314285815">
<Unicode>GU211YB</Unicode>
</TextEquiv>
</Word>
<Word id="Page1_Block5_l2_w6">
<Coords points="420,110 471,110 471,121 420,121"/>
<TextEquiv conf="0.3928571343">
<Unicode>England</Unicode>
</TextEquiv>
</Word>
</TextLine>
</TextRegion>
</Page>
【Comments】:
- Please include the code you tried, along with the exact expected result for the sample XML.
Tags: python-3.x xml xml-parsing lxml elementtree
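One possible approach, offered as a minimal sketch rather than a definitive answer: walk every <TextRegion> with the standard library's `xml.etree.ElementTree` and collect the text of all <Unicode> descendants into one list per region. It assumes that "separate lists" means one list per <TextRegion>, that the document has no XML namespace (as in the sample above), and that <TextRegion> elements are not nested inside each other. The function name `unicode_text_by_region`, the `skip_empty` flag, and the shortened `SAMPLE` string are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed-down stand-in for the XML in the question (illustrative only).
SAMPLE = """<Page imageFilename="0000223278.tif" imageHeight="1000" imageWidth="762">
  <TextRegion id="Page1_TopMargin">
    <Property key="Margin" value="Top"/>
  </TextRegion>
  <TextRegion id="Page1_Block5">
    <TextLine id="Page1_Block5_l1">
      <Word id="Page1_Block5_l1_w1">
        <TextEquiv><Unicode>B</Unicode></TextEquiv>
      </Word>
      <Word id="Page1_Block5_l1_w2">
        <TextEquiv><Unicode>AT</Unicode></TextEquiv>
      </Word>
    </TextLine>
  </TextRegion>
</Page>"""


def unicode_text_by_region(root, skip_empty=False):
    """Return one list of <Unicode> strings per <TextRegion>, in document order.

    Assumes <TextRegion> elements are not nested; with nesting, the inner
    words would be counted once for each enclosing region.
    """
    regions = []
    for region in root.iter("TextRegion"):
        # iter() searches all descendants, so Word/TextEquiv depth does not matter.
        words = [(node.text or "").strip() for node in region.iter("Unicode")]
        if words or not skip_empty:
            regions.append(words)
    return regions


root = ET.fromstring(SAMPLE)
all_regions = unicode_text_by_region(root)
non_empty_regions = unicode_text_by_region(root, skip_empty=True)
print(all_regions)
print(non_empty_regions)
```

To read from a file instead, `ET.parse("page.xml").getroot()` would replace `ET.fromstring(SAMPLE)`, where `page.xml` is a placeholder path. If the real file declares a default namespace on its root element, the tag names passed to `iter()` would need the `{namespace-uri}` prefix.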