【Question title】: Need to extract text between tags
【Posted】: 2021-07-28 20:04:19
【Question】:

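I have the following XML. I am trying to extract the text inside <Unicode> within each <TextRegion> tag. That is, between each opening <TextRegion> and closing </TextRegion> tag there may be a tag such as <Unicode> Sample Text </Unicode>.

I want to extract this text and store it in separate lists. Can someone help me? I tried a few things with ElementTree, but I am completely lost.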

<Page imageFilename="0000223278.tif" imageHeight="1000" imageWidth="762">
    <TextRegion id="Page1_TopMargin">
      <Property key="Margin" value="Top"/>
      <Coords points="0,0 762,0 762,14 0,14"/>
    </TextRegion>
    <TextRegion id="Page1_LeftMargin">
      <Property key="Margin" value="Left"/>
      <Coords points="0,14 100,14 100,701 0,701"/>
    </TextRegion>
    <TextRegion id="Page1_RightMargin">
      <Property key="Margin" value="Right"/>
      <Coords points="677,14 762,14 762,701 677,701"/>
    </TextRegion>
    <TextRegion id="Page1_BottomMargin">
      <Property key="Margin" value="Bottom"/>
      <Coords points="0,701 762,701 762,1000 0,1000"/>
    </TextRegion>
    <TextRegion id="Page1_PrintSpace">
      <Property key="Margin" value=""/>
      <Coords points="100,14 677,14 677,701 100,701"/>
    </TextRegion>
    <TextRegion id="Page1_Block1">
      <Property key="language" value="en-US"/>
      <Coords points="247,26 277,26 277,51 247,51"/>
      <TextLine id="Page1_Block1_l1">
        <Coords points="247,26 275,26 275,49 247,49"/>
        <Word id="Page1_Block1_l1_w1">
          <Coords points="247,26 275,26 275,49 247,49"/>
          <TextEquiv conf="0.1650000066">
            <Unicode>r&gt;</Unicode>
          </TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
    <ImageRegion id="Page1_Block2">
      <Coords points="476,14 501,14 501,59 476,59"/>
    </ImageRegion>
    <TextRegion id="Page1_Block3">
      <Property key="ComposedBlock" value="Page1_Block4 Page1_Block5"/>
      <Coords points="100,73 476,73 476,123 100,123"/>
    </TextRegion>
    <ImageRegion id="Page1_Block4">
      <Coords points="100,73 148,73 148,113 100,113"/>
    </ImageRegion>
    <TextRegion id="Page1_Block5">
      <Property key="language" value="en-US"/>
      <Coords points="155,75 476,75 476,123 155,123"/>
      <TextLine id="Page1_Block5_l1">
        <Coords points="158,77 471,77 471,93 158,93"/>
        <Word id="Page1_Block5_l1_w1">
          <Coords points="158,77 171,77 171,90 158,90"/>
          <TextEquiv conf="0.4300000072">
            <Unicode>B</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l1_w2">
          <Coords points="175,77 210,77 210,91 175,91"/>
          <TextEquiv conf="0.6600000262">
            <Unicode>AT</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l1_w3">
          <Coords points="214,77 262,77 262,92 214,92"/>
          <TextEquiv conf="0.55400002">
            <Unicode>(U.K.</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l1_w4">
          <Coords points="267,82 301,82 301,91 267,91"/>
          <TextEquiv conf="0.4833333194">
            <Unicode>and</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l1_w5">
          <Coords points="307,77 404,77 404,93 307,93"/>
          <TextEquiv conf="0.4828571379">
            <Unicode>EXPORT)</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l1_w6">
          <Coords points="408,83 471,83 471,92 408,92"/>
          <TextEquiv conf="0.3557142913">
            <Unicode>limited</Unicode>
          </TextEquiv>
        </Word>
      </TextLine>
      <TextLine id="Page1_Block5_l2">
        <Coords points="158,110 471,110 471,121 158,121"/>
        <Word id="Page1_Block5_l2_w1">
          <Coords points="158,110 201,110 201,120 158,120"/>
          <TextEquiv conf="0.5533333421">
            <Unicode>Export</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l2_w2">
          <Coords points="205,110 242,110 242,119 205,119"/>
          <TextEquiv conf="0.3759999871">
            <Unicode>House</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l2_w3">
          <Coords points="250,110 297,110 297,120 250,120"/>
          <TextEquiv conf="0.2683333457">
            <Unicode>Woking</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l2_w4">
          <Coords points="305,110 347,110 347,121 305,121"/>
          <TextEquiv conf="0.6050000191">
            <Unicode>Surrey</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l2_w5">
          <Coords points="351,110 412,110 412,120 351,120"/>
          <TextEquiv conf="0.4314285815">
            <Unicode>GU211YB</Unicode>
          </TextEquiv>
        </Word>
        <Word id="Page1_Block5_l2_w6">
          <Coords points="420,110 471,110 471,121 420,121"/>
          <TextEquiv conf="0.3928571343">
            <Unicode>England</Unicode>
          </TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
</Page>

【Comments】:

  • Please include the code you have tried and the exact expected result for the sample XML.

Tags: python-3.x xml xml-parsing lxml elementtree


【Solution 1】:

Reading up on XPath may help you: https://www.w3schools.com/xml/xpath_intro.asp

Although your goal is somewhat unclear from the question, here is how to extract all the text inside <Unicode> tags that sit within <TextRegion> tags.

In [1]: from lxml import etree

In [2]: tree = etree.parse("page.xml")  # parse() returns an ElementTree, not the root element

In [3]: tree.xpath('//TextRegion//Unicode/text()')
Out[3]: 
['r>',
 'B',
 'AT',
 '(U.K.',
 'and',
 'EXPORT)',
 'limited',
 'Export',
 'House',
 'Woking',
 'Surrey',
 'GU211YB',
 'England']
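
The question asks for the text to be stored in separate lists, one per <TextRegion>, whereas the XPath above returns a single flat list. One possible way to group the words by region is sketched below using the standard library's xml.etree.ElementTree. The SAMPLE string is a shortened stand-in for the question's XML, and the helper name texts_by_region is made up for this illustration. It assumes the tags carry no XML namespace, as in the posted sample; if the real file declares one, the tag names would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the XML in the question (assumption: same nesting,
# no XML namespace declared on <Page>).
SAMPLE = """\
<Page imageFilename="0000223278.tif" imageHeight="1000" imageWidth="762">
  <TextRegion id="Page1_TopMargin">
    <Coords points="0,0 762,0 762,14 0,14"/>
  </TextRegion>
  <TextRegion id="Page1_Block1">
    <TextLine id="Page1_Block1_l1">
      <Word id="Page1_Block1_l1_w1">
        <TextEquiv conf="0.1650000066"><Unicode>r&gt;</Unicode></TextEquiv>
      </Word>
    </TextLine>
  </TextRegion>
  <TextRegion id="Page1_Block5">
    <TextLine id="Page1_Block5_l1">
      <Word id="Page1_Block5_l1_w1"><TextEquiv><Unicode>B</Unicode></TextEquiv></Word>
      <Word id="Page1_Block5_l1_w2"><TextEquiv><Unicode>AT</Unicode></TextEquiv></Word>
    </TextLine>
  </TextRegion>
</Page>
"""


def texts_by_region(root):
    """Map each TextRegion id to the list of its <Unicode> texts.

    Hypothetical helper for illustration. Regions that contain no
    <Unicode> text (margins, composed blocks) are left out.
    """
    regions = {}
    for region in root.iter("TextRegion"):
        # iter() walks all descendants, so <Unicode> is found at any depth.
        words = [u.text for u in region.iter("Unicode") if u.text]
        if words:
            regions[region.get("id")] = words
    return regions


root = ET.fromstring(SAMPLE)
result = texts_by_region(root)
print(result)
# {'Page1_Block1': ['r>'], 'Page1_Block5': ['B', 'AT']}
```

For a file on disk, ET.parse("page.xml").getroot() could be passed to the same helper. If a plain list of lists is preferred over a dictionary, list(result.values()) gives one list per region that contains text.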
