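【Question title】: Breaking apart xml tags with the lxml module
【Posted】: 2017-01-30 18:49:34
【Question】:

I am trying to split up the tags (which all sit on one line) so that I can print the different tags and their contents on separate lines, while ignoring <br>.

Here is a snippet of a few of the lines I am trying to break apart.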

<stats>+40 Ability Power<br>+25 Magic Resist<br>+20% Cooldown Reduction<br><mana>+75% Base Mana Regen </mana></stats><br><br><unique>UNIQUE Passive:</unique> Gain 20% of the <a href='premitigation'><font color='#6666FF'><u>premitigation</u></font></a> damage dealt to champions as Blood Charges, up to <levelScale>100 - 250</levelScale>  max. Healing or shielding another ally consumes charges to heal them, up to the original effect amount.<br><unique>UNIQUE Passive - Harmony:</unique> Grants bonus % Base Health Regen equal to your bonus % Base Mana Regen.<br><br><rules>(Maximum amount of Blood Charges stored is based on level. Healing amplification is applied to the total heal value.)</rules>
<stats>+10% Critical Strike Chance</stats>
<stats>+45 Attack Damage<br>+10% Life Steal</stats><br><br><unique>UNIQUE Passive:</unique> Basic attacks grant +6 Attack Damage and +1% Life Steal for 8 seconds on hit (effect stacks up to 5 times).
<stats>+300 Health<br>+50 Attack Damage<br>+20% Cooldown Reduction</stats><br><br><unique>UNIQUE Passive:</unique> Dealing physical damage to an enemy champion Cleaves them, reducing their Armor by 5% for 6 seconds (stacks up to 6 times, up to 30%).<br><unique>UNIQUE Passive - Rage:</unique> Dealing physical damage grants 20 movement speed for 2 seconds. Assists on Cleaved enemy champions or kills on any unit grant 60 movement speed for 2 seconds instead. This Movement Speed is halved for ranged champions.

But when I try to parse the XML-like string with the lxml module:

root = etree.Element(string)

it gives me this error:

ValueError: Invalid tag name "<stats>+40 Attack Damage<br>+80 Ability Power</stats><br><br><unique>UNIQUE Passive:</unique> Heal for 15% of damage dealt. This is 33% as effective for Area of Effect damage.<br><active>UNIQUE Active - Lightning Bolt:</active> Deals 250 (+30% of Ability Power) magic damage and slows the target champion's Movement Speed by 40% for 2 seconds (40 second cooldown, shared with other <font color='#9999FF'><a href='itembolt'>Hextech</a></font> items)."

【Comments】:

    Tags: python xml lxml


    【Solution 1】:

    The input you show is not XML, so an XML parser will not be able to read it.
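To make this concrete, here is a minimal sketch. The shortened `snippet` is taken from the sample data in the question, and the variable names are illustrative only. `etree.Element(tag)` creates a new, empty element named `tag` rather than parsing anything, which is why the call in the question raises `ValueError: Invalid tag name`. `etree.fromstring` does parse, but lxml's default XML parser rejects the `<br>` that is never closed.

```python
from lxml import etree

# Shortened sample from the question; note the <br> that is never closed.
snippet = "<stats>+45 Attack Damage<br>+10% Life Steal</stats>"

# etree.Element() builds an element from a tag name, so markup is not a valid argument.
try:
    etree.Element(snippet)
    element_error = None
except ValueError as exc:
    element_error = exc

# etree.fromstring() parses, but the strict XML parser stops at the mismatched tags.
try:
    etree.fromstring(snippet)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

print(type(element_error).__name__)
print(type(xml_error).__name__)
```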

    Use an HTML parser instead; HTML parsers accept a much wider range of input. See http://lxml.de/parsing.html#parsing-html

    from lxml import etree
    from io import StringIO
    
    test_string = "<stats>+40 Ability Power<br>+25 Magic Resist<br>+20% Cooldown Reduction<br><mana>+75% Base Mana Regen </mana></stats><br><br><unique>UNIQUE Passive:</unique> Gain 20% of the <a href='premitigation'><font color='#6666FF'><u>premitigation</u></font></a> damage dealt to champions as Blood Charges, up to <levelScale>100 - 250</levelScale>  max. Healing or shielding another ally consumes charges to heal them, up to the original effect amount.<br><unique>UNIQUE Passive - Harmony:</unique> Grants bonus % Base Health Regen equal to your bonus % Base Mana Regen.<br><br><rules>(Maximum amount of Blood Charges stored is based on level. Healing amplification is applied to the total heal value.)</rules>"
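    html_file = StringIO(test_string)  # wrap the string so it can be read like a file
    
    parser = etree.HTMLParser()  # lenient parser that tolerates the unclosed <br> tags
    doc = etree.parse(html_file, parser)
    
    print(doc)
    # prints something like: <lxml.etree._ElementTree object at 0x03036A58>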
    

    After that, you can search and modify the document tree with all the methods lxml provides.
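As one possible way to get the output the question asks for, the sketch below walks the parsed tree and collects each tag with its text, skipping the `<br>` elements themselves. The helper name `pieces` and the shortened `snippet` are illustrative assumptions, not part of the original answer. It relies on lxml's `.text` (text before the first child) and `.tail` (text after an element's closing tag) attributes.

```python
from lxml import etree

# Shortened sample from the question.
snippet = ("<stats>+45 Attack Damage<br>+10% Life Steal</stats><br><br>"
           "<unique>UNIQUE Passive:</unique> Basic attacks grant +6 Attack Damage.")

root = etree.fromstring(snippet, etree.HTMLParser())

def pieces(element):
    """Yield (tag, text) pairs in document order, never reporting <br> as a tag."""
    if element.tag != "br" and element.text and element.text.strip():
        yield element.tag, element.text.strip()
    for child in element:
        yield from pieces(child)
        # Text after a child's closing tag belongs to the enclosing element.
        if child.tail and child.tail.strip():
            yield element.tag, child.tail.strip()

lines = list(pieces(root))
for tag, text in lines:
    print(f"{tag}: {text}")
```

Because the HTML parser wraps the fragment in `<html><body>`, text that sits outside any of the custom tags is reported under `body`.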

    Note: StringIO is only needed in the example above to make the string usable like a file. If you already have a file, you do not need StringIO; use the file directly.
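A short sketch of both cases follows, under the assumption that the markup is plain ASCII as in the question. `etree.HTML()` parses directly from a string, and `etree.parse()` accepts a file path as well as a file object. The temporary file here only stands in for a real file on disk.

```python
import os
import tempfile
from lxml import etree

test_string = "<stats>+10% Critical Strike Chance</stats>"

# Case 1: the markup is already a string. etree.HTML() parses it directly
# and returns the root <html> element, so no StringIO wrapper is needed.
root_from_string = etree.HTML(test_string)

# Case 2: the markup lives in a file. A temporary file stands in for a real one.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as handle:
    handle.write(test_string)
    path = handle.name

try:
    doc = etree.parse(path, etree.HTMLParser())  # a path works as well as a file object
    root_from_file = doc.getroot()
finally:
    os.remove(path)

print(root_from_string.find(".//stats").text)
print(root_from_file.find(".//stats").text)
```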

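    If you are actually looking for a web scraping solution, look at libraries built for that purpose. Scrapy and pyQuery come to mind; they do all the parsing for you and offer a nicer interface for accessing the data on a page.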

    【Comments】:
