How to parse text from HTML using lxml?

【Question title】: How to parse text from html using lxml?
【Posted】: 2012-12-06 14:35:14
【Question description】:

<p>
    Glassware veteran
    <strong>Corning </strong>
    (
    <span class="ticker">
      NYSE:
      <a class="qsAdd qs-source-isssitthv0000001" href="http://caps.fool.com/Ticker/GLW.aspx?source=isssitthv0000001" data-id="203758">GLW</a>
    </span>
    <a class="addToWatchListIcon qsAdd qs-source-iwlsitbut0000010" href="http://my.fool.com/watchlist/add?ticker=&source=iwlsitbut0000010" title="Add to My Watchlist"> </a>
    ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
</p>

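I want to get "Glassware veteran" and ") has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?"

Using this code: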

tnode = root.xpath("/p")[0]  # xpath() returns a list, so take the <p> element
content = tnode.text

I can only get "Glassware veteran". Why?

【Comments】:

Tags: python text xpath lxml

【Solution 1】:

Something like this might get you what you want:

>>> tnode = root.xpath('/p')[0]  # xpath() returns a list of matches
>>> content = tnode.xpath('text()')
>>> print(''.join(content))

Glassware veteran

(


) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
>>>
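As for the "why" in the question: lxml follows the ElementTree model, where `.text` holds only the text before the first child element, and any text after a child's closing tag is stored on that child as `.tail`. The sketch below is an illustration under assumptions, not the asker's actual setup: `FRAGMENT` is a simplified, well-formed stand-in for the `<p>` in the question, and the import falls back to the standard library's ElementTree, which exposes the same `.text` and `.tail` attributes.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: the standard library exposes the same .text/.tail attributes.
    import xml.etree.ElementTree as etree

# Simplified, well-formed stand-in for the <p> element in the question
# (an assumption for illustration; attributes and whitespace are trimmed).
FRAGMENT = (
    '<p>Glassware veteran <strong>Corning </strong>('
    '<span class="ticker">NYSE: <a href="#">GLW</a></span>'
    '<a title="Add to My Watchlist"> </a>'
    ') has fallen on hard times lately.</p>'
)

p = etree.fromstring(FRAGMENT)

# .text is only the text that comes before the first child element.
print(repr(p.text))

# Text following a child's closing tag lives on that child as .tail.
for child in p:
    print(child.tag, repr(child.text), repr(child.tail))

# p.text plus every direct child's tail gives the element's own text nodes,
# which is what tnode.xpath('text()') collects in lxml.
direct_text = (p.text or '') + ''.join(child.tail or '' for child in p)
print(direct_text)
```

So `tnode.text` stops at `<strong>`, while `text()` also picks up the pieces that sit between and after the child elements.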

If you want all of the text nodes, just use //text() instead of text():

>>> print(' '.join([x.strip() for x in root.xpath('//text()')]))
Glassware veteran Corning ( NYSE: GLW    ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
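One caveat worth noting: in XPath a leading `//` is absolute, so `//text()` searches the whole document even when it is called on a single element. To stay inside one element's subtree, `.//text()` is the relative form. A possible alternative that avoids XPath altogether is `itertext()`, which walks the text and tail pieces under an element in document order. The sketch below assumes the same simplified stand-in fragment as an illustration (`FRAGMENT` is not the original page) and falls back to the standard library when lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: ElementTree in the standard library also provides itertext().
    import xml.etree.ElementTree as etree

# Simplified, well-formed stand-in for the <p> element in the question
# (an assumption for illustration only).
FRAGMENT = (
    '<p>Glassware veteran <strong>Corning </strong>('
    '<span class="ticker">NYSE: <a href="#">GLW</a></span>'
    '<a title="Add to My Watchlist"> </a>'
    ') has fallen on hard times lately.</p>'
)

p = etree.fromstring(FRAGMENT)

# itertext() yields every text and tail piece under p, in document order.
pieces = list(p.itertext())
print(pieces)

# Strip each piece and drop the ones that were whitespace only.
words = [piece.strip() for piece in pieces if piece.strip()]
print(' '.join(words))
```

Filtering out the empty pieces avoids the run of spaces that appears after "GLW" in the output above.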

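【Comments】:

Thank you very much. But now I have a new problem: I want to get "Glassware veteran Corning (NYSE: GLW) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?" Using this code:

tnode = root.xpath('/p | /p/strong | /p/a | /p/span')
content = tnode.xpath('text()')
print ''.join(content)

the result is "Glassware veteran () has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback? Corning NYSE: GLW". Do you have any ideas? Thanks.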
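The ordering in that result is probably a consequence of collecting `text()` one element at a time: all of the text nodes of `<p>` come out first, and the text of `<strong>` and `<span>` is appended afterwards instead of being interleaved. One way to get a single sentence in document order is sketched below. It is only an illustration under assumptions: `flatten_text` is a hypothetical helper, not part of lxml, `FRAGMENT` is a simplified stand-in for the original markup, and the parenthesis clean-up is a guess at the desired formatting.

```python
import re

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: ElementTree in the standard library also provides itertext().
    import xml.etree.ElementTree as etree


def flatten_text(element):
    """Hypothetical helper: all text under element, in document order."""
    # Concatenate every text and tail piece, then collapse whitespace runs.
    text = ' '.join(''.join(element.itertext()).split())
    # Remove the spaces left just inside parentheses by the markup.
    text = re.sub(r'\(\s+', '(', text)
    text = re.sub(r'\s+\)', ')', text)
    return text


# Simplified, well-formed stand-in for the <p> element in the question
# (an assumption for illustration only).
FRAGMENT = (
    '<p>Glassware veteran <strong>Corning </strong>('
    '<span class="ticker">NYSE: <a href="#">GLW</a></span>'
    '<a title="Add to My Watchlist"> </a>'
    ') has fallen on hard times lately.</p>'
)

p = etree.fromstring(FRAGMENT)
print(flatten_text(p))
```

With lxml's HTML parser, `text_content()` on the element should give a similar concatenation, so the same whitespace clean-up could be applied to its result.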