【Question Title】: How to parse text from HTML using lxml?
【Posted】: 2012-12-06 14:35:14
【Question】:
<p>
    Glassware veteran
    <strong>Corning </strong>
    (
    <span class="ticker">
      NYSE:
      <a class="qsAdd qs-source-isssitthv0000001" href="http://caps.fool.com/Ticker/GLW.aspx?source=isssitthv0000001" data-id="203758">GLW</a>
    </span>
    <a class="addToWatchListIcon qsAdd qs-source-iwlsitbut0000010" href="http://my.fool.com/watchlist/add?ticker=&source=iwlsitbut0000010" title="Add to My Watchlist"> </a>
    ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
</p>

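I want to extract "Glassware veteran" and ") has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?"

Using this code: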

tnode = root.xpath("/p")
content = tnode.text

I only get "Glassware veteran". Why?

【Question Comments】:

    Tags: python text xpath lxml


    【Solution 1】:

    Something like this might get you what you want:

    >>> tnode = root.xpath('/p')[0]  # xpath() returns a list; take the first match
    >>> content = tnode.xpath('text()')
    >>> print(''.join(content))
    
    Glassware veteran
    
    (
    
    
    ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
    >>>
    

    If you want all of the text nodes, including those inside child elements, just use //text() instead of text():

    >>> print(' '.join([x.strip() for x in root.xpath('//text()')]))
    Glassware veteran Corning ( NYSE: GLW    ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
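
As for why `.text` alone stops after the first fragment: in the ElementTree model that lxml follows, an element's `.text` holds only the text before its first child, and any text after a child's closing tag is stored on that child as `.tail`. Here is a minimal sketch of that split. It uses a simplified, well-formed stand-in for the paragraph (the `SNIPPET` string is an assumption, not the original markup) and falls back to the standard library's `xml.etree.ElementTree` if lxml is not installed.

```python
# lxml.etree mirrors the standard ElementTree API; fall back to the
# standard library so the sketch also runs where lxml is not installed.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Simplified, well-formed stand-in for the paragraph in the question.
SNIPPET = (
    "<p>Glassware veteran <strong>Corning</strong> "
    "(<span>NYSE: <a>GLW</a></span>) has fallen on hard times lately.</p>"
)

p = etree.fromstring(SNIPPET)

# .text is only the text that comes before the first child element.
first_chunk = p.text

# Text that follows a child's closing tag is stored on that child as .tail.
tails = [child.tail for child in p]

print(repr(first_chunk))
print(tails)
```

So the rest of the sentence is not lost. It sits in the `.tail` of `<strong>` and `<span>`, which is why `text()` in XPath (which returns both kinds of fragment) finds it while `.text` does not.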
    

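    【Comments】:

    • Thanks a lot. But now I have a new problem: I want to get "Glassware veteran Corning (NYSE: GLW) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?" Using the code: tnode = root.xpath('/p | /p/strong | /p/a | /p/span') content = tnode.xpath('text()') print ''.join(content) the result is "Glassware veteran ( ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback? Corning NYSE: GLW". Do you have any ideas? Thanks.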
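
One possible way to get the fragments in reading order, as the comment above asks, is to walk the subtree with `itertext()` instead of collecting `text()` separately for each matched element, which likely groups the fragments by element rather than interleaving them. The sketch below is only an illustration under assumptions: `SNIPPET` is a trimmed copy of the paragraph with shortened attributes, and `flatten_text` is a hypothetical helper name, not part of lxml.

```python
# Same import fallback idea: itertext() exists in both lxml.etree and
# the standard library's ElementTree.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Trimmed, well-formed copy of the paragraph (attributes shortened).
SNIPPET = """<p>
    Glassware veteran
    <strong>Corning </strong>
    (
    <span class="ticker">
      NYSE:
      <a href="#">GLW</a>
    </span>
    <a title="Add to My Watchlist"> </a>
    ) has fallen on hard times lately.
</p>"""

p = etree.fromstring(SNIPPET)


def flatten_text(element):
    """Join every text fragment under element in document order."""
    # itertext() yields .text and .tail strings in the order they appear.
    # split() followed by join collapses whitespace runs, newlines included.
    return " ".join("".join(element.itertext()).split())


result = flatten_text(p)
print(result)
# Glassware veteran Corning ( NYSE: GLW ) has fallen on hard times lately.
```

The spaces just inside the parentheses come from the whitespace in the source markup. Removing them would need an extra cleanup step that is specific to this text.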