【Question title】: Parsing forum posts using lxml/python
【Posted】: 2015-03-04 16:17:56
【Question】:

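When I run the code below, it splits a single div into fifteen items in the resulting list. I want the whole post as one item instead. This probably happens because of the <br> tags, but I don't know how to fix it.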

from lxml import html
import requests

page = requests.get('http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage.html')

tree = html.fromstring(page.text)

details = tree.xpath('//div[contains(@id, "post_message_33583236")]/text()')

print(len(details))  # prints 15

【Comments】:

    Tags: python parsing web-scraping lxml lxml.html


    【Solution 1】:

    Select the element itself with XPath (not its text nodes) and call the text_content() method:

    details = tree.xpath('.//div[contains(@id, "post_message_33583236")]')[0]
    print(details.text_content())
    

    Prints:

    With all the talk about raising the minimum wage, I think the real issue is that people are not getting a liveable wage anymore.  This applies to many skilled people too in which their job tries to pay them $10-13hr for $20-30hr type of work.
    
    Not everyone deserves a raise at walmart or other low paying jobs.  I  think everyone should atleast prove themselves for 6 months to year then  start to gradually get a raise. You cant act a fool and get paid the same as people who work hard and try to move up in life. Even if walmart workers weren't making minimum wage and making  $11hr, you cant really do much making 22k a year other than live in a  cheap/borderline crime infested area
    
    $11hr gets you about $1250 a month after taxes and health coverage at most jobs and ill list just the basic necessities in life
    ...
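
The difference between the two approaches can be reproduced offline. The following is a minimal sketch, assuming a small hand-written HTML string (`SAMPLE` is hypothetical and only borrows the id from the question) in place of the live forum page. It shows that `/text()` returns one string per text node, so each `<br>` splits the post, while `text_content()` returns a single string. Joining the text nodes with newlines is shown as an alternative, not as part of the original answer.

```python
from lxml import html

# Hypothetical stand-in for the forum page; the id mirrors the one in the question.
SAMPLE = (
    '<html><body>'
    '<div id="post_message_33583236">'
    'First paragraph.<br><br>'
    'Second paragraph.<br>'
    'Third line.'
    '</div>'
    '</body></html>'
)

tree = html.fromstring(SAMPLE)

# /text() returns one string per direct text node, so every <br> splits the post.
pieces = tree.xpath('//div[@id="post_message_33583236"]/text()')

# Selecting the element and calling text_content() yields a single string.
post = tree.xpath('//div[@id="post_message_33583236"]')[0]
whole = post.text_content()

# Alternative: join the text nodes to keep a line break where the <br> tags were.
joined = "\n".join(piece.strip() for piece in pieces if piece.strip())

print(len(pieces))
print(whole)
print(joined)
```

Note that `text_content()` simply concatenates the text, so in this sample nothing separates the paragraphs. If the line breaks matter, joining the text nodes (or calling `itertext()` on the element) may be the better fit.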
    

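    【Discussion】:

    • Thanks for the reply. One more question: when I write print len(details), it prints 24. Why is that?
    • @Simon That is the number of children this node has. If the answer helped, don't forget to accept it. Thanks.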
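
To illustrate the reply above: calling `len()` on an lxml element counts its direct child elements (here mostly `<br>` tags), not its text nodes. This is a small sketch on a hypothetical fragment, not on the actual forum page, so the numbers differ from the 24 reported in the comment.

```python
from lxml import html

# Hypothetical fragment: a post with three <br> children and one <b> child.
post = html.fromstring(
    '<div id="post">one<br>two<br>three<br><b>four</b></div>'
)

# len() on an lxml element counts its direct child elements, not text nodes.
child_count = len(post)

# The tags of those children, in document order.
child_tags = [child.tag for child in post]

# XPath ./text() instead selects the direct text nodes of the element.
text_nodes = post.xpath('./text()')

print(child_count, child_tags, len(text_nodes))
```

This is why the two counts in the thread differ: the list returned by `/text()` is measured in text nodes, while the element returned by the answer's XPath is measured in child elements.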