【Question title】: Parsing html and js in python using lxml
【Posted】: 2014-03-01 02:36:17
【Question description】:

I am having trouble parsing JS with lxml in Python. When I run the code below, my output is:

""

import urllib2  # Python 2 standard library

import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # strip javascript

text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test

What I want to get is the parsed text without the JS parts. Can someone explain what is going wrong? Thanks.

【Comments on the question】:

    Tags: python parsing lxml


    【Solution 1】:
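    # "import lxml" alone does not load the submodules used below,
    # so import each of them explicitly.
    import lxml.etree
    import lxml.html
    import lxml.html.clean
    import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead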
    
    URL = "http://www.google.com/"
    ENCODING = "latin1"
    
    args = {
        "javascript": True,         # strip javascript
        "page_structure": False,    # leave page structure alone
        "style": True               # remove CSS styling
    }
    cleaner = lxml.html.clean.Cleaner(**args)
    
    # get the page source
    html = urllib2.urlopen(URL).read().decode(ENCODING)
    # clean it up
    clean = cleaner.clean_html(html)
    
    # print unformatted html dump
    print(clean)
    
    # print properly indented html
    tree = lxml.html.fromstring(clean)
    print(lxml.etree.tostring(tree, pretty_print=True))
    

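    Note that lxml.etree.tostring() pretty-prints properly, whereas lxml.html.tostring() does not: it inserts line breaks but does not indent (the original answer showed this in a screenshot).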
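
The question asks for the page text without the JS, while the code above returns cleaned HTML. The following is a minimal Python 3 sketch of one way to get both, not the answerer's own method. It assumes a small hard-coded page (`SAMPLE_HTML`, hypothetical) instead of a live URL, and it swaps `Cleaner` for `lxml.etree.strip_elements()`, which only drops the named tags and does none of the other sanitizing that `Cleaner` performs. The helper name `strip_scripts` is likewise made up for illustration.

```python
import lxml.etree
import lxml.html

# Hypothetical sample page, used here instead of fetching a live URL.
SAMPLE_HTML = """<html><head><title>Demo</title>
<script>var tracker = 1;</script>
<style>p { color: red; }</style></head>
<body><div><p>Hello <b>world</b></p></div></body></html>"""


def strip_scripts(html_text):
    """Parse html_text and drop <script> and <style> elements in place."""
    tree = lxml.html.fromstring(html_text)
    # with_tail=False keeps any text that follows a removed element.
    lxml.etree.strip_elements(tree, "script", "style", with_tail=False)
    return tree


tree = strip_scripts(SAMPLE_HTML)

# Text of the remaining elements only; the JavaScript and CSS are gone.
visible_text = tree.text_content()

# Two serializations of the same tree, so the layouts can be compared.
etree_dump = lxml.etree.tostring(tree, pretty_print=True, encoding="unicode")
html_dump = lxml.html.tostring(tree, pretty_print=True, encoding="unicode")

print(visible_text)
print(etree_dump)
print(html_dump)
```

Printing the element itself only shows its object representation, so the sketch goes through `text_content()` or one of the `tostring()` functions to get readable output. Printing both dumps side by side is a quick way to check the indentation difference described above on your own lxml version.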

    【Comments】:
