Windmill not getting all HTML content

【Question title】: Windmill not getting all HTML content
【Posted】: 2012-03-09 10:33:23
【Question description】:

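I am trying to scrape data from a web page using the Python Windmill framework, but I am having trouble getting the contents of an HTML table from the page. The table is generated by JavaScript, which is why I am using Windmill to fetch the content. However, the returned content does not include the table, and if I try to parse the content with BeautifulSoup, it results in an error.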

from windmill.authoring import WindmillTestClient
from BeautifulSoup import BeautifulSoup

from copy import copy
import re

def get_massage():
    my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
    my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
    my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
    return my_massage

def test_scrape():
    my_massage = get_massage()
    client = WindmillTestClient(__name__)
    client.open(url='http://marinetraffic.com/ais/datasheet.aspx?MMSI=636092060&TIMESTAMP=2&menuid=&datasource=POS&app=&mode=&B1=Search')
    client.waits.forPageLoad(timeout='60000')
    html = client.commands.getPageText()
    assert html['status']
    assert html['result']
    soup=BeautifulSoup(html['result'],markupMassage=my_massage)
    print soup.prettify()

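When you look at the soup output, the table is missing, but if you inspect the page content with something like Firebug, it shows up. Overall, I am trying to get the table contents and parse them into some kind of data structure for further processing. Any help is much appreciated!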

【Question comments】:

Tags: python screen-scraping web-scraping beautifulsoup windmill

【Solution 1】:

The problem is that the markup massage you are using does not suit the page you are working with; that is, it removes more HTML than it should.

To verify that BeautifulSoup is able to parse the page you need, I simply tried this:

soup = BeautifulSoup(html['result'])
soup.table

That worked fine, so in this case it seems no markup massage is needed at all.
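
To illustrate how a massage pattern can remove more than intended, here is a minimal sketch using only the `re` module. The `alt=".+">` pattern is copied from `get_massage()` in the question. The sample markup and the tighter `alt="[^"]*">` variant are assumptions made up for this illustration; they are not taken from the real page or from the original post.

```python
import re

# Pattern copied from get_massage() in the question.
GREEDY_ALT = re.compile(u'alt=".+">')

# Hypothetical tighter variant (an assumption, not from the post):
# stop at the first closing quote instead of the last one on the line.
TIGHT_ALT = re.compile(u'alt="[^"]*">')

# Made-up single-line markup standing in for the real page.
SAMPLE = u'<img alt="logo"><table><tr><td class="x">42</td></tr></table>'

# ".+" is greedy, so the match runs from the first alt=" up to the
# last "> on the same line and swallows the opening table tags.
greedy_result = GREEDY_ALT.sub(u">", SAMPLE)

# The tighter pattern only consumes the alt attribute itself.
tight_result = TIGHT_ALT.sub(u">", SAMPLE)

print(greedy_result)
print(tight_result)
```

With this sample, the greedy pattern drops `<table>`, `<tr>`, and the opening `<td>` tag, while the tighter pattern leaves the table intact. Whether the real page triggers the same effect depends on how its markup is laid out across lines, since `.` does not match a newline by default.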

【Discussion】: