【Question title】: BeautifulSoup "malformed start tag" error
【Posted】: 2011-03-08 12:06:27
【Description】:
>>> soup = BeautifulSoup( data )
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
        self._feed(isHTML=isHTML)
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
        self.builder.feed(markup)
      File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: malformed start tag, at line 5518, column 822



>>> for each in l[5515:5520]:
...     print each
... 
<script>

  registerImage("original_image", "http://ecx.images-amazon.com/images/I/41h7uHc1jmL._SL500_AA240_.jpg","<a href="+'"'+"http://www.amazon.com/gp/product/images/1592406017/ref=dp_image_0?ie=UTF8&n=283155&s=books"+'"'+" target="+'"'+"AmazonHelp"+'"'+" onclick="+'"'+"return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+"  ><img onload="+'"'+"if (typeof uet == 'function') { uet('af'); }"+'"'+" src="+'"'+"http://ecx.images-amazon.com/images/I/41h7uHc1jmL._SL500_AA240_.jpg"+'"'+" id="+'"'+"prodImage"+'"'+"  width="+'"'+"240"+'"'+" height="+'"'+"240"+'"'+"   border="+'"'+"0"+'"'+" alt="+'"'+"Life, on the Line: A Chef's Story of Chasing Greatness, Facing Death, and Redefining the Way We Eat"+'"'+" onmouseover="+'"'+""+'"'+" /></a>", "<br /><a href="+'"'+"http://www.amazon.com/gp/product/images/1592406017/ref=dp_image_text_0?ie=UTF8&n=283155&s=books"+'"'+" target="+'"'+"AmazonHelp"+'"'+" onclick="+'"'+"return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+"  >See larger image</a>", "");
  var ivStrings = new Object();
</script>
>>> 
>>> l[5518-1][822]
'h'
>>> 

Note: I am using Python 2.6.5 on Ubuntu 10.04.

Shouldn't BeautifulSoup ignore the contents of script tags?
I can't figure out a way around this :(
Any suggestions?

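【Comments】:

  • Even a way to strip out all the script tags would do! I had no luck with re.sub, and later found out that regular expressions are a poor fit for HTML because HTML is not a regular language :X

Tags: python html-parsing beautifulsoup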


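【Answer 1】:

Pyparsing has some built-in support for HTML tags, which makes a script like this more robust than a plain regular expression. And because it does not try to parse or process the whole HTML body, but only looks for matching string expressions, it can cope with malformed HTML: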

html = """<script>    
registerImage("original_image", 
"this is a closing </script> tag in quotes"
etc....
</script>
"""

# code to strip <script> tags from an HTML page
from pyparsing import makeHTMLTags,SkipTo,quotedString

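# makeHTMLTags returns expressions for the opening and the closing tag
script,scriptEnd = makeHTMLTags("script")
# skip ahead to the closing tag, ignoring any "</script>" inside quoted strings
scriptBody = script + SkipTo(scriptEnd, ignore=quotedString) + scriptEnd

# suppress() drops every match, so transformString returns the HTML without script blocks
descriptedHtml = scriptBody.suppress().transformString(html)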

Depending on the kind of HTML scraping you are trying to do, you may be able to do the whole job with pyparsing.
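If pyparsing is not available, the same idea (skip to the closing tag while ignoring quoted strings) can be approximated with the standard library. The following is only a rough sketch, not part of the answer above: `strip_scripts`, `SCRIPT_OPEN` and `BODY_TOKEN` are names made up for this illustration, and the quoted-string handling is a heuristic that assumes string literals stay on one line.

```python
import re

# Opening <script ...> tag, attributes allowed, case-insensitive.
SCRIPT_OPEN = re.compile(r"<script\b[^>]*>", re.IGNORECASE)

# Inside a script body we look for either a quoted string (which is
# skipped over) or the closing tag. Heuristic: strings must not span lines.
BODY_TOKEN = re.compile(
    r"""
      "(?:\\.|[^"\\\n])*"        # double-quoted string
    | '(?:\\.|[^'\\\n])*'        # single-quoted string
    | (?P<close></script\s*>)    # the real closing tag
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    out = []
    pos = 0
    while True:
        start = SCRIPT_OPEN.search(html, pos)
        if start is None:
            out.append(html[pos:])  # no more scripts: keep the rest
            break
        out.append(html[pos:start.start()])  # keep text before the script
        end = None
        for token in BODY_TOKEN.finditer(html, start.end()):
            if token.group("close"):
                end = token.end()
                break
        if end is None:
            break  # unterminated script: drop everything after it
        pos = end
    return "".join(out)


if __name__ == "__main__":
    page = '<p>a</p><script>var s = "</script>";</script><p>b</p>'
    print(strip_scripts(page))
```

The cleaned string can then be handed to the HTML parser. Note that this deliberately mirrors the answer's behaviour of ignoring a quoted `</script>`; it makes no attempt to understand JavaScript comments or regex literals.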

【Comments】:

【Answer 2】:

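When I run into problematic script tags in BeautifulSoup, I usually convert the soup object back to a string, remove the offending data, and then re-soup the result. This works when you don't care about that data.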
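One way to read "remove the offending data and parse again" is as a retry loop. This is a sketch under stated assumptions, not the answerer's actual code: `parse_dropping_bad_lines` is a hypothetical helper, and it assumes the parser raises an exception carrying a 1-based `lineno` attribute, as Python 2's `HTMLParseError` did. `ToyParseError` and `toy_parse` are stand-ins invented purely so the example runs on its own.

```python
def parse_dropping_bad_lines(markup, parse, max_retries=10):
    """Call parse(markup); on a positional error, blank that line and retry.

    `parse` is any callable that raises an exception with a 1-based
    `lineno` attribute when it cannot handle the markup (assumed interface).
    """
    lines = markup.splitlines(True)  # keep the line endings
    for _ in range(max_retries + 1):
        try:
            return parse("".join(lines))
        except Exception as exc:
            lineno = getattr(exc, "lineno", None)
            if lineno is None or not 1 <= lineno <= len(lines):
                raise  # not a positional parse error: give up
            # Blank the line instead of deleting it, so the line numbers
            # reported on later attempts still match this list.
            lines[lineno - 1] = "\n"
    raise ValueError("markup still unparseable after %d retries" % max_retries)


class ToyParseError(Exception):
    """Hypothetical stand-in for a parser error that reports a line."""

    def __init__(self, lineno):
        Exception.__init__(self, "malformed start tag, at line %d" % lineno)
        self.lineno = lineno


def toy_parse(markup):
    """Hypothetical parser: rejects lines that build an <a> tag in a string."""
    for number, line in enumerate(markup.splitlines(), 1):
        if "registerImage" in line and "<a href=" in line:
            raise ToyParseError(number)
    return markup


if __name__ == "__main__":
    page = '<p>ok</p>\nregisterImage("<a href=x>");\n<p>end</p>\n'
    print(parse_dropping_bad_lines(page, toy_parse))
```

As the answer says, this only makes sense when the dropped lines hold nothing you need, since each retry silently discards a whole line of input.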

【Comments】:
