【Posted】: 2011-05-21 05:42:08
【Question】:
I'm writing a simple script to fetch the large gray table from here.
My code is as follows:
import urllib2
from lxml import etree
html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read()
root = etree.XML(html)
But I get an error on the last statement.
Traceback (most recent call last):
  File "D:\Workspace\afi100\afi100.py", line 13, in <module>
    root = etree.XML(html)
  File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71683)
XMLSyntaxError: Space required after the Public Identifier, line 3, column 59
Any idea how to fix this error?
Thanks.
【Comments】:
-
Do you think using an XML parser to parse HTML is a good idea?
-
You should use any of the available HTML-to-XML (XHTML) tools.
-
I mistakenly assumed HTML was a subset of XML (it isn't, but XHTML is). techforum4u.com/content.php/… has a good description of the main differences.
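To illustrate the point made in the comments, here is a minimal sketch of parsing HTML with lxml's HTML parser instead of its strict XML parser. `SAMPLE_HTML` is a made-up stand-in for the downloaded page (the real page's markup is not shown in the question), and the helper names `parse_as_xml` and `table_rows` are hypothetical. The sample uses an HTML 4 style DOCTYPE and an unclosed `<br>`, both of which are fine in HTML but not well-formed XML.

```python
from lxml import etree

# Hypothetical stand-in for the page body; in the question's script this
# would be the bytes returned by urlopen(...).read().
SAMPLE_HTML = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><title>Sample</title></head>
<body>
<p>Intro line<br>second line</p>
<table>
<tr><td>1</td><td>First film</td></tr>
<tr><td>2</td><td>Second film</td></tr>
</table>
</body>
</html>"""


def parse_as_xml(markup):
    """Return True if lxml's strict XML parser accepts the markup."""
    try:
        etree.XML(markup)
        return True
    except etree.XMLSyntaxError:
        return False


def table_rows(markup):
    """Parse with lxml's HTML parser and return the cell texts of each row."""
    root = etree.HTML(markup)
    return [
        [(cell.text or "").strip() for cell in row.xpath("./td")]
        for row in root.xpath("//table//tr")
    ]


xml_ok = parse_as_xml(SAMPLE_HTML)  # strict XML parsing of HTML markup
rows = table_rows(SAMPLE_HTML)      # tolerant HTML parsing of the same bytes
print(xml_ok)
print(rows)
```

`etree.HTML` and `lxml.html.fromstring` both use the tolerant HTML parser, so either should work here. The XPath for the actual page would depend on its real table structure, which this sketch does not attempt to guess.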
Tags: python html-parsing lxml