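【Posted】: 2015-09-22 09:08:12
【Question】:
I am trying to use ElementTree's event-driven parsing (`ET.iterparse`), as described in its documentation. Testing it against the smaller settings file works fine: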
>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))
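This prints the elements successfully. However, before starting on the actual code, I ran the same loop against "messages.htm" instead of "settings.htm" to check that it works, and got the following: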
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6
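Is this because ET is meant for parsing XML documents rather than HTML? If so, and there is no workaround, I am back to square one. Any suggestions on how to parse this file, and on how to debug the error, would be greatly appreciated!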
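On the debugging part of the question: `ParseError` exposes a `position` attribute holding the (line, column) where the XML parser gave up, which is the same information shown at the end of the traceback. The sketch below is illustrative only. `SAMPLE` and `first_parse_error` are hypothetical names, and the fragment is an assumed example of HTML that is not well-formed XML, not the real contents of messages.htm.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment: acceptable HTML, but not well-formed XML
# (unquoted attribute value, <br> never closed).
SAMPLE = '<html><body><p class=note>hello<br>world</p></body></html>'


def first_parse_error(text):
    """Return (line, column, message) for the first XML error, or None."""
    source = io.BytesIO(text.encode("utf-8"))
    try:
        # Drain the iterator; errors are raised lazily while iterating.
        for _event, _element in ET.iterparse(source, events=("start", "end")):
            pass
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


error = first_parse_error(SAMPLE)
if error is not None:
    line, column, message = error
    print("XML parser stopped at line %d, column %d: %s" % (line, column, message))
```

Looking at the reported line and column in the input file usually shows which HTML construct the XML parser rejected.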
【Comments】:
- Try the HTML parser from lxml.
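As a dependency-free illustration of the same idea (a parser that accepts HTML rather than strict XML), here is a minimal sketch using the standard library's `html.parser.HTMLParser` instead of lxml. This is a swapped-in technique, not what the comment proposes; `EventCollector`, `collect_events`, and the sample markup are hypothetical names and data chosen for the example.

```python
import io
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start/end/text events, loosely mirroring the iterparse loop."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs
            self.events.append(("text", text))


def collect_events(fileobj, chunk_size=64 * 1024):
    """Feed a text-mode file object to the parser chunk by chunk."""
    parser = EventCollector()
    # Tags split across chunks are buffered by the parser; a long text run
    # may still arrive as several separate "text" events.
    for chunk in iter(lambda: fileobj.read(chunk_size), ""):
        parser.feed(chunk)
    parser.close()
    return parser.events


# Assumed sample input: the unquoted attribute and bare <br> are fine here.
events = collect_events(io.StringIO('<p class=note>hello<br>world</p>'))
for event in events:
    print("%5s, %s" % event)
```

Unlike `ET.iterparse`, this does not build an element tree, so memory use stays flat on a large file, but nesting has to be tracked by hand if it matters.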
Tags: python html parsing html-parsing elementtree