etree generating an error when using urllib

【Question title】: etree generating error when using urllib
【Posted】: 2016-03-10 11:37:23
【Question】:

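I am trying to use the solutions in this post to parse an HTML table into Python (2.7). When I try either of the first two with a string (as in the examples), it works fine. However, when I try to use etree.XML on an HTML page read with urllib, I get an error. For each solution I checked that the variable I pass in is also a str. For the following code: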

from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)

I get this error:

File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module> table = etree.XML(s)

File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)

File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)

File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)

File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)

File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)

File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)

File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)

lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48

And for this code:

from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)

I get this error:

Traceback (most recent call last):

File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module> table = ET.XML(s)

File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML parser.feed(text)

File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed self._raiseerror(v)

File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror raise err

xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111

【Comments】:

Tags: python python-2.7 html-parsing elementtree

【Solution 1】:

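Although they look like the same kind of markup, HTML is not as strict as XML about being well-formed and following markup rules (opening and closing every node, escaping entities, and so on). Content that is acceptable as HTML may therefore be rejected as XML. Here both parsers fail on line 8 of the page, where lxml reports a link tag that is still open when the closing head tag arrives.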
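To make the difference concrete, here is a minimal sketch using only the Python 3 standard library. The snippet, the stylesheet name, and the helper names `parses_as_xml` and `html_start_tags` are made up for illustration and are not from the original page. It feeds the same markup, with an unclosed link tag like the one in the error above, to a strict XML parser and to a lenient HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: fine as HTML, but <link> is never closed,
# so it is not well-formed XML.
snippet = (
    '<html><head><link rel="stylesheet" href="style.css"></head>'
    '<body><p>Hi</p></body></html>'
)


def parses_as_xml(text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text):
    """Return the start tags seen when the text is read as HTML."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


print(parses_as_xml(snippet))    # the XML parser rejects the unclosed <link>
print(html_start_tags(snippet))  # the HTML parser reads the whole snippet
```

The strict parser stops at the closing head tag, which is the same kind of mismatch the question's tracebacks report, while the HTML parser simply carries on.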

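So consider parsing the page with etree's HTML() function instead. You can then use XPath to target the specific area you want to extract or work with. Below is an example that tries to pull the main table from the page. Note that the page uses quite a lot of nested tables.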

from lxml import etree
import urllib.request as rq  # Python 3; on Python 2.7 use "import urllib as rq"
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))

# PARSE PAGE
htmlpage = etree.HTML(s)

# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")

for row in htmltable:
    print(row)
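The XPath above returns a flat list of text nodes. If you want the cells grouped by row, one possible approach is sketched below. It deliberately swaps in the standard library's html.parser rather than lxml, so it is not the method used in the answer. The class name `TableRows`, the helper `table_rows`, and the sample table are hypothetical. It assumes a single, non-nested table, so the nested tables on the real page would need extra handling, and it expects a str, so bytes returned by urlopen(...).read() on Python 3 would have to be decoded first.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Store the current cell's text in the current row, if any.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Finish any open cell, then store the row.
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def table_rows(html_text):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Made-up sample data, not taken from the real page.
sample = """
<table>
  <tr><td><b>Rank</b></td><td><b>Title</b></td></tr>
  <tr><td>1</td><td>Example Film A</td></tr>
  <tr><td>2</td><td>Example Film B</td></tr>
</table>
"""

for row in table_rows(sample):
    print(row)
```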

【Comments】: