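【Posted】: 2012-03-31 10:28:37
【Question】:
I'm using a simple HTMLParser to parse a web page whose markup is always well-formed (it is generated automatically). It works well until it reaches a piece of data containing an ampersand ('&'): the parser seems to treat that as two separate pieces of data and processes them separately (that is, it calls "handle_data" twice). At first I thought unescaping the '&' would solve the problem, but I no longer think it will. Does anyone have a suggestion for how I can get my parser to process, for example, "Paradise Bakery and Cafe" (i.e. "Paradise Bakery & Café") as a single data item rather than two?
Thanks very much, bsg
P.S. Please don't tell me I really should be using BeautifulSoup. I know. But in this case I know the markup is guaranteed to be well-formed every time, and I find HTMLParser easier to work with than BeautifulSoup. Thanks.
I'm adding my code. Thanks!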
# This class, extending HTMLParser, is written to process HTML within a <ul>.
# There are 6 <a> elements nested within each <li>, and I need the data from the second
# one. Whenever it encounters an <li> tag, it sets the 'is_li' flag to true and resets
# the count of a's seen to 0; whenever it encounters an <a> tag, it increments the count
# by 1. When handle_data is called, it checks to make sure that the data is within
# 1) an li element and 2) an a element, and that the a element is the second one in that
# li (num_as == 2). If so, it adds the data to the list.
from HTMLParser import HTMLParser  # Python 2 module name; in Python 3 it is html.parser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # Note the double underscores: a method named _init_ is never called.
        HTMLParser.__init__(self)
        self.pages = []        # data collected from the second <a> of each <li>
        self.is_li = False     # True once an <li> has been opened
        self.num_as = 0        # number of <a> tags seen in the current <li>
        self.close_a = False   # True after the current </a>
        self.close_li = False  # True after the current </li>
        print "initialized"

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li = True
            self.close_a = False
            self.close_li = False
        if tag == 'a' and self.is_li:
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = False
            else:
                self.num_as = 0
                self.is_li = False

    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a = True
        if tag == 'li':
            self.close_li = True
            self.num_as = 0

    def handle_data(self, data):
        if self.is_li:
            if self.num_as == 2 and not self.close_li and not self.close_a:
                print "found data", data
                self.pages.append(data)

    def get_pages(self):
        return self.pages
【Discussion】:
Tags: python escaping html-parsing
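One possible way to get a single data item, sketched here rather than offered as the definitive fix: collect the text fragments in a buffer while inside the second <a>, and join them when that <a> closes. The parser reports an entity such as &amp; through handle_entityref (and numeric references through handle_charref) instead of handle_data, which is why the text arrives in pieces. The sketch below is written for Python 3, which is an assumption (the question's code is Python 2); it passes convert_charrefs=False so that the split described in the question is reproduced, since recent Python 3 versions otherwise convert character references before calling handle_data. The class name SecondLinkParser, its attributes, and the sample markup are hypothetical.

```python
import html
from html.parser import HTMLParser


class SecondLinkParser(HTMLParser):
    """Collects the text of the second <a> inside each <li> (hypothetical name)."""

    def __init__(self):
        # convert_charrefs=False makes the parser report &amp; and friends
        # through handle_entityref/handle_charref, as in the question.
        super().__init__(convert_charrefs=False)
        self.pages = []
        self.in_li = False
        self.num_as = 0
        self.buffer = None  # list of fragments while inside the target <a>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.num_as = 0
        elif tag == "a" and self.in_li:
            self.num_as += 1
            if self.num_as == 2:
                self.buffer = []  # start collecting fragments

    def handle_endtag(self, tag):
        if tag == "a" and self.buffer is not None:
            # Join all fragments seen since the opening <a> into one item.
            self.pages.append("".join(self.buffer))
            self.buffer = None
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; or &eacute;
        if self.buffer is not None:
            self.buffer.append(html.unescape("&%s;" % name))

    def handle_charref(self, name):
        # Numeric reference such as &#233; or &#xE9;
        if self.buffer is not None:
            self.buffer.append(html.unescape("&#%s;" % name))


# Hypothetical sample markup, shortened to three links per <li>.
sample = ('<ul><li><a href="#">one</a>'
          '<a href="#">Paradise Bakery &amp; Caf&eacute;</a>'
          '<a href="#">three</a></li></ul>')
parser = SecondLinkParser()
parser.feed(sample)
parser.close()
print(parser.pages)
```

Buffering until the closing tag keeps the parser independent of how many times handle_data is called for one element, so the same approach should also cover text that is split for other reasons, such as being fed to the parser in chunks.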