【Posted】: 2011-04-10 22:19:51
【Question】:
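I am trying to parse specific information out of HTML tags.
For example, on the website:
I am trying to extract very specific information, such as the ingredients. If you view the page source, you can see that this information sits under the heading
<h2>Ingredients</h2>, and that <ul class="ingredientsList"> contains all of the actual ingredients.
I found a Python program online that conveniently parses the hyperlinks out of a web page, and I would like to modify it to parse out these ingredients instead. I am not very proficient in Python, so how would I modify my code to fit my parsing needs?
A detailed explanation of how to do this, or an example, would be much appreciated, as I do not know much about it.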
Code:
import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0
        self.starting_description = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

    def end_a(self):
        "Record the end of a hyperlink."
        self.inside_a_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."
        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_hyperlinks(self):
        "Return the list of hyperlinks."
        return self.hyperlinks

    def get_descriptions(self):
        "Return a list of descriptions."
        return self.descriptions

import urllib, sgmllib
# Get something to work with.
f = urllib.urlopen("http://www.epicurious.com/Roast-Chicken-231348")
s = f.read()
# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)
# Get the hyperlinks.
print myparser.get_hyperlinks()
print myparser.get_descriptions()
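
For illustration, here is a minimal sketch of how the same event-driven idea could be pointed at the ingredients instead of the hyperlinks. It is not the original program: sgmllib was removed in Python 3, so the sketch swaps in the standard library's html.parser.HTMLParser, which uses the same style of start-tag, end-tag, and data handlers. The names IngredientParser and parse_ingredients are hypothetical, and the sample markup is a made-up fragment that only assumes the structure described in the question (an <h2>Ingredients</h2> heading followed by <ul class="ingredientsList">), not the real page.

```python
from html.parser import HTMLParser


class IngredientParser(HTMLParser):
    """Collect the text of each <li> inside <ul class="ingredientsList">."""

    def __init__(self):
        super().__init__()
        self.ingredients = []
        self.inside_list = False  # True while inside the target <ul>
        self.inside_item = False  # True while inside an <li> of that list

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and "ingredientsList" in classes:
            self.inside_list = True
        elif tag == "li" and self.inside_list:
            self.inside_item = True
            self.ingredients.append("")  # start a new ingredient entry

    def handle_endtag(self, tag):
        # Assumption: the ingredient list contains no nested <ul>.
        if tag == "ul":
            self.inside_list = False
            self.inside_item = False
        elif tag == "li":
            self.inside_item = False

    def handle_data(self, data):
        # Text may arrive in several chunks (e.g. around <b>), so append.
        if self.inside_item:
            self.ingredients[-1] += data


def parse_ingredients(html):
    """Return the stripped, non-empty ingredient strings found in 'html'."""
    parser = IngredientParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.ingredients if item.strip()]


# Made-up fragment for demonstration; not copied from the real page.
sample = """
<h2>Ingredients</h2>
<ul class="ingredientsList">
  <li>1 whole chicken</li>
  <li>2 tablespoons <b>unsalted</b> butter</li>
</ul>
<ul class="other"><li>not an ingredient</li></ul>
"""

print(parse_ingredients(sample))
```

The two flags play the same role as inside_a_element in the hyperlink parser: text is recorded only while the parser is inside an <li> that belongs to the target list. To run this against a live page, the downloaded page source would be decoded to a string and passed to parse_ingredients in place of sample.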
【Discussion】: