Python HTML parsing specific information within tags: answers

【Question title】: Python HTML parsing specific information within tags
【Posted】: 2011-04-10 22:19:51
【Question description】:

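I am trying to parse specific information out of tags.

For example, on this website:

http://www.epicurious.com/articlesguides/bestof/toprecipes/bestchickenrecipes/recipes/food/views/My-Favorite-Simple-Roast-Chicken-231348

I am trying to parse out very specific information, such as the ingredients. If you view the page source, you can see that this information sits under the tag <h2>Ingredients</h2>, and that <ul class="ingredientsList"> contains all of the actual ingredients.

I found a Python program online that conveniently parses the hyperlinks out of a website, and I would like to modify it to parse out the ingredients instead. I am not very proficient in Python. How would I modify my code to fit my parsing needs?

Please explain in detail how I should do this, or provide an example. It would be much appreciated, since I don't know much about this.

Code: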

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0
        self.starting_description = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

    def end_a(self):
        "Record the end of a hyperlink."

        self.inside_a_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.epicurious.com/Roast-Chicken-231348")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()
print myparser.get_descriptions()

【Question comments】:

Tags: python html parsing tags

【Solution 1】:

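Take a look at http://www.crummy.com/software/BeautifulSoup/. Your approach works for simple cases, but it will give you headaches as soon as the HTML and/or your requirements become a little more complex.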
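For comparison, here is a minimal sketch of how the hand-written, event-driven approach from the question might be adapted to collect ingredients instead of hyperlinks. It rests on several assumptions: it is written for Python 3 and uses the standard library's `html.parser` as a stand-in for `sgmllib` (which was removed in Python 3), it is not BeautifulSoup, the class name `IngredientParser` is made up for illustration, and the sample HTML is invented from the tag names quoted in the question rather than copied from the real page.

```python
from html.parser import HTMLParser


class IngredientParser(HTMLParser):
    """Collect the text of each <li> inside <ul class="ingredientsList">."""

    def __init__(self):
        super().__init__()
        self.ingredients = []
        self.inside_list = False  # True while inside the target <ul>
        self.inside_item = False  # True while inside an <li> of that list

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <LI> and <li> look the same here.
        if tag == "ul" and ("class", "ingredientsList") in attrs:
            self.inside_list = True
        elif tag == "li" and self.inside_list:
            self.inside_item = True
            self.ingredients.append("")

    def handle_endtag(self, tag):
        # Simplification: any </ul> ends the list, so nested lists are not handled.
        if tag == "ul":
            self.inside_list = False
        elif tag == "li":
            self.inside_item = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current item.
        if self.inside_item:
            self.ingredients[-1] += data


# Invented sample; the real page may be structured differently.
sample = """
<h2>Ingredients</h2>
<ul class="ingredientsList">
<li class="ingredient">Unsalted butter</li>
<LI class='ingredient'>Dijon
mustard</LI>
</ul>
<ul><li>Not an ingredient</li></ul>
"""

parser = IngredientParser()
parser.feed(sample)
parser.close()

# Collapse runs of whitespace (including newlines) inside each item.
ingredients = [" ".join(item.split()) for item in parser.ingredients]
print(ingredients)  # prints ['Unsalted butter', 'Dijon mustard']
```

Even in this small case the parser needs two state flags and a simplifying assumption about nesting, which illustrates why a hand-rolled parser gets harder to maintain as requirements grow.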

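【Comments】:

【Solution 2】:

I will probably be told off by everyone who says that HTML text cannot be analyzed with regular expressions.

OK, OK, but I got the result in fifty minutes:

First, I used this code to get a convenient display of the web page's source code: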

import urllib

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')


sock = urllib.urlopen(url)
ch = sock.read()
sock.close()


gen = (str(i)+' '+repr(line) for i,line in enumerate(ch.splitlines(1)))

print '\n'.join(gen)

After that, grabbing the ingredients is a piece of cake:

import urllib
import re

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

x = ch.find('ul class="ingredientsList">')

patingr = re.compile('<li class="ingredient">(.+?)</li>\n')

print patingr.findall(ch,x)

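EDIT

Achim,

Concerning the presence of '\n': that is my fault, not the regex tool's. I wrote the code too fast.

You are right about uppercase: BS still finds the right strings, while the regex fails. However, I have never seen source code in which element tags are written in uppercase. Can you give me a link to one?

Concerning ' versus ", same thing: I have never seen it, but you are right, it could happen.

However, when writing an RE, if uppercase letters or ' instead of " occur in some places, the RE will simply be written to match them. Where is the problem?

Do you mean: what if the source code changes? It is even less likely that a site's source code will one day switch from lowercase to uppercase, or from " to '. That is not very realistic.

So, correcting my RE is easy: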

import urllib
import re

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

#----------------------------------------------------------
patingr = re.compile('<li class="ingredient">(.+?)</li>\n')
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))


ch = ch.replace('<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>',
                "<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>")
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))


patingr = re.compile('<li class=["\']ingredient["\']>(.+?)</li>\n',re.DOTALL|re.IGNORECASE)
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))

Result

'<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>\n'
'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

"<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>\n"
'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

So, from now on, I will always add the flag re.IGNORECASE and use ["'] for quotes inside tags.
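The corrected pattern can be tried without fetching the page. The sketch below is written for Python 3 (the code in this answer is Python 2) and runs on a made-up sample string. It departs from the pattern above in two ways, both of which are variations and not part of the original answer: it drops the trailing `\n` requirement, and it uses a backreference so that the closing quote must match the opening one.

```python
import re

# Made-up sample mixing the variations discussed above:
# upper-case tag name, single quotes, and a newline inside an item.
sample = (
    '<ul class="ingredientsList">\n'
    '<li class="ingredient">Unsalted butter</li>\n'
    "<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>\n"
    '</ul>\n'
)

# (["\']) captures the opening quote; \1 requires the same quote to close.
# re.DOTALL lets '.' match the newline inside an item,
# re.IGNORECASE lets 'li' match 'LI'.
pattern = re.compile(
    r'<li\s+class=(["\'])ingredient\1\s*>(.+?)</li>',
    re.DOTALL | re.IGNORECASE,
)

# group(2) is the item text; collapse internal whitespace and newlines.
found = [" ".join(m.group(2).split()) for m in pattern.finditer(sample)]
print(found)  # prints ['Unsalted butter', 'One 2- to 3-pound farm-raised chicken']
```

This still assumes that `class` is the only attribute on the tag and that its value is exactly `ingredient`; an extra attribute or a second class name would make the pattern miss the item.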

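Are there other "problems" that could occur? I would be interested to learn about them.

I am not claiming that regexes must be used in every case and parsers never. I just think that, when the conditions for using regexes in a controlled and delimited manner are met, they are very interesting, and it would be a pity to ignore them.

By the way, you did not mention the fact that regexes are much faster than BeautifulSoup. See time comparison between regex and BeautifulSoup

【Comments】: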

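@Ryan Matthew I see that you accepted my answer. Fine. Do you know that you can also upvote? - Note that variations in the source code from one page to another within a given site may cause failures; the results must be observed over several runs, and adding verification snippets where possible is good practice. If you have other questions, I will be happy to answer them.
Hi eyquem, that is what I figured out as well. So I will also try using Beautiful Soup, since that might be easier when switching to a different page source code
@Ryan Matthew "that might be easier when switching to a different page source code": I have the same impression, but it is not absolutely obvious. By the way, see the edit in my other answer: at least in the example I tested, the regex is faster by 3 orders of magnitude (I mean 1000 times faster)
Your regex breaks if the <li> contains a newline. It also breaks with the smallest changes in the HTML code. For example, using LI or single quotes does not change the semantics of the document at all, but breaks your code. This code may work for a single document, but I would never use something like it in real life, let alone in production code.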
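
The failure modes raised in these comments (a page whose markup differs slightly, an item that spans lines) suggest checking the extracted list before trusting it, whichever extraction method is used. A minimal sketch of such a verification snippet, in Python 3; the function name `check_ingredients` and the specific checks are hypothetical and only illustrate the idea:

```python
def check_ingredients(items, min_count=1):
    """Return items unchanged, or raise ValueError if the list looks wrong."""
    if len(items) < min_count:
        raise ValueError(
            "expected at least %d ingredients, got %d" % (min_count, len(items))
        )
    for item in items:
        if not item.strip():
            raise ValueError("empty ingredient")
        # Leftover angle brackets usually mean the extraction ran past a tag.
        if "<" in item or ">" in item:
            raise ValueError("markup left in ingredient: %r" % item)
    return items


# Passes silently on a plausible result.
check_ingredients(["Unsalted butter", "Dijon mustard"])

# An empty result (for example, after the page layout changed) is reported.
try:
    check_ingredients([])
except ValueError as exc:
    print("extraction problem:", exc)
```

A check like this does not make the extraction more robust, but it turns a silent wrong answer into a visible error when a page deviates from the expected layout.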