Python: Extracting specific data with html parser

【Question title】: Python: Extracting specific data with html parser
【Posted】: 2013-05-22 08:01:21
【Question】:

I started using HTMLParser in Python to extract data from a website. I get everything I want except the text between two tags of the HTML. Here is an example of the HTML tag:

<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>

There are also other tags starting with <a. They have other attributes and values, so I do not want their data:

<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>

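The tag is embedded in a table. I don't know whether that makes any difference for the other tags. I only want the information from tags named 'a' with the attribute class="Vocabulary", and I want the data inside the tag, which in the example is "Swahili". So what I did is: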

class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""


    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag      
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print self.lasttag
                    #print self.lastname
                    #print self.lastvalue
                    #return tag
                    print self.countLanguages




    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print "".join(self.data)

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            #
            print data

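The program prints the data contained in every tag, but I only want the data from tags with the right attribute. How can I get this specific data?

【Comments on the question】:

Tags: python html python-2.7 html-parsing html-parser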

【Solution 1】:

It looks like you forgot to set self.inLink = False by default in handle_starttag:

from HTMLParser import HTMLParser


class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None

    def handle_starttag(self, tag, attrs):
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag

    def handle_endtag(self, tag):
        if tag == "a":
            self.inLink = False  # stop collecting text once the link closes

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            print data


parser = AllLanguages()
parser.feed("""
<html>
<head><title>Test</title></head>
<body>
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="Russian" class="Vocabulary">Russian</a>
</body>
</html>""")

Prints:

Swahili
English
Russian
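
A related detail in the question's code: handle_endtag assigns self.inlink (lowercase "l"), while the flag that handle_data checks is self.inLink, so the flag is never cleared when a link closes. The following is only a sketch of the same idea with a single consistently named flag, assuming Python 3 (where the module is html.parser rather than HTMLParser). The class name VocabularyLinks, the attribute names, and the sample markup are made up for illustration; results are collected in a list instead of being printed from inside the parser.

```python
from html.parser import HTMLParser  # assumption: Python 3 module name


class VocabularyLinks(HTMLParser):
    """Collect the text of <a class="Vocabulary"> elements (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a>
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and dict(attrs).get("class") == "Vocabulary":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False  # same attribute name as above

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())


# Hypothetical sample with text between the links
html = """
<table><tr>
<td><a href="http://wold.livingsources.org/vocabulary/1" class="Vocabulary">Swahili</a> by
<a href="http://wold.livingsources.org/contributor#schadebergthilo" class="Contributor">Thilo Schadeberg</a></td>
<td><a href="http://wold.livingsources.org/vocabulary/2" class="Vocabulary">English</a></td>
</tr></table>
"""

parser = VocabularyLinks()
parser.feed(html)
print(parser.names)  # ['Swahili', 'English']
```

Because the flag is cleared in handle_endtag, the text " by" that follows the first link is skipped as well, not only the text of the Contributor link.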

Hope that helps.

【Comments】:

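Thanks a lot. I was hoping it was something small ;). I also tried BeautifulSoup, and that works perfectly too. Thanks again for your help.
You're welcome. If it helped, please consider accepting the answer, thanks!
Do you have a recommendation for a particular parser? I need the data from the HTML file and want to write it to an XML file. Which one would you use? Or what are the advantages of each parser?
Well, BeautifulSoup and lxml are good HTML parsers. lxml is known for its speed; BeautifulSoup is very convenient but does not support XPath expressions. See more: blog.ianbicking.org/2008/03/30/python-html-parser-performance, stackoverflow.com/questions/3577641/…, stackoverflow.com/questions/6494199/….
OK, I have to parse a lot of data, so BeautifulSoup is too slow. But I think I will try lxml. Thanks a lot.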
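
For the XML-writing step mentioned in the comments, here is a minimal sketch using only the standard library's xml.etree.ElementTree, assuming Python 3. It does not settle which HTML parser to use; it only shows how already extracted names could be turned into XML. The element names vocabularies and vocabulary and the input list are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: names already extracted from the HTML page
names = ["Swahili", "English", "Russian"]

# Element names are made up for this example
root = ET.Element("vocabularies")
for name in names:
    ET.SubElement(root, "vocabulary").text = name

# Serialize to a str; encoding="unicode" returns text rather than bytes
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To write a file instead of a string, ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True) can be used, where path is whatever output location you choose.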

【Solution 2】:

You can try HTQL (http://htql.net). The query for:

"tags named 'a' with the attribute class="Vocabulary", and I want the data inside the tag"

is:

<a (class='Vocabulary')>:tx

The Python code looks like this:

import htql
a=htql.query(page, "<a (class='Vocabulary')>:tx")
print(a)

【Comments】: