【Question title】: Get contents of <a> tags using Python
【Posted】: 2010-06-29 22:14:31
【Question】:

Suppose I have HTML like this read into my program:

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>

How do I get the contents of the text nodes? What I want to end up with is something like this line printed in the terminal:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT

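So far I have the following code, which extracts the href links just fine, but I am not sure how to extract the data itself. I was thinking of overriding handle_data(self, data) from the sgmllib.py module, but so far I cannot work out how to do it.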

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

Thanks!

【Question comments】:

Tags: python html-parsing sgml


【Answer 1】:

The simplest is probably BeautifulSoup (be sure to use a 3.0.* release, 3.0.8 or later, not 3.1.*, unless you are on Python 3; see here!).

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string

BeautifulSoup produces unicode strings. If that is a problem, be sure to encode them the way you want in order to get byte strings.
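
A minimal sketch of that encoding step, assuming UTF-8 is the byte encoding you want. The `href` and `text` values are hypothetical stand-ins for `anchor['href']` and `anchor.string`, and the sketch is written for Python 3 (where `str` is already Unicode) rather than the Python 2 used in the answer:

```python
# Hypothetical values standing in for anchor['href'] and anchor.string.
href = "http://example.com/1.html"
text = "Caf\u00e9 Sales Associate"

# Build the output line as a Unicode string first ...
line = "%s - %s" % (href, text)

# ... then encode once, at the boundary where bytes are required
# (for example when writing to a binary file or a socket).
encoded = line.encode("utf-8")

# Decoding with the same codec gives back the original Unicode string.
decoded = encoded.decode("utf-8")
```

Encoding at the last moment keeps the rest of the program working with text only, which is usually easier to reason about.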

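【Comments】:

  • If you follow this, really do not use 3.1.* (I should have read everything before diving in) :)
【Answer 2】:

Personally, I would use lxml. Once it is installed, getting what you want is simple: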

from lxml import html

tree = html.fromstring(open("data.html").read())

print [e.text_content() for e in tree.xpath("//a")]

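【Comments】:

    【Answer 3】:

    SGMLParser is deprecated in Python 2.6 and will be gone in 3.0. You might want to use the HTMLParser module instead. I had never used it before (I always just use BeautifulSoup for this kind of thing), so I thought I would find out how it works. Here is a sample script I put together that should do what you want.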

    #!/usr/bin/env python
    
    from HTMLParser import HTMLParser
    
    class URLParser(HTMLParser):
        def __init__(self):
            self.in_link = False
            self.links = []
            self.current_link = ''
            HTMLParser.__init__(self)
    
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.current_link = self.get_href_from_attrs(attrs)
                self.in_link = True
    
        def handle_endtag(self, tag):
            if tag == 'a':
                self.links.append(self.current_link)
                self.in_link = False
    
        def handle_data(self, data):
            if self.in_link:
                self.current_link = '%s - %s' % (self.current_link, data)
    
        def get_href_from_attrs(self, attrs):
            # attrs is a list of (name, value) tuples like:
            #  [('href', 'www.google.com'), ('class', 'some-class')]
            for prop, val in attrs:
                if prop == 'href':
                    return val
            return ''
    
    if __name__ == '__main__':
        the_html = '''
    <p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
    
    <p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
    
    <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
    <p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
    
    <p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
    <p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
        '''
        url_parser = URLParser()
        url_parser.feed(the_html)
    
        print '\n'.join(url_parser.links)
    

    Output:

    http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T  -  P/T Sales Associate - Caliente Fashions
    http://vancouver.en.craigslist.ca/van/ret/1817804151.html - IMMEDIATE EMPLOYMENT WANTED!
    http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
    http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position
    http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk
    http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES
    http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate
    http://vancouver.en.craigslist.ca/van/ret/1817573985.html - Retail with small parts appliance background
    http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere
    http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT
    http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE
    http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales
    

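    Update: after going through this little exercise, this interface feels awful, so I will stick with the much cleaner BeautifulSoup library. See Alex's example for how it is done.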
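
For readers on Python 3, where the module is named `html.parser`, the same idea can be sketched as follows. This is an illustration, not the script from the answer: the class name `LinkTextParser`, the `(href, text)` tuple layout, and the `example.com` URLs are assumptions made for the example. It collects text fragments in a list and joins them at the closing tag, so an entity such as `&amp;` does not split the link text the way it does in the first output line of the script in this answer.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, or None
        self._chunks = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples; value may be None.
            self._href = dict(attrs).get("href") or ""
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


# Hypothetical sample input modelled on the question's HTML.
sample = (
    '<p><a href="http://example.com/1.html">F/T &amp; P/T Sales</a> - '
    '<font size="-1"> (North Vancouver)</font></p>\n'
    '<p><a href="http://example.com/2.html">TRAVEL AGENT</a> - </p>'
)

parser = LinkTextParser()
parser.feed(sample)
parser.close()

for href, text in parser.links:
    print(href, "-", text)
```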

    【Comments】:

    • I like: for k, v in attrs: if k == 'href': return v
    【Answer 4】:

    As long as we are comparing options, this pyparsing snippet also gives you the location of each position, which is given in the <font> tag after the closing <a> tag:

    from pyparsing import makeHTMLTags, SkipTo
    
    a,aEnd = makeHTMLTags("A")
    font,fontEnd = makeHTMLTags("FONT")
    p,pEnd = makeHTMLTags("P")
    
    patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' + 
            font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)
    
    for tokens,_,_ in patt.scanString(the_html):
        print tokens.a.href, '-', tokens.posn, tokens.locn
    

    This gives:

    http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T &amp; P/T Sales Associate - Caliente Fashions (North Vancouver)
    http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
    http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
    http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
    http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
    http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
    http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
    http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
    http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
    http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)
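
The pattern above requires a <font> element, so the two listings that have none do not appear in the output. As a rough alternative sketch, not using pyparsing but the standard library's `html.parser` on Python 3, the following keeps those listings and leaves the location empty. The class name `ListingParser`, the row layout, and the `example.com` URLs are assumptions made for this example.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect (href, title, location) per <p>; location is '' if absent."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None     # dict for the <p> currently being parsed
        self._field = None   # field that incoming text belongs to, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._row = {"href": "", "title": "", "location": ""}
        elif self._row is not None and tag == "a":
            self._row["href"] = dict(attrs).get("href") or ""
            self._field = "title"
        elif self._row is not None and tag == "font":
            self._field = "location"

    def handle_data(self, data):
        # Text outside <a> or <font> (such as the " - " separator) is ignored.
        if self._row is not None and self._field:
            self._row[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "font"):
            self._field = None
        elif tag == "p" and self._row is not None:
            row = self._row
            self.rows.append(
                (row["href"], row["title"].strip(), row["location"].strip())
            )
            self._row = None


# Hypothetical sample: one listing with a location, one without.
sample = (
    '<p><a href="http://example.com/1.html">TRAVEL AGENT</a> - '
    '<font size="-1"> (NORTH VANCOUVER)</font></p>\n'
    '<p><a href="http://example.com/2.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>'
)

parser = ListingParser()
parser.feed(sample)
parser.close()

for href, title, location in parser.rows:
    print(href, "-", title, location)
```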
    

    【Comments】:

      【Answer 5】:
      # Download the BeautifulSoup library for Python first
      from BeautifulSoup import *
      
      fh = open('data.html')
      html = fh.read()
      soup = BeautifulSoup(html)
      
      tags = soup('a')
      
      for tag in tags:
          print tag.contents[0]
      

      【Comments】:
