[Question title]: Strip away HTML tags from extracted links
[Posted]: 2013-12-15 22:50:40
[Question]:

I have the following code, which extracts certain links from a web page:

from bs4 import BeautifulSoup 
import urllib2, sys 
import re 

def tonaton(): 
    site = "http://tonaton.com/en/job-vacancies-in-ghana" 
    hdr = {'User-Agent' : 'Mozilla/5.0'} 
    req = urllib2.Request(site, headers=hdr) 
    jobpass = urllib2.urlopen(req) 
    invalid_tag = ('h2') 
    soup = BeautifulSoup(jobpass) 
    print soup.find_all('h2') 

The links are contained in "h2" tags, so the output I get looks like this:

<h2><a href="/en/cashiers-accra">cashiers </a></h2> 
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2> 
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2> 
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2> 

But I want to get rid of all the "h2" tags so that I get only the links, like this:

<a href="/en/cashiers-accra">cashiers </a> 
<a href="/en/cake-baker-accra">Cake baker</a> 
<a href="/en/automobile-technician-accra">Automobile Technician</a> 
<a href="/en/marketing-officer-accra-4">Marketing Officer</a> 

So I updated my code to look like this:

def tonaton(): 
    site = "http://tonaton.com/en/job-vacancies-in-ghana" 
    hdr = {'User-Agent' : 'Mozilla/5.0'} 
    req = urllib2.Request(site, headers=hdr) 
    jobpass = urllib2.urlopen(req) 
    invalid_tag = ('h2') 
    soup = BeautifulSoup(jobpass) 
    jobs = soup.find_all('h2') 
    for tag in invalid_tag: 
        for match in jobs(tag): 
            match.replaceWithChildren() 
    print jobs 

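But I can't get it to work, even though this is the best logic I could come up with. I'm a beginner, though, so I know there are better ways to do this.

Any help would be greatly appreciated.

Thanks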

[Comments]:

    Tags: python html hyperlink beautifulsoup


    [Solution 1]:

    You can navigate to the next element of each <h2> tag:

    for h2 in soup.find_all('h2'):
        n = h2.next_element
        if n.name == 'a':
            print n
    

    It produces:

    <a href="/en/financial-administrator-accra-1">Financial Administrator</a>
    <a href="/en/house-help-accra-17">House help</a>
    <a href="/en/office-manager-accra-1">Office Manager </a>
    ...
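
As a supplement to the answer above, here is a minimal sketch of three other ways to reach the same links. Two things in the question's second attempt are worth noting: `('h2')` is just the string `'h2'` (a one-element tuple needs a trailing comma, `('h2',)`), so the loop iterates over `'h'` and `'2'`; and `jobs` is a list-like result set, so calling it as `jobs(tag)` raises a `TypeError`. Also, `next_element` returns whatever node comes first inside the `<h2>`, which would be a text node if whitespace precedes the link. The sample `html` string below is a hypothetical stand-in for the downloaded page, and the `"html.parser"` argument is an assumption (the original code does not name a parser).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2>
  <a href="/en/cake-baker-accra">Cake baker</a>
</h2>
<h2>No link here</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: a CSS selector for <a> elements that are direct children of <h2>.
links = soup.select("h2 > a")

# Option 2: ask each <h2> for its first <a> descendant and
# skip headings that do not contain one.
links_via_find = []
for h2 in soup.find_all("h2"):
    a = h2.find("a")
    if a is not None:
        links_via_find.append(a)

# Option 3 (closest to the original attempt): unwrap() removes each <h2>
# but keeps its children in the tree. replaceWithChildren is an older
# alias for the same method. A fresh soup is used so options 1 and 2
# are not affected by the modification.
soup_unwrapped = BeautifulSoup(html, "html.parser")
for h2 in soup_unwrapped.find_all("h2"):
    h2.unwrap()
links_after_unwrap = soup_unwrapped.find_all("a")

for a in links:
    print(a)
```

Options 1 and 2 leave the parsed tree untouched and only collect the `<a>` tags, which is usually enough when the goal is just to print or store the links. Option 3 changes the tree itself, which is mainly useful if the cleaned-up document is needed afterwards.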
    

    [Discussion]:
