【Posted】: 2013-12-15 22:50:40
【Question】:
I have the following code to extract certain links from a web page:
from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')
The links are contained in "h2" tags, so the output I get looks like this:
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
But I want to get rid of all the "h2" tags so that I get only the links, like this:
<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>
So I updated my code as follows:
def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs
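One detail in this attempt can be checked without any network access: `('h2')` is a plain string, not a one-element tuple, so the outer loop visits the characters 'h' and '2' rather than the tag name 'h2'. In addition, `jobs` is the list-like result of `find_all`, so calling it as `jobs(tag)` will most likely raise a `TypeError` instead of searching. The sketch below only illustrates the string-versus-tuple point; the name `invalid_tags` is introduced here for illustration and does not appear in the code above.

```python
# Parentheses alone do not create a tuple; this is just the string 'h2'.
invalid_tag = ('h2')
print(type(invalid_tag).__name__)

# Iterating over a string yields its characters one by one.
chars_seen = [tag for tag in invalid_tag]
print(chars_seen)

# A trailing comma is what makes a one-element tuple.
# `invalid_tags` is a hypothetical name used only for this illustration.
invalid_tags = ('h2',)
tags_seen = [tag for tag in invalid_tags]
print(tags_seen)
```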
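But I couldn't get it to work, even though this is the best logic I could come up with. I'm a newbie, though, so I know there is probably a better way to do it.
Any help would be greatly appreciated.
Thanks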
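One possible approach, offered as a minimal sketch rather than a definitive answer: since each "h2" wraps exactly one "a" element in the output shown above, the links can be collected by taking the first "a" inside every "h2". The sketch parses a hard-coded copy of the sample markup instead of fetching the live page, is written for Python 3 with the built-in "html.parser" backend, and uses a hypothetical helper name, `extract_job_links`, that is not part of the original code.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question;
# the real page would be fetched over HTTP instead.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
"""


def extract_job_links(markup):
    """Return the <a> tag found inside each <h2> (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for heading in soup.find_all("h2"):
        anchor = heading.find("a")
        # Skip headings that do not contain a link.
        if anchor is not None:
            links.append(anchor)
    return links


links = extract_job_links(SAMPLE_HTML)
for link in links:
    print(link)
```

Each element of `links` is a Tag, so `link["href"]` gives the relative URL and `link.get_text()` gives the job title. If the goal is instead to modify the parsed tree in place, calling `unwrap()` (the current name for `replaceWithChildren()`) on each tag returned by `soup.find_all("h2")` should remove the "h2" wrappers while keeping their children.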
【Comments】:
Tags: python html hyperlink beautifulsoup