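【Posted】: 2012-04-16 09:39:20
【Question】:
I'm new to Python and I'm developing a web crawler. Below is my program, which collects the links from a given URL. The problem is that I don't want it to visit a URL it has already visited. Please help.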
import re
import urllib.request
import sqlite3
db = sqlite3.connect('test2.db')
db.row_factory = sqlite3.Row
db.execute('drop table if exists test')
db.execute('create table test(id INTEGER PRIMARY KEY,url text)')
#linksList = []
# function to visit the given url and get all the links in that page
def get_links(urlparse):
    try:
        if urlparse.find('.msi') == -1:  # check whether the url contains the .msi extension
            htmlSource = urllib.request.urlopen(urlparse).read().decode("iso-8859-1")
            # parse htmlSource and find all anchor tags
            linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)  # returns href and other attributes of the a tag
            for link in linksList:
                start_quote = link.find('"')  # start point in the link
                end_quote = link.find('"', start_quote + 1)  # end point in the link
                url = link[start_quote + 1:end_quote]  # the string between start_quote and end_quote
                def concate(url):  # some hrefs return only /contact or /about, so prepend the base url
                    if url.find('http://'):
                        url = (urlparse) + url
                        return url
                    else:
                        return url
                url_after_concate = concate(url)
                # linksList.append(url_after_concate)
                try:
                    if url_after_concate.find('.tar.bz') == -1:  # skip links that point to software or download pages
                        db.execute('insert or ignore into test(url) values (?)', [url_after_concate])
                except:
                    print("insertion failed")
        else:
            return True
    except:
        print("failed")
get_links('http://www.python.org')
cursor = db.execute('select * from test')
for row in cursor:  # retrieve the links stored in the database
    print(row['id'], row['url'])
    urlparse = row['url']
    # print(linksList)
    # if urlparse in linksList == -1:
    try:
        get_links(urlparse)  # parse the links from the database again
    except:
        print("url error")
Please suggest a way to solve this problem.
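One possible way to avoid revisiting pages is to keep a set of URLs that have already been fetched and to check it before every fetch. The sketch below is only an illustration under stated assumptions: `crawl`, `make_db`, `fake_fetch_links`, and the `FAKE_SITE` link graph are hypothetical names that do not appear in the question, and the fake fetcher stands in for the real download step so the example runs without network access. It also declares the `url` column as `UNIQUE`, because `INSERT OR IGNORE` only skips a row when a constraint would be violated, and the table in the question has no such constraint on `url`.

```python
import sqlite3
from collections import deque


def make_db(path=":memory:"):
    """Open a database whose url column rejects duplicates."""
    db = sqlite3.connect(path)
    # UNIQUE is what lets INSERT OR IGNORE silently skip a repeated URL.
    db.execute(
        "CREATE TABLE IF NOT EXISTS test(id INTEGER PRIMARY KEY, url TEXT UNIQUE)"
    )
    return db


def crawl(start_url, fetch_links, db, max_pages=100):
    """Breadth-first crawl that fetches each URL at most once.

    fetch_links is a hypothetical callable: it takes a URL and returns
    the list of absolute URLs found on that page.
    """
    visited = set()             # URLs that have already been fetched
    queue = deque([start_url])  # URLs waiting to be fetched

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue            # already fetched once, skip it
        visited.add(url)
        db.execute("INSERT OR IGNORE INTO test(url) VALUES (?)", (url,))
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)

    db.commit()
    return visited


# Stand-in link graph with cycles, used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}

fetch_calls = []  # records every URL the fake fetcher was asked for


def fake_fetch_links(url):
    fetch_calls.append(url)
    return FAKE_SITE.get(url, [])
```

Calling `crawl("http://example.com/", fake_fetch_links, make_db())` should request each of the three pages once even though they link to each other in a loop. The explicit queue also avoids iterating over a database cursor while inserting into the same table, which is what the loop in the question does.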
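【Comments】:
- A few comments. Your function has too many levels of nesting. Move the concate function out of get_links. Also, the word is "concatenate". Don't use regular expressions to parse HTML; use a library such as BeautifulSoup. Don't use a catch-all except: that swallows exceptions without printing any diagnostic information.
- A follow-up question: have you considered using the recursive web downloader wget, and then processing whatever wget retrieves for you?
- @Li-aungYip No sir, I haven't used it. But I think wget just fetches content from a given URL. Here I'm only interested in getting the values of all the hrefs.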
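The first comment's advice (no regex for HTML, helper moved out of get_links) could look roughly like the sketch below. Note the swap: the comment recommends BeautifulSoup, but to stay free of third-party dependencies this sketch uses the standard library's `html.parser.HTMLParser` instead, and `urllib.parse.urljoin` in place of the hand-written concate helper. `LinkCollector` and `extract_links` are hypothetical names chosen for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves /about or contact.html against the page URL
                # and leaves links that are already absolute unchanged.
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment so the same page is not stored twice.
                absolute, _fragment = urldefrag(absolute)
                self.links.append(absolute)


def extract_links(base_url, html_source):
    """Return the absolute URLs of all anchors found in html_source."""
    collector = LinkCollector(base_url)
    collector.feed(html_source)
    return collector.links
```

For example, `extract_links("http://www.python.org/doc/", '<a href="/about">About</a>')` resolves the relative link to `http://www.python.org/about`. As for the last point in that comment, catching a specific exception such as `urllib.error.URLError` around the download and printing it would show why a fetch failed instead of just printing "failed".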
Tags: python urllib web-crawler