Experienced engineers (the unmarried ones, anyway) tend to rent near the office, sparing themselves the commute and saving a lot of time. Others, for one reason or another, have to shuttle between home and office every day before dawn and after dark, and unfortunately I am one of them. Home and office are far apart, so my time on the bus eats up a quarter of my working hours, and Hangzhou has long been nicknamed China's Las Vegas (a pun: in Chinese, "casino city" and "gridlock city" sound the same). Whenever traffic jams up, I can picture myself turning into a Transformer. For any programmer such a long stretch of dead time is hard to bear, but since I can't change my situation in the short term, I might as well make good use of it. So I bought a big-screen Note II for reading PDFs, and my ears shouldn't sit idle either: not for English practice, but for audiobooks. Back in school I loved listening to the radio, especially storytelling and crosstalk, so I need a steady supply of audiobook novels. These resources are plentiful online nowadays, but downloading them is a real hassle: to squeeze out more traffic and ad clicks, these sites make you open at least two pages before you reach the real download link. To cut down the overall download time, I wrote this small program so that I (and anyone else) can conveniently download audiobooks, or indeed any other kind of resource.
Let me be clear up front: this is not about scraping piles of data, just entertainment and learning, so the program will not blindly crawl every link on a site. Instead, it works from one given novel. Say I want to download the novel 《童年》 (Childhood): I find its main page on 我听评书网 (www.5tps.com) and let the program download all of its mp3 files. The concrete steps are in the code below; everything lives in the module crawler5tps:
1. Set the start URL and the saving directory
```python
#-*-coding:GBK-*-
import urllib, urllib2
import re, threading, os

baseurl = 'http://www.5tps.com'  # base url
down2path = 'E:/enovel/'         # saving path
save2path = ''                   # saving file name (full path)
```
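As a side note, in Python 3 the per-novel directory setup can be folded into one small helper with `os.makedirs(..., exist_ok=True)`, which also removes the exists-then-mkdir race. This is just a sketch; the helper name and the sample title are my own, not part of the script above:

```python
import os

def make_save_dir(root, title):
    """Build (and create, if missing) the per-novel saving directory,
    returning it with a trailing separator like save2path in the script."""
    path = os.path.join(root, title) + os.sep
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path
```

Called as `make_save_dir('E:/enovel/', u'童年')`, it would yield the same `save2path` the script builds by hand.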
2. Parse the download-page URLs out of the start URL
```python
def parseUrl(starturl):
    '''
    parse out the download pages from the start url.
    eg. we can get 'http://www.5tps.com/down/8297_52_1_1.html'
    from 'http://www.5tps.com/html/8297.html'
    '''
    global save2path
    # find the links of the download pages
    rDownloadUrl = re.compile(r".*?<A href='(/down/\w+\.html)'.*")
    f = urllib2.urlopen(starturl)
    totalLine = f.readlines()

    # create the name of the saving directory from the page title, which
    # looks like <TITLE>有声小说 闷骚1 播音:刘涛 全集</TITLE>,
    # so the second space-separated field is the novel's name
    title = totalLine[3].split(" ")[1]
    if not os.path.exists(down2path + title):
        os.mkdir(down2path + title)
    save2path = down2path + title + "/"

    downUrlLine = [line for line in totalLine if rDownloadUrl.match(line)]
    downLoadUrl = []
    for dl in downUrlLine:
        # one source line may hold several links, so peel them off one by one
        while True:
            m = rDownloadUrl.match(dl)
            if not m:
                break
            downUrl = m.group(1)
            downLoadUrl.append(downUrl.strip())
            dl = dl.replace(downUrl, '')
    return downLoadUrl
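The match-then-replace loop can also be expressed in one call with `re.findall`, which returns every captured group on a line at once. Here is a minimal sketch of that extraction; the sample HTML line is made up to resemble the site's markup, not copied from it:

```python
import re

# One download-page line may carry several <A href='/down/...'> anchors;
# re.findall pulls out every captured link in a single pass.
line = ("<td><A href='/down/8297_52_1_1.html'>1</A> "
        "<A href='/down/8297_52_1_2.html'>2</A></td>")
links = re.findall(r"<A href='(/down/\w+\.html)'", line)
print(links)  # → ['/down/8297_52_1_1.html', '/down/8297_52_1_2.html']
```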
3. Parse the real download link out of the download page
```python
def getDownlaodLink(starturl):
    '''
    find out the real download link from the download page.
    eg. we can get the download link 'http://180j-d.ysts8.com:8000/人物纪实/童年/001.mp3?\
    1251746750178x1356330062x1251747362932-3492f04cf54428055a110a176297d95a' from \
    'http://www.5tps.com/down/8297_52_1_1.html'
    '''
    downUrl = []
    gbk_ClickWord = '点此下载'  # "click here to download"
    downloadUrl = parseUrl(starturl)
    # find the real download link behind the "click here" anchor
    rDownUrl = re.compile(r'<a href="(.*)"><font color="blue">' + gbk_ClickWord + r'.*</a>')
    for url in downloadUrl:
        realurl = baseurl + url
        print realurl
        for line in urllib2.urlopen(realurl).readlines():
            m = rDownUrl.match(line)
            if m:
                downUrl.append(m.group(1))
    return downUrl
```
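To see what that regex actually captures, here is a quick standalone check against a made-up download-page line (the URL and markup are my own assumptions modeled on the site, not real data):

```python
import re

gbk_ClickWord = '点此下载'  # "click here to download"
# same pattern as in getDownlaodLink: capture the href of the anchor
# whose blue text is the "click here" phrase
rDownUrl = re.compile(r'<a href="(.*)"><font color="blue">' + gbk_ClickWord + r'.*</a>')

sample = ('<a href="http://180j-d.ysts8.com:8000/x/001.mp3?abc">'
          '<font color="blue">点此下载</font></a>')
m = rDownUrl.match(sample)
print(m.group(1))  # → http://180j-d.ysts8.com:8000/x/001.mp3?abc
```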
4. Define the download function
```python
def download(url, filename):
    ''' download one mp3 file '''
    print url
    urllib.urlretrieve(url, filename)
```
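If you are on Python 3, `urllib` and `urllib2` were merged into `urllib.request`, so the same step would look roughly like this (a sketch, not part of the original script); conveniently, `urlretrieve` also accepts `file://` URLs, which makes it easy to try offline:

```python
from urllib.request import urlretrieve  # Python 3 home of urlretrieve

def download(url, filename):
    '''download one file from url and save it as filename'''
    print(url)
    urlretrieve(url, filename)
```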
5. Create a thread class for downloading files
```python
class DownloadThread(threading.Thread):
    ''' download thread class '''
    def __init__(self, func, savePath):
        threading.Thread.__init__(self)
        self.function = func      # here this holds the url to download
        self.savePath = savePath

    def run(self):
        download(self.function, self.savePath)
```
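The pattern here is the standard one: subclass `threading.Thread`, stash the arguments in `__init__`, and do the work in `run()`. A self-contained Python 3 sketch of the same pattern, with a stand-in `work` function in place of `download()` (both names here are mine, for illustration):

```python
import threading

results = []  # collected by the worker so we can see the thread ran

def work(url, save_path):
    '''stand-in for download(): just record what would be fetched'''
    results.append((url, save_path))

class DownloadThread(threading.Thread):
    def __init__(self, url, save_path):
        threading.Thread.__init__(self)
        self.url = url
        self.save_path = save_path

    def run(self):
        work(self.url, self.save_path)

t = DownloadThread('http://example.com/001.mp3', '/tmp/1.mp3')
t.start()
t.join()  # wait for the worker to finish
```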
6. Start downloading
```python
if __name__ == '__main__':
    starturl = 'http://www.5tps.com/html/8297.html'
    downUrl = getDownlaodLink(starturl)
    aliveThreadDict = {}       # alive threads
    downloadingUrlDict = {}    # urls currently being downloaded

    i = 0
    while i < len(downUrl):
        # Note: 我听评书网 only allows three simultaneous downloads of the same
        # novel, and the network is not always reliable, so to make sure what
        # we fetch really is the mp3, the thread count is capped at 2.
        while len(downloadingUrlDict) < 2 and i < len(downUrl):
            downloadingUrlDict[i] = i
            i += 1
        for urlIndex in downloadingUrlDict.values():
            if urlIndex not in aliveThreadDict.values():
                t = DownloadThread(downUrl[urlIndex], save2path + str(urlIndex + 1) + '.mp3')
                t.start()
                aliveThreadDict[t] = urlIndex
        for (th, urlIndex) in aliveThreadDict.items():
            if not th.isAlive():
                del aliveThreadDict[th]           # free the thread slot
                del downloadingUrlDict[urlIndex]  # this url is done

    print 'Completed Download Work'
```
(The inner `while` now also checks `i < len(downUrl)`; without that bound the original could push `i` past the end of the url list and crash on `downUrl[urlIndex]` for the final batch.)
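The two-dict bookkeeping above can be replaced by a `threading.BoundedSemaphore`, which caps how many downloads run at once without any polling loop. A minimal Python 3 sketch of that "at most two concurrent" policy; the url list and the `fetch` helper are made up for illustration:

```python
import threading

MAX_CONCURRENT = 2
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
done = []

def fetch(url):
    with slots:           # blocks while two downloads are already in flight
        done.append(url)  # real code would call download(url, ...) here

urls = ['u%d.mp3' % i for i in range(5)]
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread starts immediately, but the semaphore lets only two at a time past `with slots:`, so the site's per-novel connection limit is respected without hand-rolled slot tracking.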
That's it. Let it download to its heart's content while I go hack on other projects, sigh >>>
After work I just copy the files onto the Note and can listen to a novel while reading documents. Finally, the full source code is attached at the end.