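The previous article used a two-level directory layout: a root directory named "小说" (novels), a second-level directory named after each work, and the novel files inside it.
This article reworks part of that code so the layout becomes root directory -> author directory -> work directory -> chapter .txt files.
That is not the main point here, though. The main point is that while scraping with this crawler, the program was often interrupted by network packet loss and similar problems.
My first idea was to poll the site status in a loop and re-issue the request, but that did not seem to help. Then I came across multithreading in 虫师's Selenium book and decided to try it. It turned out to be fast. Cool!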
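The retry idea mentioned above might look roughly like the sketch below. It is not the code from the original crawler: `fetch_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical names used only for illustration, and the request is simulated so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.1):
    """Call fetch(url), retrying on OSError up to `attempts` times.

    `fetch` is any callable that returns the page body or raises OSError
    (connection errors and socket timeouts are OSError subclasses).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:
            last_error = exc
            print('attempt', attempt, 'failed:', exc)
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise last_error


# Hypothetical stand-in for a real request: fails twice, then succeeds.
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated packet loss')
    return 'content of ' + url


page = fetch_with_retry(flaky_fetch, 'http://example.com/chapter1')
print(page)
```

A retry loop of this kind only helps when the failure is transient. It does nothing for throughput, which is where the threads in the rest of this article come in.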
The code below is mostly taken from 虫师's selenium2 book.
Importing the threading module
import threading
Creating a thread: threading.Thread(target=music, args=('first argument of music', second_argument_of_music))
from time import sleep, ctime
import threading


def music(func, loop):
    for i in range(loop):
        print('music', func, ctime())
        sleep(2)


def movie(func, loop):
    for i in range(loop):
        print('movie', func, ctime())
        sleep(4)


def testOne():
    # Sequential version: movie starts only after music has finished.
    music('简单的歌', 2)
    movie('两杆大烟枪', 2)
    print('all end', ctime())


def testTwo():
    # Threaded version: the three calls run concurrently.
    threads = []
    t1 = threading.Thread(target=music, args=('喜欢的人', 2))
    threads.append(t1)
    t2 = threading.Thread(target=movie, args=('搏击俱乐部', 2))
    threads.append(t2)
    t3 = threading.Thread(target=music, args=('喜欢的人2', 2))
    threads.append(t3)
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before printing 'all end'
    print('all end', ctime())


if __name__ == '__main__':
    testOne()
    #testTwo()
    #testThree()
    #threadsRun()
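t.join() blocks the calling thread until thread t has finished. Joining every thread before the final print guarantees that the 'all end' line is printed last.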
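To make the effect of join() concrete, here is a small sketch with shortened sleeps. The `worker` function, the `events` list, and the timings are illustrative assumptions and are not taken from the book.

```python
import threading
from time import sleep, perf_counter

events = []                 # labels, in the order the work finished
lock = threading.Lock()     # protects `events` across threads


def worker(label, seconds):
    sleep(seconds)
    with lock:
        events.append(label)


start = perf_counter()
threads = [
    threading.Thread(target=worker, args=('music', 0.2)),
    threading.Thread(target=worker, args=('movie', 0.4)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()                # block here until this thread has finished
events.append('all end')    # runs only after both workers are done
elapsed = perf_counter() - start

print(events)
print('elapsed: %.2f s' % elapsed)
```

Because both workers sleep at the same time, the elapsed time should be close to the longest sleep rather than the sum of the two, and 'all end' is always the last entry.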
Creating a thread wrapper class
Inherit from Thread when declaring the class: class MyThread(threading.Thread)
class MyThread(threading.Thread):

    def __init__(self, func, args, name):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.name = name

    def run(self):
        self.func(*self.args)
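self: the class instance, passed implicitly
func: the function the thread will call
args: a tuple of arguments for func
name: the thread name, here the function's __name__ (for example super_play.__name__)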
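The parameters above can be checked with a short sketch. The `greet` function and the `seen` list are hypothetical and exist only to show what ends up in func, args, and name.

```python
import threading


class MyThread(threading.Thread):

    def __init__(self, func, args, name):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.name = name      # Thread.name is a settable property

    def run(self):
        # Called in the new thread once start() is invoked.
        self.func(*self.args)


seen = []


def greet(who, times):
    # current_thread() returns the Thread object running this function.
    seen.append((threading.current_thread().name, who, times))


t = MyThread(greet, ('reader', 2), greet.__name__)
t.start()
t.join()
print(t.name)
print(seen)
```

Naming the thread after the function is only a convention. Any string works, and the name mainly helps when reading logs or debugging.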
Complete code:
# Imports needed when this listing is run on its own.
from time import sleep, ctime
import threading


class MyThread(threading.Thread):

    def __init__(self, func, args, name):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.name = name

    def run(self):
        self.func(*self.args)


def super_play(file_, time):
    # Print the file name three times, sleeping `time` seconds after each.
    for i in range(3):
        print('play', file_, ctime())
        sleep(time)


# Unused stub; super_play's `time` parameter is unrelated to it.
def time(args):
    pass


def testThree():
    threads = []
    lists = {'气球.mp3': 3, '电影.rmvb': 4, 'last.avg': 2}
    for file_, time_ in lists.items():
        t = MyThread(super_play, (file_, time_), super_play.__name__)
        threads.append(t)

    files = range(len(lists))

    for f in files:
        threads[f].start()
    for f in files:
        threads[f].join()

    print('all end', ctime())


if __name__ == '__main__':
    testThree()
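As a rough sketch of how the same start/join pattern might carry over to the crawler, the example below downloads several chapters concurrently. The crawler's real download code is not shown in this article, so `fetch_chapter`, `download`, the chapter names, and the delays are all hypothetical, and the network request is simulated with a sleep.

```python
import threading
from time import sleep

results = {}                    # chapter title -> downloaded text
results_lock = threading.Lock()


def fetch_chapter(title, delay):
    # Hypothetical stand-in for a real HTTP request and page parse.
    sleep(delay)
    return 'text of ' + title


def download(title, delay):
    text = fetch_chapter(title, delay)
    with results_lock:          # one writer at a time
        results[title] = text


chapters = {'chapter-1': 0.3, 'chapter-2': 0.1, 'chapter-3': 0.2}
threads = [
    threading.Thread(target=download, args=(title, delay), name=title)
    for title, delay in chapters.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for title in chapters:          # print in the original chapter order
    print(title, '->', results[title])
```

The chapters finish in a different order than they were started, so the results are stored by title and read back in chapter order once every thread has been joined.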