1. Open maoyan.com in Google Chrome and click the Top 100 board.
2. Observe the pagination route and construct the page URL: url = 'http://maoyan.com/board/4?offset=' + str(offset)
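As a quick sanity check on this URL scheme, the ten board pages (10 movies per page) are reached with offset = 0, 10, ..., 90 (a minimal sketch of the URL construction only; it does not hit the site):

```python
base = 'http://maoyan.com/board/4?offset='

# The Top 100 board shows 10 movies per page, so ten offsets cover it all.
urls = [base + str(i * 10) for i in range(10)]

print(urls[0])   # http://maoyan.com/board/4?offset=0
print(urls[-1])  # http://maoyan.com/board/4?offset=90
```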
3. In the browser's developer tools, inspect the ranked movie entries. For each movie we want to scrape the rank (index), poster image URL, title, actors, release time, and score.
4. Fetch a page's HTML:
```python
import requests
from requests.exceptions import RequestException

user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
headers = {'User-Agent': user_agent}

def get_one_page(url):
    """Fetch one page and return its HTML text, or None on failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
```
5. Parse the page and extract the data:
```python
from bs4 import BeautifulSoup

def parse_one_page(html):
    """Yield a dict of fields for each movie entry (<dd> element) on the page."""
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.select('dd'):
        yield {
            'index': item.find('i').text,
            'image': item.find('img', class_='board-img').get('data-src'),
            'title': item.find('p').text,
            # slice off the "主演：" (starring) prefix
            'actor': item.find('p', class_='star').text.strip()[3:],
            # slice off the "上映时间：" (release time) prefix
            'time': item.find('p', class_='releasetime').text.strip()[5:],
            # the integer and fractional parts of the score are separate <i> tags
            'score': item.find('i', class_='integer').text
                     + item.find('i', class_='fraction').text,
        }
```
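To see what these selectors pull out, here is the same extraction run on a minimal hand-written <dd> fragment that mirrors the board's markup (the fragment and its field values are illustrative, not copied from the live page; html.parser is used so the sketch has no lxml dependency):

```python
from bs4 import BeautifulSoup

# Illustrative fragment modeled on the board markup the parser expects.
sample = '''<dd>
  <i class="board-index">1</i>
  <img class="board-img" data-src="http://example.com/poster.jpg" alt="">
  <p class="name">霸王别姬</p>
  <p class="star">主演：张国荣,张丰毅,巩俐</p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <i class="integer">9.</i><i class="fraction">6</i>
</dd>'''

item = BeautifulSoup(sample, 'html.parser').find('dd')
record = {
    'index': item.find('i').text,
    'image': item.find('img', class_='board-img').get('data-src'),
    'title': item.find('p').text,
    'actor': item.find('p', class_='star').text.strip()[3:],   # drop "主演："
    'time': item.find('p', class_='releasetime').text.strip()[5:],  # drop "上映时间："
    'score': item.find('i', class_='integer').text
             + item.find('i', class_='fraction').text,
}
print(record)
```

Note how the `[3:]` and `[5:]` slices remove the fixed-length Chinese prefixes "主演：" and "上映时间：" from the text.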
6. The crawler's main function:
```python
import json

def write_to_file(item):
    """Append one movie record per line as JSON."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)
```
7. Run the crawler with multiprocessing:
```python
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    # Each page shows 10 movies, so offsets 0, 10, ..., 90 cover the Top 100.
    pool.map(main, [i * 10 for i in range(10)])
    pool.close()
    pool.join()
```
Full code: https://github.com/huazhicai/Spider/tree/master/maoyantop