minorblog

1. Open maoyan.com in the Chrome browser and click the TOP 100 board.

2. Observe the pagination route and build the page URL: url = 'http://maoyan.com/board/4?offset=' + str(offset)

3. Open the browser's developer tools and inspect the movie entries. For each movie we want to scrape: rank (index), image URL, title, actors, release date, and score.
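As a quick check of the route from step 2, the ten page URLs can be generated up front (the offsets 0, 10, …, 90 are an assumption based on the board showing 10 movies per page):

```python
# Build the ten board-page URLs; Maoyan shows 10 movies per page,
# so offsets step by 10 (assumed from the pagination route in step 2).
urls = ['http://maoyan.com/board/4?offset=' + str(offset)
        for offset in range(0, 100, 10)]

print(urls[0])   # http://maoyan.com/board/4?offset=0
print(urls[-1])  # http://maoyan.com/board/4?offset=90
```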

4. Fetch the HTML of a board page:

import requests
from requests.exceptions import RequestException

user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
headers = {'User-Agent': user_agent}

def get_one_page(url):
    """Request a board page; return its HTML, or None on any failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

5. Parse the page and extract the data:

from bs4 import BeautifulSoup

def parse_one_page(html):
    """Yield one dict per movie entry (each <dd>) on the board page."""
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.select('dd'):
        yield {
            'index': item.find('i').text,   # rank from <i class="board-index">
            'image': item.find('img', class_="board-img").get('data-src'),
            'title': item.find('p', class_="name").text.strip(),
            'actor': item.find('p', class_="star").text.strip()[3:],        # drop the "主演:" prefix
            'time': item.find('p', class_="releasetime").text.strip()[5:],  # drop the "上映时间:" prefix
            'score': item.find('i', class_="integer").text + item.find('i', class_="fraction").text
        }
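The extraction logic above can be exercised offline against a minimal snippet that mimics the board's markup (the `<dd>` structure below is an assumption reconstructed from the fields we scrape, not the site's actual HTML):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one board entry; the tag/class layout is assumed.
sample_html = '''
<dl>
  <dd>
    <i class="board-index">1</i>
    <img class="board-img" data-src="http://example.com/poster.jpg">
    <p class="name">霸王别姬</p>
    <p class="star">主演:张国荣</p>
    <p class="releasetime">上映时间:1993-01-01</p>
    <i class="integer">9.</i><i class="fraction">6</i>
  </dd>
</dl>'''

soup = BeautifulSoup(sample_html, 'html.parser')
item = soup.select('dd')[0]
movie = {
    'index': item.find('i').text,
    'image': item.find('img', class_="board-img").get('data-src'),
    'title': item.find('p', class_="name").text.strip(),
    'actor': item.find('p', class_="star").text.strip()[3:],        # "主演:" is 3 chars
    'time': item.find('p', class_="releasetime").text.strip()[5:],  # "上映时间:" is 5 chars
    'score': item.find('i', class_="integer").text + item.find('i', class_="fraction").text,
}
print(movie['score'])  # 9.6
```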

6. Main crawler function:

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)  # match the route from step 2
    html = get_one_page(url)
    if html is None:
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)
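main calls write_to_file, which is not shown here. A minimal sketch (the filename result.txt is an assumption) appends one JSON object per line, with ensure_ascii=False so the Chinese titles stay readable:

```python
import json

def write_to_file(item, path='result.txt'):
    """Append one movie record as a JSON line (path is a hypothetical default)."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
```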

7. Run with a process pool:

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    # one task per page; offsets 0, 10, ..., 90
    pool.map(main, [i * 10 for i in range(10)])

 

Full code: https://github.com/huazhicai/Spider/tree/master/maoyantop
