lmt921108

1.爬虫实战项目,爬取小说,只能爬取免费小说(VIP小说需要充钱登陆:方法有所差异,后续会进行讲解)

  本教程出于学习目的,如有犯规,请留言联系

  爬取网站:起点中文网,盗墓笔记免费篇

  https://book.qidian.com/info/68223#Catalog

 

2.网页结构分析

 

 

 

 

 

 结构分析发现:每一大标题在div元素里面,是否免费,包含在div元素的孙子元素span的类属性里面(class=\'free\'  还是 class=\'vip\')

因此:如果我们想要提取免费章节小说,需要先根据span元素进行判断。

 

3.完整代码

#!/usr/bin/env python
#-*- coding:utf-8 -*-

\'\'\'爬取盗墓笔记小说免费版
\'\'\'


import requests
from bs4 import BeautifulSoup


headers = {
    \'User-Agent\': \'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36\'
}

class Story(object):
    
    def __init__(self,url):
        self.url = url
    
    
    def get_html(self,url):
        try:
            response = requests.get(url,headers=headers)
            if response.status_code == 200:
                return response.text
            else:
                return None
        except Exception as e:
            print(\'wrong\', e)
    
    
    def get_soup(self,html):
        try:
            soup = BeautifulSoup(html,\'html.parser\')
        except:
            soup = BeautifulSoup(html, \'xml\')
        return soup


    def start(self):
        html = self.get_html(self.url)
        soup = self.get_soup(html)
        
        try:
            free_result = soup.select(\'div.volume span.free\') 
            if free_result:
                for free in free_result:
                    chapters = free.parent.parent.select(\'li a\')  # 理解为什么要找到parent元素
                    for chapter in chapters:
                        title = chapter.text.strip().replace(\' \', \'_\')
                        href = \'https:\' + chapter[\'href\']
                        
                        html = self.get_html(href)
                        soup = self.get_soup(html)
                        content =  soup.select(\'div.read-content\')[0].text.strip().replace(\'\u3000\', \' \')
                        print(\'\033[1;34m开始爬取:  {title}\033[0m\'.format(**locals()))
                        with open(title+\'.txt\', \'w\') as fw:
                            fw.write(content)
        except:
            None
    
    
    
if __name__ == \'__main__\':

    url = \'https://book.qidian.com/info/68223#Catalog\'

    gg = Story(url)
    gg.start()
    

 

分类:

技术点:

相关文章: