Python爬取小说《斗罗大陆》

本人新手，学习python爬虫时间不久，写的不好之处，还请各大神谅解。

这个是我们今天要爬取的小说地址：斗罗大陆

在爬取小说之前，我们先对网页内容进行了解。首先按F12打开开发者工具，分析网页代码，如下图所示：发现所有章节在div id = " list "的下面。这样，我们得到了小说的章节和标题。
在这里插入图片描述
接下来分析每个章节页面的内容，（同上，按F12对正文网页进行检查），经过观察发现，文章的所有内容都存放在div id = " content "的标签下。如下图：

这样，我们的准备工作就算完成了。下面就开始编写我们的爬虫程序了。实现准备要用到的库，requests，re，os，lxml。

接着，我们我们导入相关库，并仿造一个请求头，关于请求头的相关内容这里就不做介绍了，在上一篇博文中有介绍过（爬取表情包）

下面就开始我们的爬虫了。首先，定义一个函数，用来获取网页源代码，并通过etree与xpath对网页源码进行解析，代码如下：

def get_info(url):
    response = requests.get(url, headers=headers)
    response.encoding = \'utf-8\'
    get_info_list = []
    html = etree.HTML(response.text)
    dd_list = html.xpath(\'//*[@id="list"]/dl/dd\')
    for dd in dd_list:
        title = dd.xpath(\'a/text()\')[0]
        href = \'http://www.biquku.la/0/421/\' + dd.xpath(\'a/@href\')[0]
        chapter = {\'title\': title, \'href\': href}
        get_info_list.append(chapter)
    return get_info_list

现在开始获取每个章节的内容，这里用正则表达式获取内容，并写入文件中。代码如下：

def get_content(get_info):
    for chapter_info in get_info:
        response = requests.get(url=chapter_info[\'href\'], headers=headers)
        response.encoding = \'utf-8\'
        if os.path.exists(\'斗罗大陆\'):
            pass
        else:
            os.makedirs(\'斗罗大陆\')
        contents = re.findall(\'<div id="content">(.*?)</div>\', response.text)
        with open(\'./斗罗大陆/\' + chapter_info[\'title\'] + \'.txt\', \'w\', encoding=\'utf-8\') as f:
            for content in contents:
                f.write(content.replace(\'&nbsp;&nbsp;&nbsp;&nbsp;\', \'\').replace(\'<br/><br/>\', \'\n\').strip())
            print(\'下载成功\')

最后调用主函数，执行爬虫程序，代码如下：

if __name__==\'__main__\':
	get_content(get_info(url))

看看我们的爬取结果吧，这下可以在本地阅读了。
在这里插入图片描述
附源代码：

import requests
import re, os
from lxml import etree

headers = {
    \'User-Agent\': \'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36\'
}
url = \'http://www.biquku.la/0/421/\'


def get_info(url):
    response = requests.get(url, headers=headers)
    response.encoding = \'utf-8\'
    get_info_list = []
    html = etree.HTML(response.text)
    dd_list = html.xpath(\'//*[@id="list"]/dl/dd\')
    for dd in dd_list:
        title = dd.xpath(\'a/text()\')[0]
        href = \'http://www.biquku.la/0/421/\' + dd.xpath(\'a/@href\')[0]
        chapter = {\'title\': title, \'href\': href}
        get_info_list.append(chapter)
    return get_info_list


def get_content(get_info):
    for chapter_info in get_info:
        response = requests.get(url=chapter_info[\'href\'], headers=headers)
        response.encoding = \'utf-8\'
        if os.path.exists(\'斗罗大陆\'):
            pass
        else:
            os.makedirs(\'斗罗大陆\')
        contents = re.findall(\'<div id="content">(.*?)</div>\', response.text)
        with open(\'./斗罗大陆/\' + chapter_info[\'title\'] + \'.txt\', \'w\', encoding=\'utf-8\') as f:
            for content in contents:
                f.write(content.replace(\'&nbsp;&nbsp;&nbsp;&nbsp;\', \'\').replace(\'<br/><br/>\', \'\n\').strip())
            print(\'下载成功\')


if __name__ == \'__main__\':
    get_content(get_info(url))

如有错误，欢迎私信纠正，谢谢支持！