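Baidu Tieba Crawler

Scraping the 王者荣耀 bar with a Python script

This post describes how I scraped posts from the 王者荣耀 (Honor of Kings) bar on Baidu Tieba to meet a project requirement.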

I. Installing the third-party libraries

pip install requests
pip install bs4
pip install lxml
pip install html5lib

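II. Analyzing the page source

1. Finding the pattern in the request URL

Open the developer tools with F12, search Baidu for 王者荣耀吧, and locate the corresponding request in the Network tab. The request URL follows this pattern:

https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80&ie=utf-8&tab=corearea&pn=450
kw: name of the bar (王者荣耀, URL-encoded)
ie: character encoding
tab: home-page tab
pn: page offset (the code below steps it by 50 per page, so pn=450 is page 10)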
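
To double-check the parameter breakdown above, the URL can be decoded with the standard library. This is a minimal sketch using only urllib.parse; the variable names are my own and not part of the crawler.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = ('https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80'
       '&ie=utf-8&tab=corearea&pn=450')

# Decode the query string into a dict with one value per key
params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
print(params)

# Rebuild the query string from the decoded parameters
rebuilt = 'https://tieba.baidu.com/f?' + urlencode(params)
print(rebuilt == url)
```

The round trip shows that kw is simply the UTF-8, percent-encoded bar name, which is why the crawler can pass the plain string and let the HTTP library encode it.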

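The home page has several tabs. Click through them one at a time and note which value the tab field takes in the URL for each; the code below uses corearea (core area) and main (thread list).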

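2. Analyzing the response

Looking at the response to that request, the summary of every post is already present. The list is not loaded via AJAX, so there is no need to inspect XHR requests.

Note, however, that the post information we want is not inside the <html>...</html> tags. It comes afterwards, wrapped in <code>...</code> tags, and inside those it is an HTML comment. If the page is handed to BeautifulSoup unmodified, the parser never reaches the content we need.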
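
The layout described above can be reproduced on a toy page. The sketch below is not the approach this post takes (the post patches the HTML and uses BeautifulSoup); it uses a regular expression from the standard library purely to illustrate where the data sits. The toy page is made up, the helper name extract_commented_block is hypothetical, and the id is the one used in the parsing code later in this post.

```python
import re

# Toy page mimicking the layout: the thread list sits in a <code> tag
# after </html>, wrapped in an HTML comment.
page = (
    '<html><body><div>header</div></body></html>'
    '<code id="pagelet_html_frs-list/pagelet/thread_list">'
    '<!--<ul><li class="j_thread_list clearfix">post 1</li></ul>-->'
    '</code>'
)


def extract_commented_block(html, code_id):
    """Return the text inside <!-- --> within <code id=code_id>, or None."""
    pattern = (r'<code[^>]*id="' + re.escape(code_id) + r'"[^>]*>\s*'
               r'<!--(.*?)-->\s*</code>')
    match = re.search(pattern, html, flags=re.S)
    return match.group(1) if match else None


inner = extract_commented_block(page, 'pagelet_html_frs-list/pagelet/thread_list')
print(inner)
```

The extracted string is itself HTML, so it still has to be parsed a second time to get at the individual posts.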

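3. Scraping

Requirements: scrape posts from the 王者荣耀 bar and save them locally, with options to choose the number of pages, the tab, a date limit on the posts, and keywords the posts must contain. Each post consists of its title, link, posting date, full content, and all replies.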
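
The keyword requirement could be sketched as below. This is only an illustration: the function name filter_by_keywords is hypothetical, the sample posts are invented, and it assumes each post is a dict with title and abstract keys like the ones built by the parsing function later in this post. The keywords are the ones listed in that function's commented-out block.

```python
def filter_by_keywords(posts, keywords):
    """Keep posts whose title or abstract contains any of the keywords."""
    kept = []
    for post in posts:
        text = post.get('title', '') + ' ' + post.get('abstract', '')
        if any(kw in text for kw in keywords):
            kept.append(post)
    return kept


# Invented sample data, for illustration only
sample = [
    {'title': '新赛季掉帧严重', 'abstract': '团战时特别明显'},
    {'title': '求推荐英雄', 'abstract': '新手上分'},
    {'title': '手机问题', 'abstract': '玩十分钟就发热'},
]
matched = filter_by_keywords(sample, ['发热', '卡顿', '掉帧', '卡死'])
print([p['title'] for p in matched])
```

Checking both title and abstract keeps posts where the symptom is only mentioned in the body preview.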

(1) Send the request and get the response

import requests


def get_html(post_name, tab, pn):
    """
    Fetch one list page and return its HTML.
    :param post_name: name of the bar
    :param tab: tab name
    :param pn: page offset (0, 50, 100, ...)
    :return: the patched HTML string, or 'ERROR' if the request failed
    """
    try:
        url = 'https://tieba.baidu.com/f'

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # tab values:
        # core area: corearea; thread list: main
        data = {
            'kw': post_name,
            'tab': tab,
            'pn': pn,
        }
        response = requests.get(url, params=data, headers=headers, timeout=30)
        response.raise_for_status()
        # The page must be patched: move the closing tags to the very end.
        # Otherwise the parser stops at the original </html> and the content
        # of the <code> tags after it is discarded.
        html = response.text.replace('</body>', '')
        html = html.replace('</html>', '')
        return html + '</body></html>'
    except requests.RequestException:
        # Network errors, timeouts and HTTP error statuses all end up here
        return 'ERROR'

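Note the comment in the code. As the earlier analysis showed, the content we want lies outside <html><body>...</body></html>, so BeautifulSoup would never reach it. That is why the fetched source has to be patched.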
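
The patching step can be tried in isolation on a toy string. This is a minimal sketch of the same two replace calls; the helper name move_closing_tags and the sample markup are made up for illustration.

```python
def move_closing_tags(html):
    """Remove </body> and </html> wherever they occur and re-append them at the end."""
    stripped = html.replace('</body>', '').replace('</html>', '')
    return stripped + '</body></html>'


raw = '<html><body><p>visible</p></body></html><code><!--hidden--></code>'
patched = move_closing_tags(raw)
print(patched)
```

After patching, the <code> element sits inside <body>, so a parser that stops at </html> still sees it.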

(2) Parse the response

import time

from bs4 import BeautifulSoup


def get_post_info(html, m, pn):
    """
    Extract the title, link, date and abstract of each post, then keep the recent ones.
    :param html: patched HTML page
    :param m: how many months back to keep
    :param pn: page offset
    :return: list of post dicts
    """
    url = 'https://tieba.baidu.com'
    soup = BeautifulSoup(html, 'lxml')
    # Find the target <code> tag; find_all returns a list of tags
    code = soup.find_all('code', attrs={'id': 'pagelet_html_frs-list/pagelet/thread_list'})
    if not code or not code[0].contents:
        # Also covers the 'ERROR' string returned by get_html
        print('Error: thread list not found on this page')
        return []
    # The content of the <code> tag is a comment node
    comment = code[0].contents
    # Parse the comment text as HTML in its own right
    soup = BeautifulSoup(comment[0], 'lxml')

    info = []

    # # Sticky posts first
    # litags_top = soup.find_all('li', attrs={'class': 'j_thread_list thread_top j_thread_list clearfix'})
    # for li in litags_top:
    #     info_top = dict()
    #     try:
    #         info_top['title'] = li.find('a', attrs={'class': 'j_th_tit'}).text.strip()
    #         info_top['link'] = ''.join([url, li.find('a', attrs={'class': 'j_th_tit'})['href']])
    #         info_top['time'] = li.find('span', attrs={'class': 'pull-right is_show_create_time'}).text.strip()
    #         info.append(info_top)
    #     except Exception:
    #         print("Error: failed to read a sticky post title")

    # Regular posts: extract title, link, posting date and abstract
    litags = soup.find_all('li', attrs={'class': 'j_thread_list clearfix'})
    for li in litags:
        try:
            info_norm = dict()
            info_norm['title'] = li.find('a', attrs={'class': 'j_th_tit'}).text.strip()
            info_norm['link'] = ''.join([url, li.find('a', attrs={'class': 'j_th_tit'})['href']])
            info_norm['date'] = li.find('span', attrs={'class': 'pull-right is_show_create_time'}).text.strip()
            info_norm['abstract'] = li.find(
                'div', attrs={'class': 'threadlist_abs threadlist_abs_onlyline'}).text.strip()
            info.append(info_norm)
        except AttributeError as e:
            print("Error: %s (a tag was probably not found)" % e.args)
        except Exception:
            print("Error: failed to read a post title or abstract")

    print('Page %s fetched, processing...' % (pn // 50 + 1))
    # Keep posts from the last m months. Dates on the list page carry no year.
    today = time.localtime()
    month, day = today.tm_mon, today.tm_mday

    info_new = []
    for post in info:
        if ':' in post['date']:
            # Today's posts show a time (HH:MM) instead of a date
            info_new.append(post)
            continue
        try:
            post_month, post_day = [int(x) for x in post['date'].split('-')]
        except ValueError:
            continue
        if not 1 <= post_month <= 12:
            continue  # not a month-day string
        age = (month - post_month) % 12  # months back, wrapping around the new year
        if (age == 0 and post_day <= day) or 0 < age < m or (age == m and post_day >= day):
            info_new.append(post)

    # # Store posts separately per keyword
    # keywords = ['发热', '卡顿', '掉帧', '卡死']
    # num = len(keywords)
    # info_has_kw = [[] for i in range(num)]
    # for post in info_new:
    #     for i in range(num):
    #         if keywords[i] in post['abstract']:
    #             info_has_kw[i].append(post)
    #             break

    print('Page %s processed, moving on to the next page...' % (pn // 50 + 1))
    # return info_has_kw
    return info_new
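
The month arithmetic above is easy to get wrong around the new year. As an alternative, here is a sketch that converts the list-page date string into a real date and compares it against a cutoff in days. It is not the method used in this post. It assumes that 'HH:MM' means today and 'M-D' means a day within the past year, and it treats anything else as too old; the function name is_recent is hypothetical.

```python
from datetime import date, timedelta


def is_recent(date_text, today, max_age_days):
    """Return True if a list-page date string falls within max_age_days of today."""
    if ':' in date_text:
        return True  # a time of day means the post is from today
    parts = date_text.split('-')
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return False
    month, day = int(parts[0]), int(parts[1])
    if not 1 <= month <= 12:
        return False  # e.g. a year-month string
    # The string has no year: try this year first, then last year
    for year in (today.year, today.year - 1):
        try:
            posted = date(year, month, day)
        except ValueError:
            continue  # e.g. 2-29 in a non-leap year
        if posted <= today:
            return today - posted <= timedelta(days=max_age_days)
    return False


ref = date(2019, 7, 15)
print(is_recent('12:30', ref, 30), is_recent('7-1', ref, 30), is_recent('6-10', ref, 30))
```

Passing today in as a parameter, rather than calling date.today() inside, makes the function easy to test with fixed dates.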

(3) Save to disk

import os


def save2file(info, savepath=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'post.txt')):
    """
    Append the scraped posts to a local txt file.
    :param info: list of post dicts
    :param savepath: output file path; defaults to post.txt next to this script
    :return:
    """
    # Variant for the per-keyword lists (info_has_kw):
    # num = len(info)
    # for i in range(num):
    #     with open(savepath, 'a+', encoding='utf-8') as f:
    #         for post in info[i]:
    #             f.write('Title: {} \t Link: {}\n'.format(post['title'], post['link']))
    # Write as UTF-8 explicitly and put each post on its own line
    with open(savepath, 'a+', encoding='utf-8') as f:
        for post in info:
            f.write('Title: {} \t Link: {}\n'.format(post['title'], post['link']))
    print("This page has been saved to disk.")
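
If the saved posts need to be read back by another program, one JSON object per line is easier to parse than tab-separated text. The sketch below is an optional alternative, not what this post does; save_jsonl and load_jsonl are hypothetical names, and the demo post (including its link) is a made-up placeholder.

```python
import json
import os
import tempfile


def save_jsonl(posts, path):
    """Append each post as one JSON object per line, UTF-8 encoded."""
    with open(path, 'a', encoding='utf-8') as f:
        for post in posts:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(post, ensure_ascii=False) + '\n')


def load_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder data written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'post.jsonl')
demo_posts = [{'title': '示例帖子', 'link': 'https://tieba.baidu.com/p/0', 'date': '7-12'}]
save_jsonl(demo_posts, demo_path)
print(load_jsonl(demo_path))
```

This format also keeps the date and abstract fields, which the txt output drops.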

(4) Main program

if __name__ == '__main__':
    post_name = '王者荣耀'
    tab = 'main'
    # The loop controls how many pages are scraped
    for pn in range(10):
        html = get_html(post_name, tab, pn*50)
        info = get_post_info(html, 3, pn*50)
        # print(info)
        save2file(info)
    print('-------All posts downloaded-------')
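
The main loop relies on the relationship between the loop index and the pn offset: page k (counting from zero) is requested with pn = k * 50. A small sketch of that mapping, with hypothetical helper names and the page size of 50 taken from the code above:

```python
def page_offsets(pages, page_size=50):
    """Yield the pn value for each page: 0, 50, 100, ..."""
    for page in range(pages):
        yield page * page_size


def page_number(pn, page_size=50):
    """Convert a pn offset back to a 1-based page number."""
    return pn // page_size + 1


print(list(page_offsets(3)))
print(page_number(450))
```

Using integer division here avoids printing page numbers such as 1.0 in the progress messages.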