Fun Things You Can Do with Python: Collecting Your Favorite Blog Posts into a PDF

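I have been learning ETL recently, so I searched for the keyword on 天善. The table of contents alone looked impressive: "Data Warehouse Design", "ETL Design Framework", and so on. As someone who loves learning, seeing so many substantial blog posts naturally made me want to pick up a new skill (flag +1). I prefer reading on my phone, though, and the web version is clearly not convenient there, so I decided to convert the posts to PDF and keep a copy (no time like the present).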

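1. Collecting the data

It has been a while since I walked through the analysis in a post, so today I'll go over it again from the beginning, long-winded as that may be.

In Chrome, right-click the page and choose Inspect, then select Network in the panel that opens. Now click the link of the blog post we want to view. For the 天善 community blog, the request we are after is clearly the one named 972. The request is not always listed under the XHR category, so each site has to be analyzed on its own terms.

Click 972 and look at its Request URL. Copy that link and open it: the content is identical to the page we just opened, so at this point we are halfway done.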

import requests

# Request URL copied from the Network panel
url = 'https://ask.hellobi.com/blog/biwork/972'

# Browser-like headers so the request looks like a normal page visit
my_headers = {
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "ask.hellobi.com",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36",
}

r = requests.get(url=url, headers=my_headers)
print(r.content)
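The post URL above ends in a numeric ID (972) preceded by the blog owner's name (biwork). Both pieces are useful later for naming the output files. Here is a minimal sketch of pulling them apart, assuming every post URL follows the layout /blog/<user>/<id>; `parse_blog_url` is a hypothetical helper, not part of the original script.

```python
from urllib.parse import urlparse


def parse_blog_url(url):
    """Split a post URL of the assumed form /blog/<user>/<post_id>."""
    # Drop empty segments produced by leading or trailing slashes
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "blog" or not parts[2].isdigit():
        raise ValueError("unexpected blog URL layout: " + url)
    return parts[1], int(parts[2])


user_name, post_id = parse_blog_url("https://ask.hellobi.com/blog/biwork/972")
print(user_name, post_id)  # biwork 972
```

Rejecting URLs that do not match the layout keeps pages such as the sitemap from being mistaken for a post.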

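If you think that means the job is half done, you would be very wrong. We want to convert the whole catalog to PDF, so collecting a single post is not enough; we need the addresses of all the posts.

The 天善 blog community provides a blog sitemap, which makes extracting every link much easier, so the next step is to collect those links.

Code: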

import requests
from lxml import etree
from fake_useragent import UserAgent

ua = UserAgent()

map_url = 'https://ask.hellobi.com/blog/biwork/sitemap/'
map_headers = {
    "Host": "ask.hellobi.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": ua.random,  # random User-Agent string for each run
}

rel = requests.get(url=map_url, headers=map_headers).content
html = etree.HTML(rel)
datas = html.xpath('//div[@class="col-md-12"]')

list_url = []
# Walk the children of the first matching div and collect the post links
for data in datas[0]:
    blog_urls = data.xpath('./li/a/@href')
    list_url.extend(blog_urls)
    print(blog_urls)
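The href values collected from a sitemap may be relative, may carry a #fragment, or may appear more than once. The following sketch cleans up such a list with the standard library only. It assumes nothing about what the 天善 sitemap actually returns; `normalize_links` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Return absolute, fragment-free, de-duplicated URLs in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links against the page they were found on,
        # then drop any #fragment so the same post is not fetched twice
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if absolute and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample = ["/blog/biwork/972", "https://ask.hellobi.com/blog/biwork/972#top"]
print(normalize_links("https://ask.hellobi.com/blog/biwork/sitemap/", sample))
```

Keeping first-seen order means the posts end up in the PDF in the same order as they appear in the sitemap.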

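2. Converting the web pages to PDF

To produce the PDF we need a very handy tool, wkhtmltopdf: https://wkhtmltopdf.org/

When it generates a PDF, it automatically builds a tree-structured outline from the heading tags in the HTML page, and with the corresponding functions you can also convert just a chosen part of a page to PDF.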

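Enter these commands in a terminal:

pip install wkhtmltopdf

pip install pdfkit

Note that pdfkit drives the wkhtmltopdf executable, so the program itself also has to be installed from the site above and be available on the PATH.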
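pdfkit accepts an `options` dict whose keys mirror wkhtmltopdf's command-line flags without the leading dashes. The values below are only an assumed starting point, not the settings used by the original script, and `as_cli_flags` is a hypothetical helper that merely shows which flags such a dict corresponds to.

```python
# Keys are wkhtmltopdf flag names without the leading "--"
options = {
    "page-size": "A4",
    "margin-top": "0.75in",
    "margin-bottom": "0.75in",
    "encoding": "UTF-8",   # avoids garbled Chinese text in the output
    "outline-depth": 3,    # how many heading levels go into the PDF outline
}


def as_cli_flags(opts):
    """Illustrative only: list the command-line flags a dict stands for."""
    flags = []
    for key, value in opts.items():
        flags.append("--" + key)
        # Flags without a value (for example "no-outline") are passed bare
        if value not in (None, ""):
            flags.append(str(value))
    return flags


print(" ".join(as_cli_flags(options)))
```

A dict like this can be passed as `options=options` to `pdfkit.from_file`.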

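Inspect the page's HTML and locate the title, the author, and the post body.

from bs4 import BeautifulSoup

# result holds the HTML of one blog post fetched with requests
soup = BeautifulSoup(result, 'html.parser')
body = soup.find_all(class_='message-content')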

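Write it to an HTML file:

# Center the title and put it at the top of the post body.
# title, index, html_template, htmls, user_name and options are not
# defined in this excerpt; see the complete code linked at the end.
center_title = soup.new_tag('center')
title_tag = soup.new_tag('h1')
title_tag.string = title
center_title.append(title_tag)

# find_all returns a list of matches, so take the first one before serializing
content = body[0]
content.insert(0, center_title)
html = str(content)

html = html_template.format(content=html)
html = html.encode("utf-8")

f_name = ".".join([str(index), "html"])
with open(f_name, 'wb') as f:
    f.write(html)
htmls.append(f_name)

# Merge the html files into a single pdf, named "<user_name>的文章合辑.pdf"
# ("<user_name>'s collected posts")
pdfkit.from_file(htmls, user_name + "的文章合辑.pdf", options=options)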
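The excerpt above relies on `html_template` without showing it. One plausible shape is sketched below: the template text and the `write_page` helper are assumptions for illustration, not the author's code. Declaring the charset in the template matters because wkhtmltopdf reads the saved files from disk.

```python
from html import escape
from pathlib import Path

# Minimal page skeleton; {content} is filled in for each post
html_template = """<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""


def write_page(index, title, body_html, out_dir="."):
    """Wrap one post body in the template and save it as <index>.html."""
    # Escape the title so characters like < or & cannot break the markup
    heading = "<center><h1>{}</h1></center>".format(escape(title))
    page = html_template.format(content=heading + "\n" + body_html)
    path = Path(out_dir) / "{}.html".format(index)
    path.write_bytes(page.encode("utf-8"))
    return str(path)
```

Calling `write_page(0, title, html)` for each post and collecting the returned paths gives the list of files that `pdfkit.from_file` expects.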

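Run complete: the resulting PDF

All done. Now we can happily convert any blog we want to read into a PDF.

The complete code has been uploaded to GitHub: https://github.com/ReainL/tszn_pdf

Author: 许胜利

Source: 天善智能 community, https://ask.hellobi.com/blog/zhiji/11321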