Scraping Baidu Tieba Images with Python 3

The address I scrape is http://tieba.baidu.com/p/3125473879?pn=2. The thread has roughly 82 pages, and the code below downloads all the images from those 82 pages:

"""抓取百度贴吧图片"""
#导入模块

import re

import urllib

from urllib.request import urlopen,urlretrieve
#获取抓取页面的源代码

def getHtml(url):

    page = urlopen(url)

    html = str(page.read())

    page.close()

    return html
#通过源代码以及正则表达式，匹配我们的url

def getImg(html):

    reg = r'<img class="BDE_Image" src="(.+?\.jpg)" '

    imgre = re.compile(reg)

    imglist = re.findall(imgre,html)

    x = 0

    for imgurl in imglist:

        urlretrieve(imgurl,'C:\\Users\\Water\\PycharmProjects\\test\\image\\%s-%s.jpg' % (i,x))

        x = x + 1
#调用函数

i = 1

while i < 83:

    html = getHtml("http://tieba.baidu.com/p/3125473879?pn=" + str(i))

    getImg(html)

    i+=1

    print(i)
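
To see what the regular expression in getImg actually captures without hitting the network, here is a minimal sketch that runs the same pattern over a hand-written HTML fragment. The names extract_image_urls, build_filenames, and sample_html, as well as the example.com URLs, are hypothetical and only serve this illustration. The real Tieba markup may differ from the sample.

```python
import re

# Same pattern as in getImg: capture the src of each BDE_Image tag ending in .jpg
IMG_PATTERN = re.compile(r'<img class="BDE_Image" src="(.+?\.jpg)" ')


def extract_image_urls(html):
    """Return every image URL matched by the pattern, in page order."""
    return IMG_PATTERN.findall(html)


def build_filenames(page_no, count):
    """Build the '<page>-<index>.jpg' names used when saving images."""
    return ['%s-%s.jpg' % (page_no, x) for x in range(count)]


# Hypothetical sample markup, written by hand for this example
sample_html = (
    '<div><img class="BDE_Image" src="http://example.com/a.jpg" width="560">'
    '<img class="other" src="http://example.com/skip.jpg" width="10">'
    '<img class="BDE_Image" src="http://example.com/b.jpg" width="560"></div>'
)

urls = extract_image_urls(sample_html)
names = build_filenames(1, len(urls))
print(urls)
print(names)
```

Only tags whose class is exactly BDE_Image and whose src ends in .jpg followed by a closing quote and a space are matched, so the second tag in the sample is skipped. Images in other formats, or tags with attributes in a different order, would also be missed by this pattern.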

The results of the scrape are shown below. This is only a brief write-up; I will cover the details in a later post.

This article is reposted from wzlinux's 51CTO blog. Original link: http://blog.51cto.com/wzlinux/1787514. To republish it, please contact the original author.