Scraping Douban Reviews of Avengers: Infinity War with Python and Generating an Adorable Groot

The code is available at:
http://www.demodashi.com/demo/13257.html

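1. Requirements

This project uses a Python crawler to scrape comments on Avengers: Infinity War (复仇者联盟3) from Douban Movie and save them to a local file (the code below fetches the first 8 pages of 20 comments each). The comments are then segmented into words, and a word cloud is generated in the shape of Groot, the tree-like character.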

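2. Implementation

This section explains the crawler code and the code that generates the word cloud image.

Python crawler

First determine the URL of the page to scrape, then fetch the page with requests.get():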

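import requests

# Fetch a page; return its HTML text, or an empty string if the request fails
def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        return ''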
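
Some sites reject requests that carry the default python-requests User-Agent, in which case getHtml would silently return an empty string. The sketch below is one possible variant, not the author's code: it sends a browser-like header through a requests.Session. The names BROWSER_HEADERS, make_session and get_html, and the header value itself, are assumptions made for illustration.

```python
import requests

# Hypothetical header value; any realistic browser User-Agent string would do.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def make_session(headers=BROWSER_HEADERS):
    """Create a session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def get_html(session, url, timeout=30):
    """Return the page text, or an empty string if the request fails."""
    try:
        r = session.get(url, timeout=timeout)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        return ""
```

Reusing one session also keeps cookies and connections across the paginated requests. Because a failure still yields an empty string, the caller may want to log or count empty pages.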

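Once the page has been fetched, parse it with BeautifulSoup and use a CSS selector (not a regular expression) to find the elements that hold the comment text, then extract that text.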

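from bs4 import BeautifulSoup

# Extract the comments from one page of HTML
def getComment(html):
    soup = BeautifulSoup(html, 'html.parser')
    comments_list = []
    # Select every <p> that is a direct child of an element with class "comment"
    comment_nodes = soup.select('.comment > p')
    for node in comment_nodes:
        # Collapse each comment to a single line and end it with a newline
        comments_list.append(node.get_text().strip().replace("\n", "") + '\n')
    return comments_list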
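
To make the selector concrete, here is a small sketch that runs the same '.comment > p' query on a hand-written HTML fragment. The fragment is hypothetical and much simpler than the real Douban markup; extract_comments is an illustrative name, not part of the original project.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to illustrate the selector.
SAMPLE_HTML = """
<div class="comment-item">
  <div class="comment">
    <h3><span class="comment-info">user_a</span></h3>
    <p>
      First review.
    </p>
  </div>
  <div class="comment">
    <p>Second review.</p>
    <div class="reply"><p>Nested paragraph, not a direct child.</p></div>
  </div>
</div>
"""


def extract_comments(html):
    """Return the text of each <p> that is a direct child of .comment."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text().strip().replace("\n", "")
            for p in soup.select(".comment > p")]


for comment in extract_comments(SAMPLE_HTML):
    print(comment)
```

The child combinator (>) is what keeps the paragraph inside the nested reply block out of the result; a plain descendant selector ('.comment p') would include it.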

Save the scraped comments to a text file for later analysis.

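import random
import time

# Scrape `depth` pages of comments and append them to the file at fpath
def saveCommentText(fpath):
    pre_url = "https://movie.douban.com/subject/24773958/comments?"
    depth = 8
    with open(fpath, 'a', encoding='utf-8') as f:
        for i in range(depth):
            # Each page holds 20 comments, so page i starts at offset 20 * i
            url = pre_url + 'start=' + str(20 * i) + '&limit=20&sort=new_score&status=P'
            html = getHtml(url)
            f.writelines(getComment(html))
            # Pause for a random 1.05 to 2 seconds between requests
            time.sleep(1 + random.randint(1, 20) / 20)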
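
The pagination and the randomized pause can be looked at in isolation. The following sketch builds the same query strings with urllib.parse.urlencode instead of string concatenation and computes the same delay; build_page_urls and polite_delay are illustrative names, and the page size of 20 is taken from the code above.

```python
import random
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/24773958/comments"


def build_page_urls(depth, page_size=20):
    """Return one URL per page; page i starts at offset page_size * i."""
    urls = []
    for i in range(depth):
        query = urlencode({
            "start": page_size * i,
            "limit": page_size,
            "sort": "new_score",
            "status": "P",
        })
        urls.append(BASE_URL + "?" + query)
    return urls


def polite_delay():
    """Random pause in seconds, between 1.05 and 2.0 in steps of 0.05."""
    return 1 + random.randint(1, 20) / 20


for url in build_page_urls(3):
    print(url)
```

With depth = 8 this covers offsets 0 through 140, that is, at most 160 comments.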

Generating the image from a word cloud

The code is commented in detail; see the comments for explanation.

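import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

def drawWordcloud():
    # text.txt holds the comments after word segmentation
    with open('text.txt', encoding='utf-8') as f:
        comment_text = f.read()
    # Background image used as the mask; any image in the img directory can be used
    color_mask = plt.imread("img/Groot4.jpeg")
    # Stopwords to leave out of the cloud
    stopwords = {'就是', '电影', '你们', '这么', '不过', '但是',
                 '除了', '时候', '已经', '可以', '只是', '还是', '只有', '不要', '觉得', '，', '。'}
    # Word cloud settings; simhei.ttf is a font that can render Chinese
    cloud = WordCloud(font_path="simhei.ttf",
                      background_color='white',
                      max_words=260,
                      max_font_size=150,
                      min_font_size=4,
                      mask=color_mask,
                      stopwords=stopwords)
    # Generate the cloud. generate() takes the full text; alternatively, compute
    # word frequencies yourself and call generate_from_frequencies()
    word_cloud = cloud.generate(comment_text)
    # Take the colors from the background image (mind the image size)
    image_colors = ImageColorGenerator(color_mask)

    # Show the word cloud with its default colors
    plt.imshow(cloud)
    plt.axis("off")
    # Show the word cloud recolored from the background image
    plt.figure()
    plt.imshow(cloud.recolor(color_func=image_colors))
    plt.axis("off")
    # Show the background image itself
    plt.figure()
    plt.imshow(color_mask, cmap=plt.cm.gray)
    plt.axis("off")
    plt.show()
    # Save the image
    word_cloud.to_file("img/comment_cloud.jpg")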
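
The comment in the code mentions generate_from_frequencies as an alternative to generate. A minimal sketch of the counting step, using only the standard library, might look like this. It assumes the text has already been segmented into space-separated tokens; count_words and the sample string are made up for illustration.

```python
from collections import Counter


def count_words(segmented_text, stopwords):
    """Count space-separated tokens, skipping any token in stopwords."""
    return Counter(w for w in segmented_text.split() if w not in stopwords)


# Hypothetical sample of already-segmented text.
sample = "灭霸 电影 灭霸 格鲁特 ， 电影 灭霸"
freqs = count_words(sample, stopwords={"电影", "，"})
print(freqs.most_common(2))

# With the wordcloud package installed, the counts could then be passed on:
# cloud.generate_from_frequencies(freqs)
```

According to the wordcloud documentation, the stopwords argument of WordCloud is ignored when generate_from_frequencies is used, so the filtering has to happen during counting, as it does here.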

For easier reading, here is the complete code for the whole process:

import random
import time

import jieba
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud, ImageColorGenerator


# Fetch a page; return its HTML text, or an empty string if the request fails
def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        return ''


# Extract the comments from one page of HTML
def getComment(html):
    soup = BeautifulSoup(html, 'html.parser')
    comments_list = []
    comment_nodes = soup.select('.comment > p')
    for node in comment_nodes:
        comments_list.append(node.get_text().strip().replace("\n", "") + '\n')
    return comments_list


# Scrape `depth` pages of comments and append them to the file at fpath
def saveCommentText(fpath):
    pre_url = "https://movie.douban.com/subject/24773958/comments?"
    depth = 8
    with open(fpath, 'a', encoding='utf-8') as f:
        for i in range(depth):
            url = pre_url + 'start=' + str(20 * i) + '&limit=20&sort=new_score&status=P'
            html = getHtml(url)
            f.writelines(getComment(html))
            # Pause for a random 1.05 to 2 seconds between requests
            time.sleep(1 + random.randint(1, 20) / 20)


# Segment the saved comments with jieba and append the result to text.txt
def cutWords(fpath):
    text = ''
    with open(fpath, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip('\n')
            text += ' '.join(jieba.cut(line))
            text += ' '
    with open('text.txt', 'a', encoding='utf-8') as f:
        f.write(text)


def drawWordcloud():
    with open('text.txt', encoding='utf-8') as f:
        comment_text = f.read()
    # Background image used as the mask
    color_mask = plt.imread("img/Groot4.jpeg")
    # Stopwords to leave out of the cloud
    stopwords = {'就是', '电影', '你们', '这么', '不过', '但是',
                 '除了', '时候', '已经', '可以', '只是', '还是', '只有', '不要', '觉得', '，', '。'}
    # Word cloud settings
    cloud = WordCloud(font_path="simhei.ttf",
                      background_color='white',
                      max_words=260,
                      max_font_size=150,
                      min_font_size=4,
                      mask=color_mask,
                      stopwords=stopwords)
    # Generate the cloud. generate() takes the full text; alternatively, compute
    # word frequencies yourself and call generate_from_frequencies()
    word_cloud = cloud.generate(comment_text)
    # Take the colors from the background image (mind the image size)
    image_colors = ImageColorGenerator(color_mask)

    # Show the word cloud with its default colors
    plt.imshow(cloud)
    plt.axis("off")
    # Show the word cloud recolored from the background image
    plt.figure()
    plt.imshow(cloud.recolor(color_func=image_colors))
    plt.axis("off")
    # Show the background image itself
    plt.figure()
    plt.imshow(color_mask, cmap=plt.cm.gray)
    plt.axis("off")
    plt.show()
    # Save the image
    word_cloud.to_file("img/comment_cloud.jpg")

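3. Project Structure

(Figure: project structure)

Note that the whole project contains only one source file; the other files are images.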

4. Results

A wave of Groots incoming

Groot No. 1

Groot No. 2

Groot No. 3

Groot No. 4

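Note: Copyright of this article belongs to the author. It is published on the author's behalf by demo大师. Reposting is not permitted without the author's authorization.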