[Big Data] Web Crawler Comprehensive Assignment

The assignment requirements come from https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/3075

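Avengers: Endgame, eagerly awaited by fans worldwide, opens in mainland China on April 24. Since presales began, the film has set a series of box office records: in the United States it broke the first-day presale record in just 6 hours, and in mainland China it reached 200 million yuan in presales within 36 hours. It not only set the record for the fastest film to pass 100 million yuan in presales in Chinese film history, but is also the midnight-screening box office champion in Chinese film history. How is the film being received? What was the viewing experience like for audiences? We can use a web crawler to find out!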

First, go to Douban Movie and open the detail page for Avengers: Endgame: https://movie.douban.com/subject/26100958/. The reviews appear at the bottom of the page:

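The page shows 139,215 short comments, but we cannot retrieve all of them: without logging in only 200 short comments are visible, and even after logging in only 500 are available.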

To avoid having our IP banned while crawling, we need to leave an interval between requests:

import time
time.sleep(5)
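
A fixed pause works, but one common variation is to add a small random jitter so that requests are not perfectly periodic. The following is only a minimal sketch under that assumption; `polite_sleep`, `base`, `jitter`, and the injectable `sleep` argument are hypothetical names introduced here for illustration and are not part of the original script.

```python
import random
import time


def polite_sleep(base=5.0, jitter=2.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    The `sleep` argument is injectable so the helper can be tried out
    (or tested) without actually waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` between page requests would then wait somewhere between 5 and 7 seconds each time.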

In addition, we need to send a reasonable User-Agent so that requests look like they come from a real browser:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

Open the developer tools and work out how to extract the information we need:

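We can see that each review sits in an element with the class comment-item, and the star rating is in an element with the class rating. Each page holds only 20 comments, so getting more means moving to the next page.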
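
Since each page holds 20 comments, the URL of any page can be derived from its page index. The sketch below builds those URLs with the standard library, assuming the query parameters `start`, `limit`, `sort`, and `status` that this post's crawl loop sends; `COMMENTS_URL` and `page_url` are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

# Comment listing for the movie discussed in this post
COMMENTS_URL = 'https://movie.douban.com/subject/26100958/comments'


def page_url(page, page_size=20):
    """Return the URL of the given zero-based comment page."""
    query = urlencode({
        'start': page * page_size,   # offset of the first comment on the page
        'limit': page_size,          # comments per page
        'sort': 'new_score',
        'status': 'P',
    })
    return COMMENTS_URL + '?' + query
```

For example, `page_url(2)` asks for the comments starting at offset 40.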

Now let's parse the page:

import csv
import os
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

movie = 'https://movie.douban.com/subject/26100958/'
start_url = movie + 'comments'
csv_path = 'avengers_douban.csv'


def getContent(url):
    """Fetch one page of short comments and append them to the CSV file."""
    try:
        time.sleep(1)
        response = requests.get(url, headers=headers)
    except requests.RequestException as e:
        print('Request failed:', e)
        return None
    # Check the response status; 200 means the request succeeded
    if response.status_code != 200:
        print('Unexpected status code:', response.status_code)
        return None
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    comments = soup.select('.comment-item')

    # Write the header row only when the file is new or still empty
    need_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(['Rating', 'Comment'])
        for comment in comments:
            rating_tag = comment.find('span', class_='rating')
            content_tag = comment.find('p')
            # Skip comments that have no star rating or no text
            if rating_tag is None or content_tag is None:
                continue
            rating = rating_tag.get('title', '')
            content = content_tag.text.strip()
            print('Rating: ' + rating, 'Comment: ' + content)
            writer.writerow([rating, content])
    return len(comments)


getContent(start_url)

for i in range(1, 1048):
    num = i * 20
    nextPage = '?start=' + str(num) + '&limit=20&sort=new_score&status=P'
    # movie already ends with '/', so append 'comments' without another slash
    nextUrl = movie + 'comments' + nextPage
    print(i)
    getContent(nextUrl)
    time.sleep(5)

Some of the scraped reviews:

In the end, a little over 5,000 reviews were scraped. The word frequency results are:

(The word cloud inexplicably failed to load, with no error message.)
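
The post does not show how the word frequencies were computed, so the following is only a rough sketch of one way to do it with the standard library. `simple_tokenize` and `word_frequencies` are hypothetical names, and the regex tokenizer is a placeholder: it splits on punctuation and whitespace, which is not enough for Chinese text, where a word segmenter (for example `jieba.lcut`, if jieba is installed) would need to be passed in as `tokenize`.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Split text into lowercase word tokens (placeholder tokenizer)."""
    return re.findall(r'\w+', text.lower())


def word_frequencies(comments, tokenize=simple_tokenize, top_n=10):
    """Count tokens across all comments and return the top_n most common."""
    counter = Counter()
    for comment in comments:
        counter.update(tokenize(comment))
    return counter.most_common(top_n)
```

The resulting list of (word, count) pairs is the kind of input a word cloud library typically expects, usually after converting it to a dictionary.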