1: Movie Reviews
In this project, you'll be working with Jupyter notebook, and analyzing data on movie review scores. By the end, you'll have a notebook that you can add to your portfolio or build on top of on your own. If you need help at any point, you can consult our solution notebook here.
The dataset is stored in the fandango_score_comparison.csv file. It contains information on how major movie review services rated movies. The data originally came from FiveThirtyEight.
Here are the first few rows of the data, in CSV format:
Each row represents a single movie. Each column contains information about how the online movie review services RottenTomatoes, Metacritic, IMDB, and Fandango rated the movie. The dataset was put together to help detect bias in the movie review sites. Each of these sites has 2 types of score -- User scores, which aggregate user reviews, and Critic scores, which aggregate professional critical reviews of the movie. Each service puts its ratings on a different scale:
- RottenTomatoes -- 0-100, in increments of 1.
- Metacritic -- 0-100, in increments of 1.
- IMDB -- 0-10, in increments of .1.
- Fandango -- 0-5, in increments of .5.
Typically, the primary score shown by the sites will be the Critic score. Here are descriptions of some of the relevant columns in the dataset:
- FILM -- the name of the movie.
- RottenTomatoes -- the RottenTomatoes (RT) critic score.
- RottenTomatoes_User -- the RT user score.
- Metacritic -- the Metacritic critic score.
- Metacritic_User -- the Metacritic user score.
- IMDB -- the IMDB score given to the movie.
- Fandango_Stars -- the number of stars Fandango gave the movie.
To make it easier to compare scores across services, the columns were normalized so their scale and rounding matched the Fandango ratings. Any column with the suffix _norm is the corresponding column changed to a 0-5 scale. For example, RT_norm takes the RottenTomatoes column and turns it into a 0-5 scale from a 0-100 scale. Any column with the suffix _round is the rounded version of another column. For example, RT_user_norm_round rounds the RT_user_norm column to the nearest .5.
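The rescaling and rounding described above can be sketched in plain Python. This is an illustration only: the helper names normalize_to_five and round_to_half and the score of 74 are hypothetical, and the dataset's authors may have handled ties differently (Python's built-in round sends exact halves to the nearest even integer).

```python
def normalize_to_five(score, scale_max):
    """Rescale a score from a 0..scale_max range onto a 0..5 range."""
    return score * 5 / scale_max


def round_to_half(value):
    """Round to the nearest 0.5 by rounding twice the value to an integer."""
    return round(value * 2) / 2


# Hypothetical RottenTomatoes critic score of 74 on the 0-100 scale.
rt_norm = normalize_to_five(74, 100)
rt_norm_round = round_to_half(rt_norm)

print(rt_norm, rt_norm_round)
```

Rounding twice the value and halving the result is what snaps scores onto the same half-star grid that Fandango uses.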
Instructions
- Read the dataset into a Dataframe called movies using Pandas.
- You can output a Dataframe as a table by typing just the variable name containing the Dataframe in the last line of a Jupyter cell. Do this with movies and look over the table.
- If you're unfamiliar with RottenTomatoes, Metacritic, IMDB, or Fandango, visit the websites to get a better handle on their review methodology.
import pandas
movies = pandas.read_csv("fandango_score_comparison.csv")
2: Histograms
Now that you've read the dataset in, you can do some statistical exploration of the ratings columns. We'll primarily focus on the Metacritic_norm_round and the Fandango_Stars columns, which will let you see how Fandango and Metacritic differ in terms of review scores.
Instructions
- Enable plotting in Jupyter notebook with import matplotlib.pyplot as plt and run the following magic: %matplotlib inline.
- Create a histogram of the Metacritic_norm_round column.
- Create a histogram of the Fandango_Stars column.
- Look critically at both histograms, and write up any differences you see in a markdown cell.
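import matplotlib.pyplot as plt
%matplotlib inline
# Show each histogram on its own figure so the two distributions do not overlap.
plt.hist(movies["Fandango_Stars"])
plt.show()
plt.hist(movies["Metacritic_norm_round"])
plt.show()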
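To make clear what a histogram of a rating column tabulates, here is a minimal text-only sketch that counts how often each rating occurs. The ratings in the list are hypothetical values chosen for illustration, not rows from the dataset.

```python
from collections import Counter

# Hypothetical star ratings on the 0-5 scale, in steps of 0.5.
fandango_stars = [3.0, 3.5, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0]

# Count how many times each distinct rating appears.
counts = Counter(fandango_stars)

# Print one bar per rating, from lowest to highest.
for rating in sorted(counts):
    print(f"{rating:.1f} {'#' * counts[rating]}")
```

A plotted histogram shows these same counts as bar heights, which is why a column with few low ratings produces bars bunched toward the right.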
3: Mean, Median, And Standard Deviation
In the last screen, you may have noticed some differences between the Fandango and Metacritic scores. Metrics we've covered, including the mean, median, and standard deviation, allow you to quantify these differences. You can apply these metrics to the Fandango_Stars and Metacritic_norm_round columns to figure out how different they are.
Instructions
- Calculate the mean of both Fandango_Stars and Metacritic_norm_round.
- Calculate the median of both Fandango_Stars and Metacritic_norm_round.
- Calculate the standard deviation of both Fandango_Stars and Metacritic_norm_round. You can use the numpy.std method to find this.
- Print out all the values and look over them.
- Look at the review methodologies for Metacritic and Fandango. You can find the methodologies on their websites, or by using Google. Do you see any major differences? Write them up in a markdown cell.
- Write up the differences in numbers in a markdown cell, including the following:
  - Why would the median for Metacritic_norm_round be lower than the mean, but the median for Fandango_Stars is higher than the mean? Recall that the mean is usually larger than the median when there are a few large values in the data, and lower when there are a few small values.
  - Why would the standard deviation for Fandango_Stars be much lower than the standard deviation for Metacritic_norm_round?
  - Why would the mean for Fandango_Stars be much higher than the mean for Metacritic_norm_round?
import numpy
f_mean = movies["Fandango_Stars"].mean()
m_mean = movies["Metacritic_norm_round"].mean()
f_std = movies["Fandango_Stars"].std()
m_std = movies["Metacritic_norm_round"].std()
f_median = movies["Fandango_Stars"].median()
m_median = movies["Metacritic_norm_round"].median()
print(f_mean)
print(m_mean)
print(f_std)
print(m_std)
print(f_median)
print(m_median)
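Two points from this screen can be illustrated with the standard library: a few small values pull the mean below the median, and the standard deviation depends on which divisor is used. By default numpy.std divides by n, while the pandas Series.std method divides by n - 1, so the two can print slightly different numbers. The ratings below are hypothetical.

```python
import statistics

# Hypothetical ratings on a 0-5 scale, not taken from the dataset.
ratings = [1.0, 4.0, 4.5, 4.5, 5.0]

mean = statistics.mean(ratings)      # pulled down by the single low value
median = statistics.median(ratings)  # middle value, unaffected by the outlier

# Population standard deviation divides by n (the numpy.std default, ddof=0).
population_std = statistics.pstdev(ratings)
# Sample standard deviation divides by n - 1 (the pandas Series.std default, ddof=1).
sample_std = statistics.stdev(ratings)

print(mean, median)
print(population_std, sample_std)
```

With only one low rating in five, the mean already sits well below the median, which is the pattern the write-up question asks about.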
4: Scatter Plots
We know the ratings tend to differ, but we don't know which movies tend to be the largest outliers. You can find this by making a scatterplot, then looking at which movies are far away from the others.
You can also subtract the Fandango_Stars column from the Metacritic_norm_round column, take the absolute value, and sort movies based on the difference to find the movies with the largest differences between their Metacritic and Fandango ratings.
Instructions
- Make a scatterplot that compares the Fandango_Stars column to the Metacritic_norm_round column.
- Several movies appear to have low ratings in Metacritic and high ratings in Fandango, or vice versa. We can explore this further by finding the differences between the columns.
  - Subtract the Fandango_Stars column from the Metacritic_norm_round column, and assign to a new column, fm_diff, in movies.
  - Assign the absolute value of fm_diff to fm_diff. This will ensure that we don't only look at cases where Metacritic_norm_round is greater than Fandango_Stars.
    - You can calculate absolute values with the absolute function in NumPy.
  - Sort movies based on the fm_diff column, in descending order.
  - Print out the top 5 movies with the biggest differences between Fandango_Stars and Metacritic_norm_round.
plt.scatter(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
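The difference-and-sort steps from the instructions can be sketched without Pandas. The titles and scores below are hypothetical and exist only to show the mechanics of taking an absolute difference and ranking by it.

```python
# Hypothetical titles and scores, made up for illustration only.
movies = [
    {"FILM": "Film A", "Fandango_Stars": 4.5, "Metacritic_norm_round": 1.5},
    {"FILM": "Film B", "Fandango_Stars": 4.0, "Metacritic_norm_round": 3.5},
    {"FILM": "Film C", "Fandango_Stars": 3.0, "Metacritic_norm_round": 4.5},
    {"FILM": "Film D", "Fandango_Stars": 5.0, "Metacritic_norm_round": 3.0},
]

# The absolute difference treats "Fandango higher" and "Metacritic higher" alike.
for movie in movies:
    movie["fm_diff"] = abs(movie["Metacritic_norm_round"] - movie["Fandango_Stars"])

# Sort in descending order so the largest disagreements come first.
ranked = sorted(movies, key=lambda movie: movie["fm_diff"], reverse=True)

for movie in ranked[:2]:
    print(movie["FILM"], movie["fm_diff"])
```

Film C ranks above Film B even though Metacritic rated it higher than Fandango did, which is exactly what taking the absolute value is meant to allow.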
5: Correlations
Let's see what the correlation coefficient between Fandango_Stars and Metacritic_norm_round is. This will help you determine if Fandango consistently has higher scores than Metacritic, or if only a few movies were assigned higher ratings.
You can then create a linear regression to see what the predicted Fandango score would be based on the Metacritic score.
Instructions
- Calculate the r-value measuring the correlation between Fandango_Stars and Metacritic_norm_round using the scipy.stats.pearsonr function.
- The correlation is actually fairly low. Write up a markdown cell that discusses what this might mean.
- Use the scipy.stats.linregress function to create a linear regression line with Metacritic_norm_round as the x-values and Fandango_Stars as the y-values.
- Predict what a movie that got a 3.0 in Metacritic would get on Fandango using the line.
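# Absolute difference between the two ratings (the last step of the previous screen).
movies["fm_diff"] = numpy.abs(movies["Metacritic_norm_round"] - movies["Fandango_Stars"])
# DataFrame.sort was removed from Pandas; sort_values is the current method.
movies.sort_values("fm_diff", ascending=False).head(5)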
from scipy.stats import pearsonr
r_value, p_value = pearsonr(movies["Fandango_Stars"], movies["Metacritic_norm_round"])
r_value
from scipy.stats import linregress
slope, intercept, r_value, p_value, stderr_slope = linregress(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
pred = 3 * slope + intercept
pred
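For readers who want to see what the r-value, slope, and intercept represent, here is a minimal sketch of the textbook Pearson correlation and least-squares formulas in plain Python. It uses hypothetical data and is not how SciPy is implemented internally, though on the same inputs the numbers should agree up to floating-point error.

```python
import math

# Hypothetical paired scores: x plays the role of Metacritic, y of Fandango.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 3.5, 3.5, 4.5, 4.5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and of cross-deviations from the means.
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

# Pearson correlation coefficient.
r_value = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line y = slope * x + intercept.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predicted y for x = 3.0.
pred = slope * 3.0 + intercept

print(r_value, slope, intercept, pred)
```

Note that r measures how tightly the points follow a line, while the intercept captures a constant offset; one set of scores can sit consistently above the other regardless of how strong or weak r is.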
6: Finding Residuals
In the last screen, you created a linear regression for relating Metacritic_norm_round to Fandango_Stars. You can create a residual plot to better visualize how the line relates to the existing datapoints. This can help you see if two variables are linearly related or not.
Instructions
- Predict what a movie that got a 4.0 in Metacritic would get on Fandango using the line from the last screen.
- Make a scatter plot using the scatter function in matplotlib.pyplot.
- On top of the scatter plot, use the plot function in matplotlib.pyplot to plot a line using the predicted values for 3.0 and 4.0.
  - Set up the x values as the list [3.0, 4.0].
  - The y values should be a list with the corresponding predictions.
  - Pass in both x and y to plot to create a line.
- Show the plot.
import random
random.seed(1)
random_100 = [random.randint(0, 5) for _ in range(100)]
random_100_x = numpy.array(random_100)
random_100_y = random_100_x * slope + intercept
fig = plt.figure(figsize=(16, 16))
ax = fig.add_subplot(111)
ax.plot(random_100_x, random_100_y, c='r', label='Prediction')
ax.scatter(movies['Metacritic_norm_round'], movies['Fandango_Stars'], c='b', label='Real')
plt.legend(loc='upper left');
plt.xlabel('Metacritic_norm_round')
plt.ylabel('Fandango_Stars')
ax.set_xlim([-0.5, 5.5])
ax.set_ylim([-0.5, 5.5])
plt.show()
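The code above draws the fitted line over the data but does not compute the residuals themselves. A residual is the actual value minus the value the line predicts, and a minimal sketch looks like this. The slope, intercept, and data points are hypothetical (they are the least-squares fit of these five points, not values from the dataset).

```python
# Hypothetical fitted line and data, chosen for illustration only.
slope, intercept = 0.4, 2.6
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 3.5, 3.5, 4.5, 4.5]

# Value the line predicts for each x.
predicted = [slope * value + intercept for value in x]

# Residual = actual value minus predicted value.
residuals = [actual - fitted for actual, fitted in zip(y, predicted)]

for value, residual in zip(x, residuals):
    print(value, round(residual, 2))
```

Plotting the residuals against x and seeing them scatter evenly around zero suggests a linear relationship is reasonable; a visible curve or trend suggests it is not.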
7: Next Steps
That's it for the guided steps! We recommend exploring the data more on your own.
Here are some potential next steps:
- Explore the other rating services, IMDB and RottenTomatoes.
- See how they differ from each other.
- See how they differ from Fandango.
- See how user scores differ from critic scores.
- Acquire more recent review data, and see if the pattern of Fandango inflating reviews persists.
- Dig more into why certain movies had their scores inflated more than others.
We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.
You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.
Reposted from: https://my.oschina.net/Bettyty/blog/749694