Guided Project: Analyzing movie reviews

1: Movie Reviews

In this project, you'll work in a Jupyter notebook and analyze data on movie review scores. By the end, you'll have a notebook that you can add to your portfolio or build on top of on your own. If you need help at any point, you can consult our solution notebook here.

The dataset is stored in the fandango_score_comparison.csv file. It contains information on how major movie review services rated movies. The data originally came from FiveThirtyEight.

Here are the first few rows of the data, in CSV format:

[Image: first few rows of fandango_score_comparison.csv]

Each row represents a single movie. Each column contains information about how the online movie review services RottenTomatoes, Metacritic, IMDB, and Fandango rated the movie. The dataset was put together to help detect bias in the movie review sites. Each of these sites has 2 types of scores -- User scores, which aggregate user reviews, and Critic scores, which aggregate professional critical reviews of the movie. Each service puts its ratings on a different scale:

RottenTomatoes -- 0-100, in increments of 1.
Metacritic -- 0-100, in increments of 1.
IMDB -- 0-10, in increments of 0.1.
Fandango -- 0-5, in increments of 0.5.

Typically, the primary score shown by the sites will be the Critic score. Here are descriptions of some of the relevant columns in the dataset:

FILM -- the name of the movie.
RottenTomatoes -- the RottenTomatoes (RT) critic score.
RottenTomatoes_User -- the RT user score.
Metacritic -- the Metacritic critic score.
Metacritic_User -- the Metacritic user score.
IMDB -- the IMDB score given to the movie.
Fandango_Stars -- the number of stars Fandango gave the movie.

To make it easier to compare scores across services, the columns were normalized so their scale and rounding matched the Fandango ratings. Any column with the suffix _norm is the corresponding column changed to a 0-5 scale. For example, RT_norm takes the RottenTomatoes column and turns it into a 0-5 scale from a 0-100 scale. Any column with the suffix _round is the rounded version of another column. For example, RT_user_norm_round rounds the RT_user_norm column to the nearest 0.5.
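As a rough illustration of what the _norm and _round columns represent, the sketch below rescales a score to the 0-5 range and rounds it to the nearest 0.5. The helper names to_five_point and round_to_half are made up for this example, and the exact rule the dataset used for ties is not stated here, so treat this as an assumption and not as the method used to build the file.

```python
def to_five_point(score, scale_max):
    """Rescale a score from a 0..scale_max range to a 0..5 range."""
    return score * 5 / scale_max


def round_to_half(value):
    """Round to the nearest 0.5 (ties follow Python's built-in round)."""
    return round(value * 2) / 2


# Hypothetical example: a RottenTomatoes critic score of 74 out of 100.
rt_norm = to_five_point(74, 100)
rt_norm_round = round_to_half(rt_norm)
print(rt_norm, rt_norm_round)
```

The same two helpers would apply to an IMDB score by passing 10 as scale_max.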

Instructions

Read the dataset into a DataFrame called movies using Pandas.
You can output a DataFrame as a table by typing just the variable name containing the DataFrame in the last line of a Jupyter cell. Do this with movies and look over the table.
If you're unfamiliar with RottenTomatoes, Metacritic, IMDB, or Fandango, visit the websites to get a better handle on their review methodology.

import pandas
# Load the CSV; the last line displays the DataFrame as a table in Jupyter.
movies = pandas.read_csv("fandango_score_comparison.csv")
movies

2: Histograms

Now that you've read the dataset in, you can do some statistical exploration of the ratings columns. We'll primarily focus on the Metacritic_norm_round and the Fandango_Stars columns, which will let you see how Fandango and Metacritic differ in terms of review scores.

Instructions

Enable plotting in Jupyter notebook with import matplotlib.pyplot as plt and run the following magic: %matplotlib inline.
Create a histogram of the Fandango_Stars column.
Create a histogram of the Metacritic_norm_round column.
Look critically at both histograms, and write up any differences you see in a markdown cell.

import matplotlib.pyplot as plt
%matplotlib inline
# Show each histogram separately so the two distributions do not overlap.
plt.hist(movies["Fandango_Stars"])
plt.show()
plt.hist(movies["Metacritic_norm_round"])
plt.show()

3: Mean, Median, And Standard Deviation

In the last screen, you may have noticed some differences between the Fandango and Metacritic scores. Metrics we've covered, including the mean, median, and standard deviation, allow you to quantify these differences. You can apply these metrics to the Fandango_Stars and Metacritic_norm_round columns to figure out how different they are.

Instructions

Calculate the mean of both Fandango_Stars and Metacritic_norm_round.
Calculate the median of both Fandango_Stars and Metacritic_norm_round.
Calculate the standard deviation of both Fandango_Stars and Metacritic_norm_round. You can use the numpy.std function to find this.
Print out all the values and look over them.
Look at the review methodologies for Metacritic and Fandango. You can find the methodologies on their websites, or by using Google. Do you see any major differences? Write them up in a markdown cell.
Write up the differences in numbers in a markdown cell, including the following:
- Why would the median for Metacritic_norm_round be lower than the mean, but the median for Fandango_Stars be higher than the mean? Recall that the mean is usually larger than the median when there are a few large values in the data, and lower when there are a few small values.
- Why would the standard deviation for Fandango_Stars be much lower than the standard deviation for Metacritic_norm_round?
- Why would the mean for Fandango_Stars be much higher than the mean for Metacritic_norm_round?

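import numpy

f_mean = movies["Fandango_Stars"].mean()
m_mean = movies["Metacritic_norm_round"].mean()
# The pandas .std() method computes the sample standard deviation (ddof=1),
# while numpy.std defaults to the population version (ddof=0).
f_std = movies["Fandango_Stars"].std()
m_std = movies["Metacritic_norm_round"].std()
f_median = movies["Fandango_Stars"].median()
m_median = movies["Metacritic_norm_round"].median()

# Label each value so the two services are easy to compare.
print("Fandango mean:", f_mean)
print("Metacritic mean:", m_mean)
print("Fandango std:", f_std)
print("Metacritic std:", m_std)
print("Fandango median:", f_median)
print("Metacritic median:", m_median)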
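To make the mean-versus-median reasoning from the instructions concrete, here is a small sketch on made-up ratings (not taken from the dataset) using only Python's statistics module. The list names left_skewed and right_skewed are invented for the example.

```python
import statistics

# Made-up ratings, not taken from the dataset.
# A few low values pull the mean below the median.
left_skewed = [4.5, 4.5, 4.0, 4.0, 4.5, 5.0, 1.0, 1.5]
# A few high values pull the mean above the median.
right_skewed = [1.0, 1.5, 1.0, 2.0, 1.5, 1.0, 4.5, 5.0]

for name, values in [("left_skewed", left_skewed), ("right_skewed", right_skewed)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    # pstdev is the population standard deviation, matching numpy.std's default.
    spread = statistics.pstdev(values)
    print(name, "mean:", round(mean, 2), "median:", median, "std:", round(spread, 2))
```

In the first list the mean falls below the median, and in the second it rises above it, which is the pattern the questions above ask you to explain for the two rating columns.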

4: Scatter Plots

We know the ratings tend to differ, but we don't know which movies tend to be the largest outliers. You can find this by making a scatterplot, then looking at which movies are far away from the others.

You can also subtract the Fandango_Stars column from the Metacritic_norm_round column, take the absolute value, and sort movies based on the difference to find the movies with the largest differences between their Metacritic and Fandango ratings.

Instructions

Make a scatterplot that compares the Fandango_Stars column to the Metacritic_norm_round column.
Several movies appear to have low ratings in Metacritic and high ratings in Fandango, or vice versa. We can explore this further by finding the differences between the columns.
- Subtract the Fandango_Stars column from the Metacritic_norm_round column, and assign to a new column, fm_diff, in movies.
- Assign the absolute value of fm_diff to fm_diff. This will ensure that we don't only look at cases where Metacritic_norm_round is greater than Fandango_Stars.
  - You can calculate absolute values with the absolute function in NumPy.
- Sort movies based on the fm_diff column, in descending order.
- Print out the top 5 movies with the biggest differences between Fandango_Stars and Metacritic_norm_round.

plt.scatter(movies["Metacritic_norm_round"], movies["Fandango_Stars"])

5: Correlations

Let's see what the correlation coefficient between Fandango_Stars and Metacritic_norm_round is. This will help you determine if Fandango consistently has higher scores than Metacritic, or if only a few movies were assigned higher ratings.

You can then create a linear regression to see what the predicted Fandango score would be based on the Metacritic score.

Instructions

Calculate the r-value measuring the correlation between Fandango_Stars and Metacritic_norm_round using the scipy.stats.pearsonr function.
The correlation is actually fairly low. Write up a markdown cell that discusses what this might mean.
Use the scipy.stats.linregress function to create a linear regression line with Metacritic_norm_round as the x-values and Fandango_Stars as the y-values.
Predict what a movie that got a 3.0 in Metacritic would get on Fandango using the line.

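# Absolute difference between the Metacritic and Fandango ratings (from the scatter plot screen).
movies["fm_diff"] = numpy.abs(movies["Metacritic_norm_round"] - movies["Fandango_Stars"])
# Sort by the difference, largest first, and show the top 5 movies.
movies.sort_values("fm_diff", ascending=False).head(5)

from scipy.stats import pearsonr
# pearsonr returns the correlation coefficient and its p-value.
r_value, p_value = pearsonr(movies["Fandango_Stars"], movies["Metacritic_norm_round"])
r_value

from scipy.stats import linregress
slope, intercept, r_value, p_value, stderr_slope = linregress(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
# Predicted Fandango rating for a movie that got a 3.0 in Metacritic.
pred = 3 * slope + intercept
pred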
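If it helps to see what the r-value measures, the sketch below computes a Pearson correlation by hand on made-up scores (not from the dataset). The function name pearson_r and the toy lists are assumptions for illustration; in the project itself you would keep using scipy.stats.pearsonr.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores, not from the dataset.
toy_metacritic = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_fandango = [3.5, 3.5, 4.0, 4.0, 4.5]
toy_r = pearson_r(toy_metacritic, toy_fandango)

# Adding a constant to every rating leaves the correlation unchanged.
shifted_r = pearson_r(toy_metacritic, [y + 1.0 for y in toy_fandango])
print(round(toy_r, 3), round(shifted_r, 3))
```

Because a constant shift does not change r, a service that rated every movie higher by the same amount would still correlate perfectly with the other. A low r-value therefore suggests the two sets of ratings differ in more than just a fixed offset.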

6: Finding Residuals

In the last screen, you created a linear regression for relating Metacritic_norm_round to Fandango_Stars. You can create a residual plot to better visualize how the line relates to the existing datapoints. This can help you see if two variables are linearly related or not.

Instructions

Predict what a movie that got a 4.0 in Metacritic would get on Fandango using the line from the last screen.
Make a scatter plot using the scatter function in matplotlib.pyplot.
On top of the scatter plot, use the plot function in matplotlib.pyplot to plot a line using the predicted values for 3.0 and 4.0.
- Set up the x values as the list [3.0, 4.0].
- The y values should be a list with the corresponding predictions.
- Pass in both x and y to plot to create a line.
Show the plot.

import random

# Sample 100 integer x values in [0, 5] and compute the line's prediction at each one.
random.seed(1)
random_100 = [random.randint(0, 5) for _ in range(100)]
random_100_x = numpy.array(random_100)
random_100_y = random_100_x * slope + intercept

fig = plt.figure(figsize=(16, 16))
ax = fig.add_subplot(111)
# Regression line in red, actual movie ratings in blue.
ax.plot(random_100_x, random_100_y, c='r', label='Prediction')
ax.scatter(movies['Metacritic_norm_round'], movies['Fandango_Stars'], c='b', label='Real')
plt.legend(loc='upper left')
plt.xlabel('Metacritic_norm_round')
plt.ylabel('Fandango_Stars')
ax.set_xlim([-0.5, 5.5])
ax.set_ylim([-0.5, 5.5])
plt.show()
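The residuals themselves are the gaps between each observed rating and the line's prediction. As a minimal sketch on made-up scores (not from the dataset), with the hypothetical helper fit_line standing in for scipy.stats.linregress, they can be computed like this:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up scores, not from the dataset.
toy_metacritic = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_fandango = [3.5, 3.5, 4.0, 4.0, 4.5]

toy_slope, toy_intercept = fit_line(toy_metacritic, toy_fandango)
toy_predicted = [toy_slope * x + toy_intercept for x in toy_metacritic]
# Residual = observed value minus the value predicted by the line.
toy_residuals = [y - p for y, p in zip(toy_fandango, toy_predicted)]
print(toy_slope, toy_intercept)
print([round(res, 2) for res in toy_residuals])
```

Plotting residuals like these against the x values is one way to check for a linear relationship: if the points scatter evenly around zero with no visible pattern, a straight line is a reasonable description of the data.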

7: Next Steps

That's it for the guided steps! We recommend exploring the data more on your own.

Here are some potential next steps:

Explore the other rating services, IMDB and RottenTomatoes.
- See how they differ from each other.
- See how they differ from Fandango.
See how user scores differ from critic scores.
Acquire more recent review data, and see if the pattern of Fandango inflating reviews persists.
Dig more into why certain movies had their scores inflated more than others.

We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.

You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.

Reposted from: https://my.oschina.net/Bettyty/blog/749694