Movies always leave a lasting impression on me. From big blockbusters to indie films, each movie
brings a unique and enjoyable experience. I am interested in movie-related datasets because there are many
relationships and patterns between movies and the data recorded about them. There
are countless questions that can be asked of such datasets. Some of them compare the popularity of
movies against certain review scores. I wonder whether certain MPAA ratings like G, R, or NC-17 affect the
score of a film or its gross at the box office. How well have movies and rating groups done in terms of
gross and scores over the years?
Note: The selection from this dataset consists of movies with a country entry of 'USA', non-null
MPAA content ratings, and year entries from 1916 to 2016. The original IMDb dataset, which
consists of around 5000 records, can be found
here. The massaged dataset with
the previous conditions applied can be seen here, and the original dataset
can be seen here. This selection does not represent
all movies and is just a small sample. The findings may not hold if the sample
included all movies produced.
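The three selection conditions in the note above could be reproduced roughly as follows. This is only a sketch: the column names (country, content_rating, title_year) and the sample rows are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumed, not confirmed by the source.
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "country": ["USA", "UK", "USA", "USA"],
    "content_rating": ["PG", "R", None, "R"],
    "title_year": [1999, 2005, 2010, 2016],
})

def select_movies(df):
    """Keep US movies with a non-null content rating released from 1916 to 2016."""
    mask = (
        (df["country"] == "USA")
        & df["content_rating"].notna()
        & df["title_year"].between(1916, 2016)  # inclusive on both ends
    )
    return df[mask].reset_index(drop=True)

selection = select_movies(movies)
print(selection["movie_title"].tolist())
```

In this toy input, movie B is dropped for its country and movie C for its missing content rating.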
Above is the distribution of movies in my dataset grouped by year of release. There are few movies from before the 1970s and 1980s, when blockbusters started dominating the cinema landscape. Since the original dataset was collected in late 2016, the number of films from that year is significantly lower than in the preceding years.
Similar to the previous visualization, this is the distribution of movies grouped by IMDb score as voted by IMDb users. Unlike the year distribution, the IMDb score distribution follows a smooth curve that peaks at a 6.7 rating. Importantly, the number of films at each rating decreases toward both ends of the distribution. There is one film each at the two highest ratings: The Shawshank Redemption at 9.3 and The Godfather at 9.2. (Side Note: The Shawshank Redemption is currently at 9.2 and tied with The Godfather as of 2017.)
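A distribution like this one comes down to counting films per score. Here is a minimal sketch with made-up scores, where the series name imdb_score is an assumed column name:

```python
import pandas as pd

# Hypothetical scores, chosen only to illustrate the counting step.
scores = pd.Series([6.7, 6.7, 6.7, 7.1, 7.1, 5.9, 9.3], name="imdb_score")

# Number of films at each score, ordered by score for plotting.
counts = scores.value_counts().sort_index()

# Score with the most films (the peak of the distribution).
peak = counts.idxmax()
print(counts)
print("peak score:", peak)
```

Scores with a count of 1, like the 9.3 here, are the sparse tails that matter for the later charts.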
This pie chart shows the distribution of movies grouped by MPAA rating. The majority of films have a content rating of either PG, PG-13, or R. This can be explained by the notion that movies released in theaters today are mostly rated G, PG, PG-13, or R, with the majority of them being the last two. Movies that are rated NC-17 or Unrated usually do not get released in theaters and are instead shown at festivals or released straight to DVD or streaming services.
Movie budget is compared to box office gross in the United States in the above scatterplot. Many films are made with a budget under $100,000,000, and a good majority of them make their money back, with a good number grossing double or triple their budget. While there are some outliers for massive blockbusters, some errors in the data can also be seen where the gross or budget of a film is $0. As noted by the author of the dataset scraper, not all values could be retrieved for some fields, so the missing ones are filled with $0.
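One way to keep the $0 placeholders from skewing a budget-to-gross comparison is to drop those rows before computing any ratio. The sketch below assumes columns named budget and gross and uses invented figures. It is not the cleaning applied to the charts here.

```python
import pandas as pd

# Hypothetical figures; a value of 0 stands in for "could not be retrieved".
films = pd.DataFrame({
    "budget": [50_000_000, 0, 20_000_000, 100_000_000],
    "gross": [120_000_000, 30_000_000, 0, 90_000_000],
})

# Keep only rows where both fields hold a real (positive) value.
cleaned = films[(films["budget"] > 0) & (films["gross"] > 0)].copy()

# Gross as a multiple of budget: 1.0 means the film made its budget back.
cleaned["gross_to_budget"] = cleaned["gross"] / cleaned["budget"]

# Share of the remaining films that at least recouped their budget.
recouped_share = float((cleaned["gross_to_budget"] >= 1).mean())
print(cleaned)
print("recouped share:", recouped_share)
```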
The heatmap shows the summed gross of movies grouped by IMDb score and MPAA rating. The summed gross is mapped to a color scale running from yellow to blue, where yellow is the lowest and blue is the highest. Since it is the summed gross, the highest values mostly occur in the groupings that contain the most movies. The ratings PG, PG-13, and R have the highest summed gross, which can be attributed to the earlier explanation: these are the ratings most commonly released in theaters, where films generate income and have the opportunity to become blockbuster hits.
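The grid behind a heatmap like this can be sketched as a pivot table of summed gross, with a matching table of film counts to show how closely the sums track group size. The column names and numbers below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; gross values are arbitrary units.
films = pd.DataFrame({
    "imdb_score": [6.7, 6.7, 7.2, 7.2, 6.7],
    "content_rating": ["PG-13", "R", "PG-13", "PG-13", "PG-13"],
    "gross": [100, 40, 60, 30, 50],
})

# One cell per (rating, score) pair holding the summed gross; empty cells become 0.
summed = films.pivot_table(
    index="content_rating",
    columns="imdb_score",
    values="gross",
    aggfunc="sum",
    fill_value=0,
)

# Number of films in each cell, for comparison with the sums above.
counts = pd.crosstab(films["content_rating"], films["imdb_score"])

print(summed)
print(counts)
```

Dividing summed by counts would give an average gross per cell, which removes the effect of group size when that is the question of interest.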
In the series chart above, each line represents the average IMDb score of a specific MPAA rating in each year. One thing to notice in this visualization is that some ratings, such as Approved and Passed, were used in the past and later discontinued. The average IMDb score appears to trend downward over the years. There are several possible reasons for this, and the trend may not hold for all movies. I believe that in this dataset the earlier films act as outliers: only a few are included, and those tend to be well received or popular. Another reason could be that IMDb has existed since the early days of the internet. Its user base is most likely about as old as the internet, so many of these users may rate only films made after 1990 or so. This in turn causes a slight decrease in the average rating.
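To check whether the early averages rest on only a handful of films, the per-year, per-rating mean can be computed alongside a film count. This is a rough sketch with invented rows and assumed column names (title_year, content_rating, imdb_score).

```python
import pandas as pd

# Hypothetical rows: one early film and several recent ones.
films = pd.DataFrame({
    "title_year": [1940, 2010, 2010, 2010, 2010],
    "content_rating": ["Approved", "R", "R", "PG", "PG"],
    "imdb_score": [8.5, 6.0, 7.0, 5.5, 6.5],
})

# Average score and number of films for each (year, rating) group.
yearly = films.groupby(["title_year", "content_rating"])["imdb_score"].agg(["mean", "count"])

# One column per rating, one row per year: the shape a series chart needs.
lines = yearly["mean"].unstack("content_rating")

print(yearly)
print(lines)
```

A group with a count of 1, like the 1940 row here, is an average of a single film, so a high value there says little about that year as a whole.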
Above is a line chart with two lines, one for average domestic box office gross and one for the average
number of IMDb votes, for movies grouped by IMDb rating. I wanted to look at the popularity of a
movie as measured by the number of people who paid to see it, shown by the gross line, and by the number of
people who rated it on IMDb, presumably some time after seeing it. The average gross seems to
increase as the IMDb score goes higher. This can be explained by people wanting
to spend their money on watching good movies. The exceptions are the drastic outliers at both ends of the
chart. The lowest score on the graph is a 1.6 for the infamous hit movie Justin Bieber: Never Say Never, while
at the other end is the critically acclaimed but non-blockbuster The Shawshank Redemption. These outliers are
due solely to the fact that so few movies are near these ratings. The sudden
increase in IMDb votes beyond IMDb scores of 8.0 is likely because these movies start to appear on the IMDb Top
250 Movie list.
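The two lines can be sketched as a single group-by on IMDb score, with a film count added so that single-film scores stand out. The column names (imdb_score, gross, num_voted_users) and all values below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; none of these figures come from the dataset.
films = pd.DataFrame({
    "imdb_score": [1.6, 6.5, 6.5, 8.1, 8.1, 9.3],
    "gross": [70, 40, 60, 90, 110, 30],
    "num_voted_users": [50, 100, 140, 400, 600, 1500],
})

# Average gross, average vote count, and number of films at each score.
by_score = films.groupby("imdb_score").agg(
    avg_gross=("gross", "mean"),
    avg_votes=("num_voted_users", "mean"),
    n_films=("gross", "size"),
)

# Scores backed by a single film: their "average" is just that one film's value.
single_film_scores = by_score[by_score["n_films"] == 1].index.tolist()

print(by_score)
print(single_film_scores)
```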
From the visualizations overall, I have found that even though this dataset selection is somewhat flawed by
some errors and does not represent all movies ever made, it does provide
some useful information about movies released in theaters, their popularity and revenue numbers,
and their relation to ratings by the MPAA and scores by users on IMDb.