Building our own recommendation systems with the TMDB 5000 movies dataset
Objectives of this Tutorial
Here are some objectives for you:
- Learn what recommendation systems are, how they work, and some of their different flavors
- Implement a few recommendation systems using Python and the TMDB 5000 movies dataset
What are Recommendation Systems?
A recommendation system (also commonly referred to as a recommendation/recommender engine/platform) seeks to predict a user’s interest in available items (songs on Spotify, for example) and give recommendations accordingly. There are two primary types of recommendation systems:
- Content-based filtering systems make recommendations based on the characteristics of the items themselves. So if a Netflix user has been binging sci-fi movies, Netflix would be quicker to recommend another sci-fi movie over a romantic comedy. We’ll implement this recommendation system in Python.
- Collaborative filtering systems make recommendations based on user interactions. Let’s say that we both bought an electric guitar on Amazon and that I also bought an amp. Then Amazon would predict that you’d also be interested in that amp and would recommend it to you.
Credit to Ibtesam Ahmed for her Kaggle kernel on this dataset. This article is designed to follow her tutorial in a Medium-stylized format.
Building a Basic Recommendation System
As always, we’ll import the necessary packages and the datasets first:
Those two print statements give us the following output:
- credits: (4803, 4)
- movies_incomplete: (4803, 20)
So we’re working with 4,803 movies. Notice that our data are split into two dataframes right now. Refer to this gist to see how to combine and clean up the dataframes. It might be easiest to keep this gist open while following the tutorial.
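As a rough sketch of the combining step, here is one way the two dataframes could be merged. The key names (movie_id in credits, id in movies_incomplete) and the CSV file names in the comment are assumptions about the Kaggle files, and the tiny stand-in frames exist only so the snippet runs on its own. The gist may clean the data differently.

```python
import pandas as pd

def combine_frames(credits: pd.DataFrame, movies_incomplete: pd.DataFrame) -> pd.DataFrame:
    """Merge cast/crew information onto the movie metadata (hypothetical helper)."""
    # Assumption: credits identifies movies with "movie_id" and also repeats "title".
    credits_renamed = credits.rename(columns={"movie_id": "id"}).drop(columns=["title"])
    return movies_incomplete.merge(credits_renamed, on="id", how="left")

# With the real data you would load the CSVs first (file names are assumptions):
# credits = pd.read_csv("tmdb_5000_credits.csv")
# movies_incomplete = pd.read_csv("tmdb_5000_movies.csv")

# Tiny stand-in frames so the sketch is runnable without the dataset.
credits_demo = pd.DataFrame(
    {"movie_id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"], "crew": ["[]", "[]"]}
)
movies_demo = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "overview": ["x", "y"]})

movies_clean_demo = combine_frames(credits_demo, movies_demo)
print(movies_clean_demo.shape)
```

A left merge keeps every row of the movie metadata even if a movie has no matching credits row.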
We’ll start with two very basic recommendation systems: we’ll recommend the user a list of the highest rated movies and another list of the most popular movies. But first we’ll want to compute a weighted rating for each movie, so that an average rating (the vote_average values) backed by only a handful of votes doesn’t outrank one backed by thousands. Following Ibtesam’s lead, we’ll use the formula IMDb (formerly) used to calculate weighted ratings for movies.
Here’s one example of how to get the weighted averages:
Here m is the vote-count threshold in the formula. I passed 0.70 to quantile() so that m equals the 70th percentile of vote counts, meaning I was concerned only with movies that received at least as many votes as 70% of the movies in our dataset. The choice of m is somewhat arbitrary, so do experiment with it.
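For reference, here is a minimal sketch of the weighted rating under the usual statement of the IMDb formula, WR = (v / (v + m)) * R + (m / (v + m)) * C, where v is a movie’s vote_count, R its vote_average, C the mean vote_average across all movies, and m the threshold described above. The function name and the toy numbers are made up for illustration, and the original code may differ (for example, it may filter out movies with fewer than m votes first).

```python
import pandas as pd

def add_weighted_average(movies: pd.DataFrame, quantile: float = 0.70) -> pd.DataFrame:
    """Return a copy of movies with a weighted_average column (hypothetical helper)."""
    v = movies["vote_count"]
    R = movies["vote_average"]
    C = R.mean()              # mean rating across the whole dataset
    m = v.quantile(quantile)  # vote-count threshold; 0.70 follows the article
    out = movies.copy()
    # Movies with few votes are pulled toward C; heavily voted movies stay near R.
    out["weighted_average"] = (v / (v + m)) * R + (m / (v + m)) * C
    return out

# Toy numbers, not taken from the dataset.
toy = pd.DataFrame(
    {
        "title": ["Few Votes", "Many Votes", "Mid Votes"],
        "vote_average": [9.0, 8.0, 5.0],
        "vote_count": [10, 1000, 500],
    }
)
toy_weighted = add_weighted_average(toy)
print(toy_weighted[["title", "vote_average", "weighted_average"]])
```

In the toy data, the movie rated 9.0 on only 10 votes ends up close to the overall mean, which is exactly the shrinkage the formula is meant to provide.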
Now we’re ready for our first recommendation system. Let’s recommend ten movies with the highest weighted average ratings:
And we get this lovely graph of our highest rated picks:
We see that our inaugural system recommended some classics. But what if we want to recommend movies that are popular among TMDB users?
We can use the popularity feature of our data to recommend movies based on popularity instead:
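Both of these simple recommenders boil down to sorting by one column and keeping the top rows. A small sketch, assuming the dataframe has title, weighted_average, and popularity columns (top_n is a hypothetical helper and the toy values are invented):

```python
import pandas as pd

def top_n(movies: pd.DataFrame, by: str, n: int = 10) -> pd.DataFrame:
    """Return the n rows with the largest values in the column `by`."""
    return movies.sort_values(by, ascending=False).head(n)[["title", by]]

# Invented values, just to show the two rankings can disagree.
toy = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "weighted_average": [7.1, 8.2, 6.5],
        "popularity": [300.0, 40.0, 120.0],
    }
)
top_rated = top_n(toy, "weighted_average", n=2)
most_popular = top_n(toy, "popularity", n=2)
print(top_rated)
print(most_popular)
```

With the real data, n=10 would give the ten picks that the charts display.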
And now we can see our recommendations based on popularity scores:
Ah, just as we expected: a standout performance from Minions. Now what if we want to recommend movies based on their weighted average ratings and their popularity scores?
In order to avoid the colossal popularity score of Minions skewing our new scoring system, I normalized the values in both the weighted_average and popularity columns. I decided to go with a 50/50 split between the scaled weighted average rating and popularity scores, but again don’t be afraid to experiment with this split:
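One way to sketch the scaling and blending step is with plain pandas min-max scaling, which maps each column onto the range 0 to 1. Whether the original code used this or a library scaler is an assumption on my part, and the helper name and toy values below are illustrative only.

```python
import pandas as pd

def add_blended_score(movies: pd.DataFrame, rating_weight: float = 0.5) -> pd.DataFrame:
    """Add min-max scaled columns and a blended score (hypothetical helper)."""
    out = movies.copy()
    for col in ("weighted_average", "popularity"):
        lo, hi = out[col].min(), out[col].max()
        # Assumes the column is not constant; otherwise hi - lo would be zero.
        out[col + "_scaled"] = (out[col] - lo) / (hi - lo)
    # rating_weight=0.5 reproduces the 50/50 split described above.
    out["score"] = (
        rating_weight * out["weighted_average_scaled"]
        + (1 - rating_weight) * out["popularity_scaled"]
    )
    return out

# Invented values: Movie B has an outsized popularity score.
toy = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "weighted_average": [8.0, 7.0, 6.0],
        "popularity": [20.0, 875.0, 100.0],
    }
)
blended = add_blended_score(toy)
print(blended.sort_values("score", ascending=False)[["title", "score"]])
```

Changing rating_weight to, say, 0.7 would favor ratings over popularity, which is one easy way to experiment with the split.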
Now that we have a new score column that takes into account a movie’s weighted average rating and its popularity score, we can see what movies our recommender system will offer us:
And here are our recommendations based on my 50/50 split:
These recommenders worked as intended, but we can certainly improve. Now we’ll have to turn to content-based filtering.
So now we’re interested in using the characteristics of a movie in order to recommend other movies to the user. Again following Ibtesam’s example, we’ll now make recommendations based on the movies’ plot summaries given in the overview column. So if our user gives us a movie title, our goal is to recommend movies with similar plot summaries.
Word Vectorization and TF-IDF
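Before we can begin any analysis on the plot summaries, we’ll have to convert the text in the overview column to word vectors. We’ll do that by fitting a TF-IDF vectorizer on overview, which weights each word by how often it appears in a summary and how rare it is across all summaries: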
And we receive the following output:
So about 10,000 unique words were used in the plot summaries to describe our 5,000 movies (note that this figure is smaller than Ibtesam’s because I increased the minimum word frequency to 3 with min_df=3). If you’re interested in more, I talk about TF-IDF in this article, too.
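Here is a minimal sketch of the vectorization step using scikit-learn’s TfidfVectorizer. Only min_df=3 comes from the text above. Filling missing overviews with empty strings and removing English stop words are assumptions, and the toy corpus uses min_df=1 because it is far too small for a threshold of 3.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf(overviews: pd.Series, min_df: int = 3):
    """Fit a TF-IDF vectorizer on plot summaries (hypothetical helper)."""
    # Assumption: missing overviews are treated as empty strings.
    text = overviews.fillna("")
    # min_df=3 drops words that appear in fewer than three summaries.
    vectorizer = TfidfVectorizer(min_df=min_df, stop_words="english")
    matrix = vectorizer.fit_transform(text)
    return vectorizer, matrix

# Toy summaries, not taken from the dataset.
toy_overviews = pd.Series(
    [
        "A retired spy must rescue his kidnapped family",
        "Two kids become spies to rescue their kidnapped parents",
        None,
        "A chef opens a restaurant in Paris",
    ]
)
vectorizer, tfidf_matrix = build_tfidf(toy_overviews, min_df=1)
# Rows are movies, columns are the words kept in the vocabulary.
print(tfidf_matrix.shape)
```

On the real overview column, the second number in the printed shape is the vocabulary size discussed above.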
Calculating Similarity Scores
Now that we have a TF-IDF matrix of our plot summaries, we can begin calculating similarity scores between every pair of movies. This metric will help us pick out movies with plot summaries similar to the movie submitted by the user. Ibtesam opted for the linear kernel, but I wanted to experiment with the sigmoid kernel for fun. Luckily, I arrived at similar results:
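The similarity step and the lookup could be sketched as follows, using scikit-learn’s sigmoid_kernel with its default parameters. The recommend helper, the assumption that titles are unique, and the toy movies are all mine for illustration. The original function may be structured differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

def recommend(title: str, movies: pd.DataFrame, similarity, n: int = 10) -> list:
    """Return the n titles most similar to `title` (hypothetical helper)."""
    # Assumption: titles are unique, so each title maps to one row position.
    positions = pd.Series(range(len(movies)), index=movies["title"])
    idx = positions[title]
    # Sort row positions from most to least similar, then skip the movie itself.
    order = similarity[idx].argsort()[::-1]
    order = [i for i in order if i != idx][:n]
    return movies["title"].iloc[order].tolist()

# Toy movies, not taken from the dataset.
toy = pd.DataFrame(
    {
        "title": ["Spy Family", "Junior Agents", "Paris Kitchen", "Space Run"],
        "overview": [
            "A retired spy must rescue his kidnapped family",
            "Two kids become spies to rescue their kidnapped parents",
            "A chef opens a restaurant in Paris",
            "Astronauts race to repair a damaged station",
        ],
    }
)
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
# Pairwise similarity between every pair of summaries (square matrix).
similarity = sigmoid_kernel(tfidf_matrix, tfidf_matrix)
picks = recommend("Spy Family", toy, similarity, n=1)
print(picks)
```

Swapping sigmoid_kernel for linear_kernel from the same module would give the variant Ibtesam used.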
So now that we’ve constructed our content-based filtering system, let’s test it out with a timeless favorite, Spy Kids:
And here are our recommendations per the content-based filtering system:
So our recommendation system gave us some picks related to Spy Kids, but a few missteps such as In Too Deep and Escobar: Paradise Lost slipped in.
Based on our results above, we can see that our content-based filtering system has some limitations:
- Our recommender picked some movies that would probably be deemed inappropriate by a user searching for titles related to Spy Kids. To improve our system, we could consider replacing TF-IDF with word counts, and we could also explore other similarity scores.
- Our system only considers the plot summaries of each movie as it stands now. If we, like Ibtesam, consider other features such as the cast members, the director, and genre, we’ll probably improve in finding related movies.
- Our current system only recommends movies based on similarities in characteristics. So our recommender is missing movies in other genres that the user might enjoy. We’d need to try collaborative filtering to solve this, but our dataset didn’t include user information.
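As a starting point for the first suggestion, here is a small sketch that swaps TF-IDF for raw word counts and uses cosine similarity as the score. This is only one possible variation, the toy summaries are invented, and I have not checked whether it actually improves the Spy Kids results.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy summaries, not taken from the dataset.
toy_overviews = pd.Series(
    [
        "A retired spy must rescue his kidnapped family",
        "Two kids become spies to rescue their kidnapped parents",
        "A chef opens a restaurant in Paris",
    ]
)

# Raw word counts instead of TF-IDF weights.
count_matrix = CountVectorizer(stop_words="english").fit_transform(toy_overviews)
# Cosine similarity compares the direction of the count vectors, ignoring length.
count_similarity = cosine_similarity(count_matrix)
print(count_similarity.round(2))
```

The resulting square matrix can be passed to the same kind of lookup function as any other similarity matrix.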
To sum up, we covered the following:
- What recommender systems are, how they work, and some of the different types
- How to implement very basic recommender systems based on weighted average ratings, popularity, and a blend of the two
- How to create a content-based filtering system and how to recognize the limitations of content-based recommendations alone