Building a Hybrid Recommendation System
Introduction
I am writing this article to record every decision I made while working on my Recommender Systems (RS) class task, and to document my code. The Colab notebooks and other materials are available here in Google Drive.
Problem Definition
The task is to implement a recommendation system that combines Collaborative Filtering (which has to be Matrix Factorization) and Content-Based Filtering (which could be anything).
The dataset I used in this experiment is the MovieLens dataset, with about 600 users, 9,000 movies, and 100K ratings.
Solution Idea
Since I had no prior experience with hybrid RS, I decided to look for papers that implement a hybrid of CF and CBF. I found a survey by Çano et al. [1]. From that paper, I learned that there are other ways to implement a recommender system, such as demographic and knowledge-based approaches, which would be interesting to explore later. For this article, I will try to implement the framework described by Zhao et al. [2].
Conventional CF methods estimate the ratings of items that the target user has not rated yet, based on the rating history. According to the paper, focusing only on optimizing a ranking target may lose the semantic information of the recommendation scenario. The framework instead tries to simulate the steps by which users generate their data. The authors call this the two-step recommendation framework.
The authors treat a recommendation task as a ranking task. The ranking score of every user-item pair is defined as the probability that the user will rate the item, multiplied by the predicted rating itself.
For this task, I will try to imitate what Zhao et al. [2] did. But because my task requires me to use the movie features, I will modify the first step. I will explain this along the way.
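Written out as a formula, the score looks roughly like this. The notation is my own shorthand for this article, not necessarily the notation used in the paper:

$$\mathrm{score}(u, i) = P(\text{rated} \mid u, i) \times \hat{r}_{u,i}$$

Here $u$ is a user, $i$ is an item, $P(\text{rated} \mid u, i)$ is the probability that $u$ will rate $i$, and $\hat{r}_{u,i}$ is the predicted rating. For example, a movie with a rating probability of 0.5 and a predicted rating of 4 gets a score of $0.5 \times 4 = 2$.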
Preprocessing
This is what I process in the Movies.csv dataset:
- Count Vectorize “Genres”
- Count Vectorize “Title”
- TF-IDF Vectorize “Tags”
I chose not to use TF-IDF on “Genres” and “Title” because I think it does not suit them: the IDF part makes the most common terms less impactful than the other terms, and for these two fields I want common terms to count just as much. This weighting does suit the “Tags” terms well, so I used TF-IDF to vectorize the tags.
This is what I process in the Ratings.csv dataset:
- “Extract” whether the movie is rated or not
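A minimal sketch of the three vectorizers, using scikit-learn. The toy rows, the column names, and the decision to stack the three feature blocks side by side are assumptions for illustration. I also assume the tags have already been joined into one string per movie.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for the movie table (hypothetical rows, tags pre-joined per movie).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
    "tags": ["pixar fun", "fantasy game", "heist crime"],
})

# Genres are pipe-separated, so a token is any run of characters without "|".
genre_vec = CountVectorizer(token_pattern=r"[^|]+", lowercase=False)
title_vec = CountVectorizer()   # plain term counts for titles
tag_vec = TfidfVectorizer()     # TF-IDF weights for tags

genre_features = genre_vec.fit_transform(movies["genres"])
title_features = title_vec.fit_transform(movies["title"])
tag_features = tag_vec.fit_transform(movies["tags"])

# One sparse feature row per movie: [genres | title | tags].
movie_features = hstack([genre_features, title_features, tag_features]).tocsr()
```

Each row of `movie_features` describes one movie, which is the shape a similarity computation between movies would need.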
- “Extract” the movie-user ratings
- Split the rating dataset into 5 folds for cross-validation.
Each fold goes through the following process:
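The three steps in the list above could look something like this. It is a sketch on a tiny made-up ratings table. The column names follow the MovieLens convention, and using `KFold` with shuffling is my assumption about how to do the split.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for Ratings.csv (hypothetical rows).
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 40, 20, 30, 40, 10, 30],
    "rating":  [5.0, 3.0, 4.0, 2.0, 5.0, 4.5, 1.0, 3.5, 4.0, 5.0],
})

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for train_idx, test_idx in kfold.split(ratings):
    train, test = ratings.iloc[train_idx], ratings.iloc[test_idx]
    # User-movie rating matrix from the training rows only; 0 means "no rating".
    rating_matrix = train.pivot_table(
        index="userId", columns="movieId", values="rating", fill_value=0.0
    )
    # 1 where the user rated the movie, 0 otherwise (MovieLens ratings start at 0.5).
    rated_matrix = (rating_matrix > 0).astype(int)
    folds.append((rating_matrix, rated_matrix, test))
```

Building both matrices from the training rows only keeps the held-out ratings of each fold invisible to the later steps.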
First Step: Content-Based Filtering
The idea from the framework is to get the probability that a user will rate an item. I extracted the rated and non-rated movies for each user, encoded as 1 and 0 respectively. Then I “learn” the similarity between movies using cosine similarity on the movie features. From that, for every movie, I find the maximum similarity to the movies the user has rated and use it as the probability. If a user has not rated any movie (cold start), I assume the user preference is 1 for every movie. This means that for such a user the ranking relies only on the CF predicted rating.
Second Step: Collaborative Filtering
For Collaborative Filtering, I used the SVD library from Scikit-learn to train on every fold. The result of this process is a matrix of predicted ratings for every user-item pair.
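Here is a small sketch of this step on toy data. The function name `rating_probability` and the example arrays are made up for illustration, and the loop over users is written for readability rather than speed.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rating_probability(movie_features, rated_mask):
    """Max cosine similarity of each movie to the movies a user has rated.

    movie_features: (n_movies, n_features) array
    rated_mask:     (n_users, n_movies) array of 0/1
    """
    n_users, n_movies = rated_mask.shape
    similarity = cosine_similarity(movie_features)   # (n_movies, n_movies)
    prob = np.ones((n_users, n_movies))              # cold-start default: 1 everywhere
    for u in range(n_users):
        rated = np.flatnonzero(rated_mask[u])
        if rated.size:
            prob[u] = similarity[:, rated].max(axis=1)
    return prob

# Three toy movies with three features; user 0 rated movie 0, user 1 rated nothing.
movie_features = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
rated_mask = np.array([[1, 0, 0], [0, 0, 0]])
prob = rating_probability(movie_features, rated_mask)
```

Note that a movie the user already rated has similarity 1 with itself, so it also gets probability 1 in this sketch.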
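A rough sketch of what this could look like with `TruncatedSVD` from scikit-learn. The toy matrix and the number of components are assumptions, and treating unrated cells as 0 is a simplification of this sketch, not necessarily what a tuned setup would do.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy user-movie rating matrix (rows: users, columns: movies, 0 = not rated).
rating_matrix = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(rating_matrix)       # (n_users, 2)
movie_factors = svd.components_                       # (2, n_movies)

# Low-rank reconstruction: one predicted rating for every user-movie pair.
predicted_ratings = user_factors @ movie_factors
```

The reconstruction fills in the cells that were 0 in the input, and those filled-in values are what I read as predicted ratings.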
Generate Recommendation
Assumption: This recommender system should give the top 20 recommendations of movies that have been rated 5.
The probability from the CBF step is multiplied by the predicted rating, as the two-step framework defines. Then I sort the movies by the result of this multiplication and select only the top 20 movies for each user in every fold.
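In code, the multiply-and-sort step can be sketched like this. The function name `top_n` and the toy arrays are hypothetical, and the returned values are column indices, which would still need to be mapped back to movie IDs.

```python
import numpy as np

def top_n(prob, predicted_ratings, n=20):
    """Return the indices of the n highest-scoring movies for every user."""
    scores = prob * predicted_ratings                    # element-wise, (n_users, n_movies)
    order = np.argsort(-scores, axis=1, kind="stable")   # descending score per user
    return order[:, :n]

# Two toy users and three toy movies.
prob = np.array([[1.0, 0.5, 0.2],
                 [1.0, 1.0, 1.0]])
predicted = np.array([[3.0, 5.0, 4.0],
                      [2.0, 4.5, 1.0]])
recommendations = top_n(prob, predicted, n=2)
```

For the first toy user, the movie with the highest predicted rating (5.0) drops to second place because its rating probability is only 0.5.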
Last Step: Evaluation
The evaluation method required by my class task is recall@20.
The mean recall score of this modified two-step framework is 0.3. The original paper got a better result. I think my version gives the system “too much information” to consider, so it didn't reach the same result.
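For reference, recall@k can be computed as below. This is a plain-Python sketch. What counts as a "relevant" movie is an assumption here (for example, the held-out movies a user rated 5), and both function names are made up.

```python
def recall_at_k(recommended, relevant, k=20):
    """Share of the relevant items that appear in the first k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return None  # undefined when the user has no relevant items
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(recs_by_user, relevant_by_user, k=20):
    """Average recall@k over the users that have at least one relevant item."""
    scores = [
        recall_at_k(recs_by_user.get(user, []), relevant, k)
        for user, relevant in relevant_by_user.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: user "a" gets both relevant movies back, user "b" gets none.
recs = {"a": [1, 2], "b": [3]}
relevant = {"a": {1, 2}, "b": {4}}
score = mean_recall_at_k(recs, relevant, k=20)
```

Users without any relevant movie in the test fold are skipped, because their recall is undefined.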
Conclusion
This recommendation system suffers when faced with cold-start data. On the other hand, the two-step framework gets high accuracy and is fairly scalable, since every step can be processed in parallel and doesn't depend too much on the result of another process.
I learned a lot from this project: I made many mistakes, found many things to learn, and there is a whole lot more to explore in the Recommendation System field.
References
[1] Çano, E., Morisio, M. Hybrid recommender systems: A systematic literature review. Intell Data Anal 21, 1487–1524 (2017). https://doi.org/10.3233/IDA-163209
[2] Zhao, X., Niu, Z., Chen, W. et al. A hybrid approach of topic model and matrix factorization based on two-step recommendation framework. J Intell Inf Syst 44, 335–353 (2015). https://doi.org/10.1007/s10844-014-0334-3