Recommendation systems are essential for providing personalized suggestions to users. Whether it’s suggesting movies, products, or music, these systems enhance the user experience by tailoring content to individual preferences. The real challenge appears when such a system has to run in a real-world scenario: with a single-machine ML framework in Python, one machine cannot keep up with the ever-growing size of the data, because the model has to be fed with more and more recent data.
Why Distributed Computing is Essential for Big Data in Recommendation Systems
This is where we need a distributed computing approach that runs the algorithm in parallel. With terabytes or even petabytes of data, it is impossible to load a dataset of that size onto a single machine. Not every algorithm works on a cluster of machines, and even when one does, it is of little use unless it can actually exploit the distributed setting. So we need a machine learning model (or framework) that can train on a dataset spread across a cluster of machines.
Method 1: ALS with Spark ML
Alternating Least Squares (ALS) is a matrix factorization algorithm. Its main advantage is that it can run in parallel, which satisfies our requirement of using a distributed computing technique to handle big data. ALS is implemented in Apache Spark ML and built for large-scale collaborative filtering problems.
The working mechanism of ALS
As mentioned earlier, ALS is a matrix factorization algorithm, with a twist: instead of running gradient descent on the whole problem, it alternates between two least-squares problems. Let’s take an example of 2 users and 3 products, whose ratings form a 2 × 3 user-product matrix.
The goal of Alternating Least Squares is to find two matrices, U and P, whose product is approximately equal to the original matrix of users and products. Once such matrices have been found, we can predict what user i will think of product j by multiplying row i of U with row j of P.
ALS minimizes two loss functions alternately. First it holds the user matrix fixed and solves a regularized least-squares problem for the product matrix, and then it holds the product matrix fixed and solves the analogous problem for the user matrix.
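As a toy illustration of these alternating updates (not the Spark implementation), here is a minimal NumPy sketch for the 2-user, 3-product example; the ratings matrix, rank, and regularization value are made up:

import numpy as np

# Made-up 2 x 3 ratings matrix: rows are users, columns are products
R = np.array([[5.0, 3.0, 1.0],
              [2.0, 4.0, 5.0]])
rank, reg = 2, 0.1

rng = np.random.default_rng(42)
U = rng.normal(size=(2, rank))   # user factors
P = rng.normal(size=(3, rank))   # product factors

for _ in range(20):
    # Hold P fixed and solve a regularized least-squares problem for U
    U = np.linalg.solve(P.T @ P + reg * np.eye(rank), P.T @ R.T).T
    # Hold U fixed and solve the analogous problem for P
    P = np.linalg.solve(U.T @ U + reg * np.eye(rank), U.T @ R).T

# Predicted rating of user i for product j = dot product of row i of U and row j of P
print(np.round(U @ P.T, 2))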
So, how does this run in parallel?
ALS runs these least-squares updates in parallel across multiple partitions of the underlying training data on a cluster of machines.
Implementation
You should have Spark set up on your system to use this. Start with a set of imports:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.recommendation import ALS
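For context, here is a minimal sketch of creating the session and preparing the data; the file name ratings.csv, its columns (userId, productId, rating), and the 80/20 split are assumptions:

spark = SparkSession.builder.appName("product-recommendations").getOrCreate()

# Assumed CSV with userId, productId and rating columns; adjust the path/schema to your data
ratings = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv("ratings.csv"))

training, test = ratings.randomSplit([0.8, 0.2], seed=42)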
Once the data is read and split into training and test sets, the ALS model can be set up:
als = ALS(maxIter=20,
          userCol="userId",
          itemCol="productId",
          rank=5,
          ratingCol="rating",
          regParam=1,
          coldStartStrategy="drop",
          seed=42)
The most important hyper-parameters are:
- maxIter: the maximum number of iterations to run (defaults to 10)
- rank: the number of latent factors in the model (defaults to 10)
- regParam: the regularization parameter in ALS (defaults to 0.1)
A pipeline can be built on top of this, followed by a grid search over the hyper-parameters (e.g., Spark ML’s ParamGridBuilder with CrossValidator) to get the best model.
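A minimal sketch of that step, assuming the training and test DataFrames from the split above; the grid values and the 3-fold setting are arbitrary:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

pipeline = Pipeline(stages=[als])

# Arbitrary example grid; widen or narrow the ranges for your data
param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [5, 10, 20])
              .addGrid(als.regParam, [0.05, 0.1, 1.0])
              .build())

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

cv_model = cv.fit(training)
best_model = cv_model.bestModel

predictions = best_model.transform(test)
print("Test RMSE:", evaluator.evaluate(predictions))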
Once the model is finalized, recommendations can be made on the test data for validation, and the model can then be placed in a workflow. The steps involved in the workflow can be as follows (a short sketch of the recommendation step appears after the list):
- A new user inputs ratings for different products, and the system creates new user-product interaction samples for the model
- The ALS model is retrained on the data including the new inputs
- Product data for inference is created
- Rating predictions for that user are made on all products
- The top N product recommendations for that user, ranked by predicted rating, are shown to the user
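A minimal sketch of the recommendation step, assuming the pipeline model from the grid search above; the user id 123 and N = 10 are placeholders:

# The fitted ALSModel is the last stage of the best pipeline model
als_model = best_model.stages[-1]

# DataFrame holding the user(s) to recommend for; 123 is a placeholder id
new_user = training.select("userId").filter("userId = 123").distinct()

# Top-10 product recommendations for that user
top_n = als_model.recommendForUserSubset(new_user, 10)
top_n.show(truncate=False)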
Method 2: Vector DBs
Vector DBs offer a very simple way to implement a recommendation system. They work on similarity search, which is useful in our earlier example of product recommendations, where the goal is to find products that are similar to those a user has already viewed, bought, or rated.
The working mechanism of Vector DBs
By representing products as vectors in a high-dimensional space, we can employ distance metrics (such as cosine similarity or Euclidean distance) to identify products that are ‘close’ to each other, indicating similarity.
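For instance, here is a minimal NumPy sketch of a cosine-similarity search over a few made-up product vectors; a vector database replaces this brute-force computation with an index, but the idea is the same:

import numpy as np

# Made-up embeddings: each row represents one product in a 4-dimensional space
products = np.array([[0.9, 0.1, 0.0, 0.3],
                     [0.8, 0.2, 0.1, 0.4],
                     [0.0, 0.9, 0.8, 0.1]])
query = np.array([0.85, 0.15, 0.05, 0.35])   # e.g. a product the user just viewed

# Cosine similarity = dot product of the L2-normalised vectors
scores = products @ query / (np.linalg.norm(products, axis=1) * np.linalg.norm(query))
print(np.argsort(-scores))   # product indices, most similar first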
As the number of items and users grows, so does the size of the database. Vector databases are designed to handle large-scale data while maintaining high query performance. This scalability is essential for recommendation systems, especially those used by large streaming platforms with extensive movie libraries and user bases.
The architecture is divided into 2 parts:
- Candidate Generation: a set of candidate products is retrieved by embedding the products’ rating and review text and querying the vector DB, after some initial filtering by product category, type, etc.
- Re-Ranking: re-ranking is carried out in the recommender system to arrange products according to the sentiments expressed in the textual information. With the assistance of large language models, we can obtain an opinion score for the textual data, and the products are then re-ranked for recommendation based on that score (a rough sketch of this two-stage flow follows the list).
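The sketch below outlines the two stages; vector_search and llm_opinion_score are hypothetical placeholders standing in for the vector-DB query and the LLM sentiment-scoring call, not real APIs:

from typing import Dict, List

def vector_search(query_vec, category: str, k: int) -> List[Dict]:
    """Placeholder for a vector-DB similarity query with a metadata filter."""
    return []   # a real system would return the k most similar products here

def llm_opinion_score(review_text: str) -> float:
    """Placeholder for an LLM call that scores the sentiment of review text."""
    return 0.0

def recommend(query_vec, category: str, top_k: int = 50, top_n: int = 10) -> List[Dict]:
    # Stage 1: candidate generation - similar products within an initial category filter
    candidates = vector_search(query_vec, category, k=top_k)
    # Stage 2: re-ranking - order candidates by the opinion score of their review text
    ranked = sorted(candidates,
                    key=lambda item: llm_opinion_score(item["review_text"]),
                    reverse=True)
    return ranked[:top_n]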
A Variation of Vector DB: ScaNN
ScaNN (Scalable Nearest Neighbors) is a library for scalable nearest-neighbor search. At the retrieval stage of a modern recommendation system, we need to find the dataset embeddings nearest to a given query embedding. The set of embeddings is often too large for exhaustive search, so a tool like ScaNN is used for approximate nearest-neighbor search.
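As a hedged sketch of how ScaNN is typically wired up (the builder parameters follow the pattern in ScaNN’s documentation, and the random embeddings are stand-ins for real item vectors):

import numpy as np
import scann

# Stand-in item embeddings: (n_items, dim), L2-normalised so dot product = cosine similarity
embeddings = np.random.rand(10000, 64).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

searcher = (scann.scann_ops_pybind.builder(embeddings, 10, "dot_product")
            .tree(num_leaves=100, num_leaves_to_search=10, training_sample_size=10000)
            .score_ah(2, anisotropic_quantization_threshold=0.2)
            .reorder(100)
            .build())

query = embeddings[0]   # e.g. the embedding of a user profile or a seed item
neighbors, distances = searcher.search(query, final_num_neighbors=10)
print(neighbors)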
Benefits:
ScaNN provides highly efficient vector similarity search, namely faster matching and retrieval of similar items from moderate-size and massive databases.
It includes state-of-the-art implementations of modern ANN techniques.
- Challenge of Exhaustive Search: In a recommendation system, you want to find items that match a user's preferences or past interactions. With a vast dataset, exhaustively comparing the user's profile to every item becomes computationally expensive and impractical.
- Approximate Nearest Neighbors (ANN): ScaNN excels at performing ANN searches. It efficiently identifies items that are close enough (similar) to the user's preference, even if not the absolute exact match. This "good enough" approach significantly reduces processing time while maintaining high recommendation quality.
- Vector Embeddings: ScaNN relies on vector embeddings to represent both users and items. These embeddings are dense, low-dimensional vectors that capture the essential characteristics of users and items.
- Similarity Search: The core function of ScaNN is to find similar vectors in the embedding space. It combines techniques such as partitioning the dataset into clusters (tree-based search), anisotropic vector quantization (compressing embeddings while preserving dot-product accuracy), and a final exact re-scoring of the best candidates to achieve fast and accurate similarity searches.
- Benefits for Recommendations: By efficiently finding the items (nearest neighbors) most similar to a user's preferences, ScaNN enables recommendation systems to:
  - Scale Effectively: Handle massive datasets of users and items without compromising responsiveness.
  - Serve Real-time Recommendations: Generate personalized recommendations quickly, enhancing user experience.
  - Balance Accuracy with Efficiency: Maintain a high degree of accuracy in recommendations while optimizing processing speed.