Recommendation systems with collaborative filtering to accelerate time to market

A hands-on guide to building and comparing memory-based and model-based collaborative filtering systems to quickly evaluate recommendation strategies.
Brett Young
Created on April 8|Last edited on May 6
Collaborative filtering recommender systems are the backbone of personalized experiences on platforms like Netflix, Amazon, and Spotify. Rather than depending on hand-crafted rules or manually labeled content metadata, these systems rely on patterns in user behavior - what people watch, click, rate, or purchase - to suggest new content that aligns with their preferences. The core idea is simple: users with similar past behavior are likely to enjoy similar things in the future.
This guide walks through the foundations of collaborative filtering, explores the differences between user-based, item-based, and model-based approaches, and shows how to build practical systems using real-world data.
If you'd like to jump straight to the tutorial, skip ahead to "Tutorial: Building a collaborative filtering-based movie recommender system". To learn more about recommendation systems with collaborative filtering, read on.
Table of contents
What is collaborative filtering?
Comparison with content-based filtering
Types of collaborative filtering recommender systems
Memory-based collaborative filtering
Model-based collaborative filtering
How collaborative filtering uses similarities between users and items
Advantages and disadvantages of collaborative filtering
Advantages of collaborative filtering
Disadvantages of collaborative filtering
Tutorial: Building a collaborative filtering-based movie recommender system
Implementing model-based collaborative filtering
Which method is best?
Conclusion
﻿
What is collaborative filtering?
Collaborative filtering is a method of building recommendations using only user behavior - think ratings, clicks, purchases, or views. The system looks at the interactions between users and items and uses that data to predict what each user will probably interact with in the future. It does this by identifying users or items that behave similarly and transferring knowledge from one to the other.
The term "collaborative" refers to the idea that all users contribute data to help make predictions for each other. If User A and User B rate the same movies similarly, the system assumes that User A might enjoy movies that User B has rated highly but hasn’t seen yet. This type of signal is powerful because it can reveal connections that aren’t obvious from item metadata. Collaborative filtering operates purely on interaction data and doesn't care about what the items are.
Comparison with content-based filtering
Content-based filtering uses the attributes of items to recommend similar ones. For example, if a user liked a sci-fi movie starring a certain actor, the system might recommend other sci-fi movies or other movies with that actor. It relies on structured metadata and item descriptions to calculate similarity.
Collaborative filtering doesn’t need item features. It works purely off the crowd’s behavior. If 10,000 people who liked Movie A also liked Movie B, and you liked Movie A, then Movie B is worth showing to you - even if Movie B is in a different genre, from another country, or has no tags in common. Collaborative filtering learns these associations directly from how people behave, not from how content is described.
Types of collaborative filtering recommender systems
There are two main kinds of collaborative filtering: memory-based and model-based. They both use the same kind of data - user-item interactions - but they handle it differently. Memory-based methods look directly at the raw interaction matrix and use statistical similarity scores to recommend items. These are usually easier to implement and work well when the data is dense or the use case is simple. Model-based methods build a machine learning model that learns latent patterns in the data. These models are trained to generalize and are more scalable and robust for large-scale systems.
Memory-based collaborative filtering
Memory-based collaborative filtering comes in two flavors: user-based and item-based.
User-based collaborative filtering looks for other users who are similar to the target user. It uses similarity metrics like cosine similarity or Pearson correlation to figure out which users have similar taste. Once similar users are found, the system looks at what those users liked that the target user hasn’t seen and uses those items to make recommendations.
Item-based collaborative filtering flips that logic. It looks at items that are similar to the ones the user already liked. If someone rated "The Matrix" and "Inception" highly, and those two movies have high item similarity, the system might recommend "Tenet" because it behaves similarly in the data.
﻿
User-based collaborative filtering
User-based collaborative filtering predicts what you’ll like by finding other users who have similar tastes. Here’s exactly how it works:
First, the system represents every user by their ratings across all items. If User A rated "The Matrix" 5 stars, "Titanic" 1 star, and didn't rate "Inception," their rating vector might be [5, 1, NaN]. User B might have [5, 2, NaN], and User C might have [1, 5, 5].
Next, it measures similarity between users. It uses a metric like cosine similarity or Pearson correlation on their rating vectors. Users who rated many items similarly have a high similarity score.
Once similar users (neighbors) are identified, the system finds items that your neighbors rated highly but that you haven't interacted with yet. If User A has a neighbor with very similar tastes, and that neighbor rated "Inception" highly, User A would receive "Inception" as a recommendation.
The key point: user-based filtering is based entirely on similarity between user rating patterns - not on any attributes of the items themselves.
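As a rough illustration of the similarity step, here is a minimal sketch that scores the three example users against each other. It assumes cosine similarity computed only over movies both users have rated; the helper name `cosine_on_co_rated` is made up for this example, and the tutorial code later in this guide takes a different route by filling missing ratings with zeros.

```python
import math

# Rating vectors from the example above; None marks an unrated movie.
# Column order: "The Matrix", "Titanic", "Inception"
ratings = {
    "A": [5, 1, None],
    "B": [5, 2, None],
    "C": [1, 5, 5],
}

def cosine_on_co_rated(u, v):
    """Cosine similarity restricted to the items both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 0.0  # no overlap, so no evidence of similarity
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)

sim_ab = cosine_on_co_rated(ratings["A"], ratings["B"])
sim_ac = cosine_on_co_rated(ratings["A"], ratings["C"])
print(f"A-B: {sim_ab:.3f}, A-C: {sim_ac:.3f}")
```

User A comes out far closer to User B (about 0.98) than to User C (about 0.38), so User B would be the neighbor whose ratings carry the most weight for User A.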
Item-based collaborative filtering
Item-based collaborative filtering takes a different approach. It doesn't directly find similar users. Instead, it finds items that were rated similarly by the user population as a whole.
Here's how it works in practice:
First, each item is represented as a vector of all users' ratings for that item. For example, "The Matrix" could be [5, 5, 1] from Users A, B, and C, respectively, while "Inception" might be [NaN, NaN, 5], and "Titanic" might be [1, 2, 5].
Then, the system calculates similarity between items based on user ratings (again using cosine similarity or Pearson correlation). If "The Matrix" and "Inception" received similar ratings from the same users, their item vectors will have high similarity.
Finally, recommendations are generated by looking at items you've rated positively and identifying other items that behave similarly across the full user base. So if a user rates "The Matrix" highly, and the system sees that many other users who liked "The Matrix" also rated "Inception" highly (and similarly), it would recommend "Inception" next.
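The item-side computation can be sketched the same way. This example reuses the three item vectors from above and, as an assumption borrowed from the tutorial code later in this guide, fills missing ratings with 0 before taking the cosine. The `cosine` helper and variable names are illustrative only.

```python
import numpy as np

# Item vectors from the example above, ordered by Users A, B, C.
# Missing ratings (NaN in the text) are replaced with 0 here.
items = {
    "The Matrix": np.array([5.0, 5.0, 1.0]),
    "Inception": np.array([0.0, 0.0, 5.0]),
    "Titanic": np.array([1.0, 2.0, 5.0]),
}

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = "Inception"
sims = {
    name: cosine(items[target], vec)
    for name, vec in items.items()
    if name != target
}
most_similar = max(sims, key=sims.get)

for name, score in sorted(sims.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{target} vs {name}: {score:.3f}")
```

With only three users, "Inception" lands closest to "Titanic", simply because User C is the only person who rated it. That is a useful reminder that item similarities are only as reliable as the number of users behind them.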
Why item-based collaborative filtering is NOT content-based filtering
Collaborative filtering does not use any explicit information about item features like genre, director, actors, or plot. It only uses the ratings given by users. Two movies could have completely different genres or actors, but if the same users rate them similarly, they are considered similar items. That's what differentiates item-based collaborative filtering from content-based filtering.
In short:
User-based filtering recommends items by finding similar users who rated items you haven't seen yet.
Item-based filtering recommends items by identifying items rated similarly across users, regardless of their actual content attributes.
Both types use similarity metrics to calculate closeness. Cosine similarity measures the angle between two rating vectors. Pearson correlation adjusts for differences in rating scale between users. These similarity scores are then used to weight the contributions of users or items when predicting a score.
Memory-based methods are easy to understand and implement. They work best when the user-item matrix is relatively dense and not too large. But they become slow and less effective as data grows in size and sparsity.
Model-based collaborative filtering
Model-based collaborative filtering uses algorithms that learn from the data. A common method is matrix factorization. This takes the sparse user-item interaction matrix and decomposes it into two lower-dimensional matrices - one for users and one for items. These matrices represent each user and each item as vectors in a shared latent space. The dot product between a user vector and an item vector gives the predicted rating.
During training, the model adjusts the user and item vectors to minimize the difference between predicted and actual ratings. Once trained, it can generalize to predict ratings for user-item pairs that weren’t in the training data.
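To make the training loop concrete, here is a small NumPy sketch of matrix factorization on the three-user toy matrix from earlier. It is a simplified stand-in, not the tutorial's model: it assumes plain stochastic gradient descent with a little L2 regularization, and the names `P`, `Q`, `lr`, `reg`, and `n_factors` are illustrative choices.

```python
import numpy as np

# Toy ratings: rows are users, columns are items, NaN marks a missing rating.
R = np.array([
    [5.0, 1.0, np.nan],
    [5.0, 2.0, np.nan],
    [1.0, 5.0, 5.0],
])
observed = [
    (u, i)
    for u in range(R.shape[0])
    for i in range(R.shape[1])
    if not np.isnan(R[u, i])
]

rng = np.random.default_rng(0)
n_factors = 2  # size of the latent space (arbitrary for this sketch)
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))  # user vectors
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))  # item vectors

def mse(P, Q):
    """Mean squared error over the observed ratings only."""
    return float(np.mean([(R[u, i] - P[u] @ Q[i]) ** 2 for u, i in observed]))

initial_error = mse(P, Q)
lr, reg = 0.02, 0.01
for _ in range(3000):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]      # prediction error for this rating
        p_u = P[u].copy()                # keep the old user vector for Q's update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
final_error = mse(P, Q)

print(f"MSE before: {initial_error:.3f}, after: {final_error:.3f}")
print("Predicted rating for user 0, item 2:", round(float(P[0] @ Q[2]), 2))
```

The last line predicts a rating for a user-item pair that was never observed, which is the generalization step described above. On a matrix this small the prediction is not meaningful; the point is only to show the mechanics.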
This approach handles sparsity better than memory-based methods. It’s also faster to serve recommendations once trained, since everything is reduced to vector math. You can also extend it with regularization, bias terms, deep learning, or hybrid models that mix in metadata or content-based features.
Model-based methods are preferred in large-scale systems because they’re scalable, fast at inference time, and can be continuously improved with more data.
How collaborative filtering uses similarities between users and items
Collaborative filtering predicts preferences by finding similarities between users and items based purely on their historical ratings or interactions. The basic assumption is straightforward: users who rated items similarly in the past will continue to share similar preferences. Likewise, items rated similarly by the same group of users are considered alike.
To calculate these similarities, memory-based collaborative filtering methods often use metrics such as cosine similarity and Pearson correlation. Cosine similarity treats each user's ratings as a vector, where each dimension corresponds to an item. It calculates the similarity by measuring the angle between these rating vectors in a multidimensional space. Two users who have rated many items similarly will have vectors pointing roughly in the same direction, yielding a high similarity score close to 1. If their ratings differ significantly, the vectors will diverge, pulling the similarity toward 0. With raw, non-negative ratings the score stays between 0 and 1; it can only become negative if the ratings are mean-centered first.
Here's the equation for Cosine Similarity: 
$$\text{cosine}(u, v) = \frac{\sum u_i v_i}{\sqrt{\sum u_i^2} \sqrt{\sum v_i^2}}$$
Pearson correlation also measures similarity, but it specifically captures how ratings vary linearly together, adjusting for individual rating scales. This means that if two users tend to rate the same movies above or below their own personal averages, their Pearson correlation will be high - even if one user rates everything higher overall. This metric thus accounts for individual rating habits, making it effective when user rating scales differ significantly.
Here's the equation for Pearson Correlation: 
$$r = \frac{\sum (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum (u_i - \bar{u})^2 \sum (v_i - \bar{v})^2}}$$
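A quick way to see the difference between the two metrics is to compute both on the same pairs of users. The sketch below implements the two formulas directly in plain Python; the rating vectors are made up for illustration and assume both users rated every item.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    """Pearson correlation: cosine similarity after mean-centering each vector."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den

# Same ranking of movies, but one user rates two points higher across the board.
generous, strict = [5, 4, 3], [3, 2, 1]

# Opposite preferences, hidden inside a narrow band of high ratings.
fan_of_odd, fan_of_even = [5, 4, 5, 4], [4, 5, 4, 5]

print(f"same taste:     cosine={cosine(generous, strict):.3f}  pearson={pearson(generous, strict):.3f}")
print(f"opposite taste: cosine={cosine(fan_of_odd, fan_of_even):.3f}  pearson={pearson(fan_of_odd, fan_of_even):.3f}")
```

In the first pair, Pearson reports a perfect match because it ignores the two-point offset. In the second pair, cosine similarity is still above 0.97 because all the ratings are high, while Pearson correctly reports that the two users disagree on every movie.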
Once calculated, these similarity scores determine how strongly a user’s or item's ratings influence recommendations. Higher similarity scores mean the ratings from that user or item receive greater weight when predicting unseen ratings. This mechanism allows collaborative filtering to identify hidden patterns in user preferences and item interactions without needing explicit information about item features.
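One common way to write this weighting for the user-based case is a similarity-weighted average over the nearest neighbors. The notation is an assumption introduced here: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $r_{v,i}$ is neighbor $v$'s known rating, and $N_k(u, i)$ is the set of the $k$ users most similar to $u$ who have rated $i$. Note that $u$ and $v$ name users here, whereas in the two equations above they stand for rating vectors.

```latex
\hat{r}_{u,i} = \frac{\sum_{v \in N_k(u, i)} \text{sim}(u, v)\, r_{v,i}}{\sum_{v \in N_k(u, i)} \left| \text{sim}(u, v) \right|}
```

The item-based version has the same shape, with the sum running over the items most similar to $i$ that user $u$ has already rated. This is also the form the tutorial code below uses: a dot product of similarities and ratings, divided by the sum of absolute similarities.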
This method of calculating similarity scores using cosine similarity or Pearson correlation is part of memory-based collaborative filtering. Memory-based systems rely on the full user-item interaction matrix and directly compute similarities between users or between items based on that matrix. The system does not train a model. Instead, it uses the existing data to make real-time comparisons and generate predictions based on the most similar users or items.
Model-based collaborative filtering works differently. It does not compute similarity scores directly. Instead, it learns to represent users and items as vectors in a lower-dimensional latent space. These vectors, or embeddings, are learned by training a machine learning model, often using matrix factorization or a neural network. The model learns to capture patterns in the data so that users and items with similar behavior or characteristics end up with similar embeddings. Recommendations are then made by computing the dot product of these vectors, which serves as a learned approximation of how likely a user is to interact with an item. Model-based methods are more scalable, can generalize better to sparse data, and allow for faster recommendation once the model is trained.
Advantages and disadvantages of collaborative filtering
Collaborative filtering provides personalized recommendations based entirely on user interactions, eliminating the need for extensive manual feature engineering or detailed item metadata. It adapts dynamically to user behavior, enabling discovery of new or unexpected content. Despite these strengths, collaborative filtering also faces several important limitations. It depends heavily on historical data, which leads to problems such as difficulty handling new users or items, scalability challenges with large datasets, and bias toward recommending popular items.
Advantages of collaborative filtering
Collaborative filtering creates personalized recommendations by directly leveraging user behavior. It naturally adapts to changing user preferences without human-curated metadata or rules. This approach also fosters unexpected discoveries, recommending items that users might not otherwise encounter through simple searches or content attributes. Additionally, collaborative filtering works effectively across different domains - movies, music, books, and ecommerce - since it requires no explicit domain knowledge or item features.
Disadvantages of collaborative filtering
Collaborative filtering struggles with the cold start problem, which occurs when a new user or new item lacks sufficient historical interaction data. Without this data, the system cannot accurately infer preferences or similarities. Another issue is data sparsity, meaning most users interact with only a tiny fraction of available items, making accurate recommendations difficult. Additionally, collaborative filtering tends to amplify popularity bias, disproportionately recommending items that are already popular while overlooking niche or diverse content. Finally, memory-based collaborative filtering methods can face significant scalability problems as datasets grow large, since computing similarities becomes computationally intensive.
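Sparsity and cold-start exposure are easy to measure before choosing a method. The sketch below assumes a long-format ratings table with the same column names as the tutorial's rating.csv; the rows themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical interaction log with the same columns as rating.csv.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 40, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = ratings["userId"].nunique()
n_items = ratings["movieId"].nunique()
n_observed = len(ratings.drop_duplicates(["userId", "movieId"]))

# Fraction of the user-item matrix that has no rating at all.
sparsity = 1 - n_observed / (n_users * n_items)

# Items with very few ratings are the ones most exposed to cold start.
ratings_per_item = ratings.groupby("movieId").size()
cold_items = ratings_per_item[ratings_per_item < 2].index.tolist()

print(f"Sparsity: {sparsity:.1%}")
print(f"Items with fewer than 2 ratings: {cold_items}")
```

Here 6 of the 16 possible user-item cells are filled, so the matrix is 62.5% empty. Real rating datasets are typically far sparser than this toy table.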
Tutorial: Building a collaborative filtering-based movie recommender system
In this tutorial, we're going to build a memory-based collaborative filtering recommender system. Later, we'll also cover a model-based approach using matrix factorization, but first we'll focus on memory-based methods, which make predictions by directly comparing users or items using similarity scores derived from historical interaction data.
We'll use the MovieLens 20M dataset, which contains user ratings for movies. The dataset includes:
movie.csv: movie metadata (movieId, title)
rating.csv: user-item interactions (userId, movieId, rating, timestamp)
The goal is to predict a user’s rating for a movie they haven’t seen, using only the ratings of other users or other movies.
To download the dataset, run the following command:
pip install gdown && gdown "https://drive.google.com/uc?id=1wIX7FjdUEeyRyi2oAZdtZLPzup0IySSS" -O file.zip && unzip file.zip
We’ll implement both user-based and item-based collaborative filtering. Here’s the code: 
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
﻿
# limits
n_users = 5000
n_movies = 1000
﻿
# load ratings
ratings = pd.read_csv('./rating.csv')
﻿
# list of (userId, movieId) pairs to predict
examples = [
    (1, 541),
    (2, 356),
    (3, 1210),
    (4, 593),
    (5, 2571),
    (6, 260),
    (7, 1196),
    (8, 480),
    (9, 2959),
    (10, 50),
]
﻿
# force-include users and movies from the examples
important_users = [uid for uid, _ in examples]
important_movies = [mid for _, mid in examples]
﻿
top_users = ratings['userId'].value_counts().head(n_users).index.tolist()
top_movies = ratings['movieId'].value_counts().head(n_movies).index.tolist()
﻿
top_users = list(set(top_users + important_users))
top_movies = list(set(top_movies + important_movies))
﻿
filtered = ratings[ratings['userId'].isin(top_users) & ratings['movieId'].isin(top_movies)]
﻿
# build matrix
user_movie_df = filtered.pivot_table(index='userId', columns='movieId', values='rating')
user_movie_matrix = user_movie_df.fillna(0)
﻿
# User-based similarity calculation
user_sim = cosine_similarity(user_movie_matrix)
user_sim_df = pd.DataFrame(user_sim, index=user_movie_df.index, columns=user_movie_df.index)
﻿
# Item-based similarity calculation
item_sim = cosine_similarity(user_movie_matrix.T)
item_sim_df = pd.DataFrame(item_sim, index=user_movie_df.columns, columns=user_movie_df.columns)
﻿
def predict_rating_user_based(user_id, movie_id, k=10):
    if movie_id not in user_movie_df.columns or user_id not in user_movie_df.index:
        return np.nan
﻿
    sims = user_sim_df.loc[user_id]
    movie_ratings = user_movie_df[movie_id]
    valid_users = movie_ratings.dropna().index
﻿
    sims = sims[valid_users]
    ratings = movie_ratings[valid_users]
﻿
    if sims.empty:
        return np.nan
﻿
    top_k_users = sims.sort_values(ascending=False).head(k)
    top_k_ratings = ratings.loc[top_k_users.index]
﻿
    pred = np.dot(top_k_users.values, top_k_ratings.values) / np.sum(np.abs(top_k_users.values))
    return pred
﻿
def predict_rating_item_based(user_id, movie_id, k=10):
    if movie_id not in user_movie_df.columns or user_id not in user_movie_df.index:
        return np.nan
﻿
    sims = item_sim_df[movie_id]
    user_ratings = user_movie_df.loc[user_id]
    valid_items = user_ratings.dropna().index
﻿
    sims = sims[valid_items]
    ratings = user_ratings[valid_items]
﻿
    if sims.empty:
        return np.nan
﻿
    top_k_items = sims.sort_values(ascending=False).head(k)
    top_k_ratings = ratings.loc[top_k_items.index]
﻿
    pred = np.dot(top_k_items.values, top_k_ratings.values) / np.sum(np.abs(top_k_items.values))
    return pred
﻿
def recommend_top_n_user_based(user_id, n=5, k=10):
    if user_id not in user_movie_df.index:
        return []
﻿
    rated = user_movie_df.loc[user_id].dropna().index
    unrated = [movie for movie in user_movie_df.columns if movie not in rated]
﻿
    predictions = []
    for movie_id in unrated:
        pred = predict_rating_user_based(user_id, movie_id, k)
        if not np.isnan(pred):
            predictions.append((movie_id, pred))
﻿
    top_n = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    return top_n
﻿
def recommend_top_n_item_based(user_id, n=5, k=10):
    if user_id not in user_movie_df.index:
        return []
﻿
    rated = user_movie_df.loc[user_id].dropna().index
    unrated = [movie for movie in user_movie_df.columns if movie not in rated]
﻿
    predictions = []
    for movie_id in unrated:
        pred = predict_rating_item_based(user_id, movie_id, k)
        if not np.isnan(pred):
            predictions.append((movie_id, pred))
﻿
    top_n = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    return top_n
﻿
# Example usage
top_recs_user_based = recommend_top_n_user_based(user_id=1, n=5, k=10)
top_recs_item_based = recommend_top_n_item_based(user_id=1, n=5, k=10)
﻿
print("User-based recommendations:")
for movie_id, score in top_recs_user_based:
    print(f"Movie {movie_id} → predicted rating: {score:.2f}")
﻿
print("\nItem-based recommendations:")
for movie_id, score in top_recs_item_based:
    print(f"Movie {movie_id} → predicted rating: {score:.2f}")
The code above translates the memory-based approach into pandas and NumPy: it computes user–user and item–item cosine similarities, then predicts a rating as a similarity-weighted average of neighbors’ known ratings.
How it works
Similarity matrices
User–user: Each user’s neighbor score comes from the cosine similarity of their rating vectors.
Item–item: Each movie’s neighbor score comes from the cosine similarity of its user-rating vector.
Prediction steps
User-based: For a given (user, movie), find neighbor users who rated that movie → weight their ratings by similarity → compute a weighted average.
Item-based: For a given (user, movie), find neighbor movies the user has rated → weight those ratings by similarity → compute a weighted average.
Implementation notes
Limit to the top 5,000 most active users and 1,000 most rated movies to keep similarity computations tractable.
Fill missing ratings with zeros before computing similarities so our matrices remain dense and vector operations stay simple.
Manually include any critical test user–movie pairs to ensure they aren’t dropped during filtering.
With the pivot table built and both similarity matrices in hand, predict_rating_user_based and predict_rating_item_based score individual (user, movie) pairs, and recommend_top_n_user_based and recommend_top_n_item_based rank a user's unrated movies so the two methods can be compared side by side.
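The prediction step is easier to follow on a handful of numbers. The sketch below isolates it from the rest of the pipeline: it assumes you already have a Series of similarities to the target user and a Series of those neighbors' ratings for one movie. The function name `weighted_average_prediction` and the neighbor labels are hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_average_prediction(sims, ratings, k=2):
    """Similarity-weighted average of the k most similar neighbors' ratings."""
    rated = ratings.dropna()               # only neighbors who rated the movie
    neighbor_sims = sims.loc[rated.index]
    top_k = neighbor_sims.sort_values(ascending=False).head(k)
    denom = np.abs(top_k).sum()
    if denom == 0:
        return np.nan                      # no usable neighbors
    return float(np.dot(top_k.values, rated.loc[top_k.index].values) / denom)

# Similarity of four neighbors to the target user (made-up values).
sims = pd.Series({"B": 0.9, "C": 0.3, "D": 0.6, "E": 0.95})
# Their ratings for the movie being predicted; E has not rated it.
ratings = pd.Series({"B": 5.0, "C": 1.0, "D": 4.0, "E": np.nan})

pred = weighted_average_prediction(sims, ratings, k=2)
print(f"Predicted rating: {pred:.2f}")
```

Neighbor E is the most similar user but is skipped because there is no rating to borrow. The prediction then comes from B and D: (0.9 × 5 + 0.6 × 4) / (0.9 + 0.6) = 4.6.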
Implementing model-based collaborative filtering
Now, we're going to build a model-based collaborative filtering recommender system using matrix factorization. Unlike memory-based methods, which rely on direct comparisons between users or items, model-based methods learn patterns by training a model that embeds users and items into a shared latent space. This allows the system to generalize well even when there’s little direct overlap in the data - a huge advantage for sparse datasets. We’ll use the same MovieLens dataset used earlier.
It includes two files:
one for movie metadata (movie.csv) and
another for ratings (rating.csv).
After dropping missing values, we map user and movie IDs to integer indices, which are needed for building embedding layers in PyTorch. This remapping ensures the IDs are dense and start at zero, which makes them compatible with the embedding layer's requirements.
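The remapping step can be sketched on a few rows. The data here is invented; the dictionary comprehension mirrors the pattern used in the full script below.

```python
import pandas as pd

# Hypothetical ratings with non-contiguous raw IDs.
ratings = pd.DataFrame({
    "userId":  [10, 42, 10, 7],
    "movieId": [541, 541, 260, 50],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

# Assign each raw ID a dense index in order of first appearance.
user2idx = {uid: i for i, uid in enumerate(ratings["userId"].unique())}
movie2idx = {mid: i for i, mid in enumerate(ratings["movieId"].unique())}

ratings["user"] = ratings["userId"].map(user2idx)
ratings["movie"] = ratings["movieId"].map(movie2idx)

# An embedding table needs one row per distinct ID.
n_users, n_movies = len(user2idx), len(movie2idx)
print(ratings[["userId", "user", "movieId", "movie"]])
```

Every index now falls in the range 0 to n_users - 1 (or n_movies - 1), so an embedding table with exactly that many rows has no unused slots and no out-of-range lookups.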
Once the data is preprocessed, we split it into training and test sets, then convert everything into PyTorch tensors. These tensors include user indices, movie indices, and the corresponding rating values. At this point, we define a matrix factorization model in PyTorch using two embedding layers - one for users and one for movies. The model predicts a rating by computing the dot product between the corresponding user and movie embeddings.
Here’s the full code used to build, train, and evaluate the model:
﻿
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import wandb
import torch._dynamo

# Fall back to eager execution if torch.compile hits an error
torch._dynamo.config.suppress_errors = True
# CUDA / CPU device handling
def get_device():
    """Return the CUDA device when available, otherwise fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    else:
        return torch.device("cpu")
﻿
device = get_device()
print(f"Using device: {device}")
﻿
# Load and preprocess data
ratings = pd.read_csv('./rating.csv')
ratings = ratings.dropna(subset=["userId", "movieId", "rating"])
﻿
user_ids = ratings['userId'].unique()
movie_ids = ratings['movieId'].unique()
user2idx = {user: i for i, user in enumerate(user_ids)}
movie2idx = {movie: i for i, movie in enumerate(movie_ids)}
﻿
ratings['user'] = ratings['userId'].map(user2idx)
ratings['movie'] = ratings['movieId'].map(movie2idx)
﻿
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)
﻿
train_users = torch.LongTensor(train_data['user'].values).to(device)
train_movies = torch.LongTensor(train_data['movie'].values).to(device)
train_ratings = torch.FloatTensor(train_data['rating'].values).to(device)
﻿
test_users = torch.LongTensor(test_data['user'].values).to(device)
test_movies = torch.LongTensor(test_data['movie'].values).to(device)
test_ratings = torch.FloatTensor(test_data['rating'].values).to(device)
﻿
# Model class
class MatrixFactorization(nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)
﻿
    def forward(self, user, movie):
        return (self.user_factors(user) * self.movie_factors(movie)).sum(1)
﻿
# Settings
n_users = len(user2idx)
n_movies = len(movie2idx)
n_chunks = 10
epochs = 1000
eval_every = 10
lr = 0.005
﻿
embedding_dims = [100, 10, 20, 50]
﻿
def get_chunks(tensor, n):
    return torch.chunk(tensor, n)
﻿
for embedding_dim in embedding_dims:
    wandb.init(project="mf-manual-batching", name=f"manual_dim_{embedding_dim}_chunks_{n_chunks}", config={
        "embedding_dim": embedding_dim,
        "chunks": n_chunks,
        "epochs": epochs
    })
﻿
    model = MatrixFactorization(n_users, n_movies, embedding_dim).to(device)
    model = torch.compile(model)  # PyTorch 2.x compile
﻿
    optimizer = optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
﻿
    ckpt_path = f"mf_dim{embedding_dim}_chunks{n_chunks}.pth"
    start_epoch = 0
﻿
    if os.path.exists(ckpt_path):
        checkpoint = torch.load(ckpt_path, map_location=device)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        start_epoch = checkpoint['epoch']
        print(f"[dim {embedding_dim}] Resumed from epoch {start_epoch}")
﻿
    for epoch in range(start_epoch, epochs):
        model.train()
        # Build the shuffle indices on the same device as the training tensors
        perm = torch.randperm(train_users.size(0), device=device)
        users_shuffled = train_users[perm]
        movies_shuffled = train_movies[perm]
        ratings_shuffled = train_ratings[perm]
﻿
        user_chunks = get_chunks(users_shuffled, n_chunks)
        movie_chunks = get_chunks(movies_shuffled, n_chunks)
        rating_chunks = get_chunks(ratings_shuffled, n_chunks)
﻿
        epoch_loss = 0.0
        for u, m, r in zip(user_chunks, movie_chunks, rating_chunks):
            optimizer.zero_grad()
            preds = model(u, m)
            loss = loss_fn(preds, r)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(u)
﻿
        if (epoch + 1) % eval_every == 0:
            model.eval()
            with torch.no_grad():
                val_preds = model(test_users, test_movies)
                val_rmse = torch.sqrt(loss_fn(val_preds, test_ratings)).item()
            avg_loss = epoch_loss / len(train_users)
            print(f"[Dim {embedding_dim}] Epoch {epoch+1} | Train Loss: {avg_loss:.4f} | Val RMSE: {val_rmse:.4f}", flush=True)
            wandb.log({"epoch": epoch + 1, "train_loss": avg_loss, "val_rmse": val_rmse})
﻿
        if (epoch + 1) % 10 == 0:
            torch.save({
                'epoch': epoch + 1,
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict()
            }, ckpt_path)
﻿
    if 1.0 in user2idx and 541 in movie2idx:
        uid = user2idx[1.0]
        mid = movie2idx[541]
        u_tensor = torch.LongTensor([uid]).to(device)
        m_tensor = torch.LongTensor([mid]).to(device)
        with torch.no_grad():
            pred = model(u_tensor, m_tensor).item()
        print(f"[Dim {embedding_dim}] Predicted rating for user 1 and movie 541: {pred:.2f}")
﻿
    wandb.finish()
﻿
After defining the model, we experiment with several different embedding sizes ([10, 20, 50, 100]) to see how dimensionality affects performance. For each embedding size, we train the model for 1000 epochs, using mean squared error loss and the Adam optimizer. Every 10 epochs, we evaluate on the test set and compute RMSE - a standard metric for rating prediction tasks.
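For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal plain-Python sketch, with made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

predicted = [4.0, 3.0, 5.0]
actual = [5.0, 3.0, 4.0]
score = rmse(predicted, actual)  # sqrt((1 + 0 + 1) / 3)
print(f"RMSE: {score:.3f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones. The result is in the same units as the ratings, so an RMSE of 0.8 means predictions are typically off by a bit under one star.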
We use Weights & Biases to log all metrics and track multiple runs. Each training run is clearly labeled with the embedding size so we can compare performance. This helps identify whether larger latent spaces actually lead to better generalization, or if performance plateaus beyond a certain size.
After training, we run a sample prediction using user 1 and movie 541 to inspect the model’s output. The final predicted rating comes from the dot product between the trained embeddings of that user and movie.
Here are the results for our training runs:
[W&B chart panel: Run set, 4 runs]
﻿
This matrix factorization setup gives us a strong, compact baseline for recommendation. It can be easily extended - we could add user or item biases, regularize the embeddings, or even introduce deeper architectures to capture nonlinear patterns. Compared to memory-based filtering, this model-based approach is more scalable and effective in handling cold-start-like gaps when rating data is sparse.
Which method is best?
You may be wondering which method you should choose for your recommendation system. It depends on several factors - the size of your dataset, how sparse the interactions are, whether you care about interpretability, and how much infrastructure you have for training and serving models.
Memory-based methods are simpler, more transparent, and easy to get running, but they struggle with large or sparse datasets.
Model-based methods are more scalable and tend to generalize better, especially when there’s limited overlap between users and items.
That said, the best approach is to just try both. Run a smaller-scale test of each method - maybe on a subset of your data - and compare results. See which approach gives better recommendations for your use case, or which one is easier to work with given your constraints. Don’t overthink it. Let the data and the results guide you.
I wrote a script that trains our best matrix factorization model for 1000 epochs, then compares its performance directly to memory-based methods using a shared test set of 500 samples. We evaluate user-based, item-based, and model-based predictions on the same examples and log final RMSE values to Weights & Biases using a side-by-side bar chart.
This setup gives a clear, apples-to-apples comparison between the approaches using the same dataset and evaluation protocol. If you’re trying to decide which approach works best for your use case, this type of small-scale experiment is a great place to start.
Here's the code: 
import os
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, pairwise
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import wandb
﻿
# W&B setup
wandb.init(project="recsys-comparison", name="mf_vs_mem_1000epoch", config={"embedding_dim": 10})
config = wandb.config
﻿
# Load data
ratings = pd.read_csv('./rating.csv')
n_users, n_movies = 5000, 1000
﻿
top_users = ratings['userId'].value_counts().head(n_users).index.tolist()
top_movies = ratings['movieId'].value_counts().head(n_movies).index.tolist()
﻿
# Copy the slice so index columns can be added later without a SettingWithCopyWarning
filtered = ratings[ratings['userId'].isin(top_users) & ratings['movieId'].isin(top_movies)].copy()
﻿
# Memory-based setup
user_movie_df = filtered.pivot_table(index='userId', columns='movieId', values='rating')
user_movie_matrix = user_movie_df.fillna(0)
﻿
user_sim_df = pd.DataFrame(pairwise.cosine_similarity(user_movie_matrix),
                           index=user_movie_df.index, columns=user_movie_df.index)
﻿
item_sim_df = pd.DataFrame(pairwise.cosine_similarity(user_movie_matrix.T),
                           index=user_movie_df.columns, columns=user_movie_df.columns)
﻿
def predict_user_based(user_id, movie_id, k=10):
    # Similarity-weighted average of movie_id's ratings from the k most similar users
    if movie_id not in user_movie_df.columns or user_id not in user_movie_df.index:
        return np.nan
    sims = user_sim_df.loc[user_id]
    movie_ratings = user_movie_df[movie_id].dropna()  # renamed to avoid shadowing the global `ratings`
    sims = sims[movie_ratings.index]
    if sims.empty:
        return np.nan
    top_k = sims.sort_values(ascending=False).head(k)
    denom = np.sum(np.abs(top_k))
    if denom == 0:
        return np.nan  # no usable neighbors
    return np.dot(top_k, movie_ratings[top_k.index]) / denom

def predict_item_based(user_id, movie_id, k=10):
    # Similarity-weighted average of the user's ratings on the k movies most similar to movie_id
    if movie_id not in user_movie_df.columns or user_id not in user_movie_df.index:
        return np.nan
    sims = item_sim_df[movie_id]
    user_ratings = user_movie_df.loc[user_id].dropna()
    sims = sims[user_ratings.index]
    if sims.empty:
        return np.nan
    top_k = sims.sort_values(ascending=False).head(k)
    denom = np.sum(np.abs(top_k))
    if denom == 0:
        return np.nan  # no usable neighbors
    return np.dot(top_k, user_ratings[top_k.index]) / denom

# Model-based setup: map raw IDs to contiguous indices for the embedding tables
user_ids = sorted(filtered['userId'].unique())
movie_ids = sorted(filtered['movieId'].unique())
user2idx = {uid: i for i, uid in enumerate(user_ids)}
movie2idx = {mid: i for i, mid in enumerate(movie_ids)}

# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning on the filtered slice
filtered = filtered.assign(
    user=filtered['userId'].map(user2idx),
    movie=filtered['movieId'].map(movie2idx),
)

train_data, test_data = train_test_split(filtered, test_size=0.2, random_state=42)

train_users = torch.LongTensor(train_data['user'].values)
train_movies = torch.LongTensor(train_data['movie'].values)
train_ratings = torch.FloatTensor(train_data['rating'].values)

test_users = torch.LongTensor(test_data['user'].values)
test_movies = torch.LongTensor(test_data['movie'].values)
test_ratings = torch.FloatTensor(test_data['rating'].values)

class MF(nn.Module):
    # Matrix factorization: predicted rating = dot product of user and movie embeddings
    def __init__(self, n_users, n_movies, n_factors=10):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)

    def forward(self, u, m):
        return (self.user_factors(u) * self.movie_factors(m)).sum(1)

model = MF(len(user2idx), len(movie2idx), config.embedding_dim)
optimizer = optim.Adam(model.parameters(), lr=0.005)
loss_fn = nn.MSELoss()

# Resume checkpoint if exists
start_epoch = 0
ckpt_path = "mf_checkpoint.pth"
if os.path.exists(ckpt_path):
    checkpoint = torch.load(ckpt_path)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    start_epoch = checkpoint['epoch']
    print(f"Resumed training from epoch {start_epoch}")

best_rmse = float('inf')
best_model_state = None

# Train for 1000 epochs (one full-batch gradient step per epoch)
for epoch in range(start_epoch, 1000):
    model.train()
    optimizer.zero_grad()
    preds = model(train_users, train_movies)
    loss = loss_fn(preds, train_ratings)
    loss.backward()
    print(f"train_loss: {loss.item():.4f}", flush=True)
    optimizer.step()

    if (epoch+1) % 10 == 0:
        model.eval()
        with torch.no_grad():
            val_preds = model(test_users, test_movies)
            val_rmse = torch.sqrt(loss_fn(val_preds, test_ratings)).item()
            wandb.log({"val_rmse": val_rmse, "epoch": epoch + 1})

            if val_rmse < best_rmse:
                best_rmse = val_rmse
                # Clone the tensors: state_dict() returns references that later
                # optimizer steps would otherwise overwrite in place
                best_model_state = {k: v.clone() for k, v in model.state_dict().items()}

        print(f"Epoch {epoch+1}, Val RMSE: {val_rmse:.4f}")

        # Save a resumable checkpoint on the same 10-epoch schedule
        torch.save({
            'epoch': epoch + 1,
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict()
        }, ckpt_path)

# Load best model (skipped when a resumed run had no epochs left to train)
if best_model_state is not None:
    model.load_state_dict(best_model_state)
model.eval()

# Evaluate on 500 samples
sampled_test = test_data.sample(n=500, random_state=123)
gt, user_preds, item_preds, model_preds = [], [], [], []

for _, row in sampled_test.iterrows():
    # iterrows() may upcast integer IDs to float, so cast them back for lookups
    uid, mid, rating = int(row['userId']), int(row['movieId']), row['rating']
    if uid not in user_movie_df.index or mid not in user_movie_df.columns:
        continue

    gt.append(rating)

    ub_pred = predict_user_based(uid, mid)
    ib_pred = predict_item_based(uid, mid)

    # Fall back to 0 when a memory-based method cannot produce a prediction
    user_preds.append(ub_pred if not np.isnan(ub_pred) else 0)
    item_preds.append(ib_pred if not np.isnan(ib_pred) else 0)

    u_idx = torch.LongTensor([user2idx[uid]])
    m_idx = torch.LongTensor([movie2idx[mid]])
    with torch.no_grad():
        mb_pred = model(u_idx, m_idx).item()
    model_preds.append(mb_pred)

# RMSE calc
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_user = rmse(gt, user_preds)
rmse_item = rmse(gt, item_preds)
rmse_model = rmse(gt, model_preds)

wandb.log({
    "final_eval/user_based_rmse": rmse_user,
    "final_eval/item_based_rmse": rmse_item,
    "final_eval/model_based_rmse": rmse_model,
    "comparison": wandb.plot.bar(
        wandb.Table(data=[
            ["User-Based", rmse_user],
            ["Item-Based", rmse_item],
            ["Model-Based", rmse_model]
        ], columns=["Method", "RMSE"]),
        "Method", "RMSE", title="Final RMSE Comparison"
    )
})

print("\n--- Final RMSE Comparison (500 test samples) ---")
print(f"User-based RMSE:  {rmse_user:.4f}")
print(f"Item-based RMSE:  {rmse_item:.4f}")
print(f"Model-based RMSE: {rmse_model:.4f}")

After running the script, we can navigate to W&B and visualize our results.
﻿
[Bar chart "Final RMSE Comparison" (User-Based, Item-Based, Model-Based) from the W&B run mf_vs_mem_1000epoch]
﻿
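As shown in the chart, the user-based collaborative filtering model achieved the lowest RMSE, outperforming both item-based and model-based methods on this test set. The model-based approach came in last, with slightly worse performance than item-based. While this might be surprising given the flexibility of learned embeddings, it reinforces the importance of empirical testing - sometimes simpler memory-based methods can still win out depending on the dataset and setup. One caveat about this particular setup: the script builds the user-movie matrix from all of the filtered ratings before the train/test split, so the memory-based predictors can see the held-out ratings they are scored on, while the matrix factorization model is fit only on the training split. Treat the memory-based numbers as optimistic until the matrix is rebuilt from training rows alone. This kind of side-by-side evaluation helps uncover what actually works in practice, not just in theory.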
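When reading a comparison like this, it is worth checking whether the memory-based lookups can see held-out ratings. In the script above, `user_movie_df` is pivoted from `filtered` before the split, so the neighbors used for a test pair can include that pair's own rating. Below is a minimal sketch of a stricter setup, assuming a small toy ratings frame and a hypothetical helper named `predict_user_based_train_only` (neither comes from the original script). It builds the matrix and similarities from training rows only:

```python
import numpy as np
import pandas as pd

# Toy ratings (illustrative data, not the rating.csv file used in the tutorial).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 4.0, 1.0, 5.0, 4.0, 2.0, 1.0, 2.0, 5.0],
})

# Hold out one known rating: user 1's rating of movie 30.
is_test = (toy["userId"] == 1) & (toy["movieId"] == 30)
train, test = toy[~is_test], toy[is_test]

# The pivot and the cosine similarities come from training rows only.
train_df = train.pivot_table(index="userId", columns="movieId", values="rating")
filled = train_df.fillna(0).to_numpy()
norms = np.linalg.norm(filled, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid dividing by zero for users with no ratings
unit = filled / norms
user_sim = pd.DataFrame(unit @ unit.T, index=train_df.index, columns=train_df.index)

def predict_user_based_train_only(user_id, movie_id, k=10):
    """Similarity-weighted average over other users' training ratings."""
    if user_id not in train_df.index or movie_id not in train_df.columns:
        return np.nan
    rated = train_df[movie_id].dropna()
    rated = rated.drop(index=user_id, errors="ignore")  # never use the user's own row
    if rated.empty:
        return np.nan
    top_k = user_sim.loc[user_id, rated.index].sort_values(ascending=False).head(k)
    denom = np.abs(top_k).sum()
    if denom == 0:
        return np.nan
    weighted = np.dot(top_k.to_numpy(), rated[top_k.index].to_numpy())
    return float(weighted / denom)

held_out_pred = predict_user_based_train_only(1, 30)
print(f"prediction for held-out pair (1, 30): {held_out_pred:.3f}")
```

Because the held-out rating never enters `train_df`, the prediction is based only on what the other two users thought of movie 30. The same idea applies to the item-based predictor.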
Conclusion
Collaborative filtering remains one of the most practical and widely used approaches in recommendation systems. Its strength lies in its ability to operate without needing item features or explicit metadata - just interaction data. Memory-based methods offer transparency and simplicity, making them useful for smaller datasets or when interpretability is important. Model-based methods, especially matrix factorization, scale more effectively and can uncover structure in sparse data.
Collaborative filtering gives you multiple tools to make personalized recommendations, but choosing the right method depends heavily on your data. Memory-based methods work best when you have dense, consistent interactions, while model-based approaches handle sparsity and scale better. There's no single correct approach; getting it right involves carefully matching your dataset's characteristics with the strengths of each method. Ultimately, success in collaborative filtering comes down to clearly understanding your users, your data, and the practical trade-offs behind each modeling choice.
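Since the choice depends so much on how dense your interaction data is, a quick density check can be a useful first step. The sketch below assumes a ratings table with `userId` and `movieId` columns like the one used in the tutorial. `interaction_density` is a hypothetical helper, and no particular threshold is implied:

```python
import pandas as pd

def interaction_density(ratings, user_col="userId", item_col="movieId"):
    """Fraction of the user x item grid that has an observed interaction."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    if n_users == 0 or n_items == 0:
        return 0.0
    # Count each (user, item) pair once, even if it appears in several rows.
    n_pairs = len(ratings.drop_duplicates([user_col, item_col]))
    return n_pairs / (n_users * n_items)

# Toy data for illustration: 4 observed pairs on a 3 x 3 grid.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})
density = interaction_density(toy)
print(f"density = {density:.3f}, sparsity = {1 - density:.3f}")
```

Running this on the filtered subset and on the full dataset gives a rough sense of how much the top-users, top-movies filter changes the problem before you compare methods.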
﻿
Tags: Articles, Recommender Systems, Plots, Tutorial, Intermediate