
Anomaly detection: An introduction with Python tutorial

Discover how anomaly detection algorithms like Isolation Forest, LOF, and One-Class SVM uncover unusual patterns in data to detect fraud, monitor systems, and ensure reliability across industries.
Created on October 30|Last edited on November 18
Anomaly detection is a data science technique for identifying data points, events, or patterns that deviate significantly from the norm. These unusual signals, known as anomalies, often reveal critical issues such as fraud, network threats, equipment problems, or even hidden opportunities within business data. By catching these deviations early, teams can trigger investigations, prevent losses, and respond quickly to rapidly changing risks.
To power this search for outliers, anomaly detection algorithms step in as specialized tools and mathematical methods. These algorithms analyze huge volumes of data, learn what "normal" looks like for a given process, and flag anything suspicious or unexpected.
You'll find methods ranging from decision tree models and clustering strategies to advanced machine learning approaches, such as Isolation Forest, Local Outlier Factor, and One-Class SVM. Together, these algorithms form the backbone of modern detection systems in industries such as finance, cybersecurity, and healthcare, automating the search for anomalies that would otherwise be missed.
Building an anomaly detector is just the beginning. The real test is whether it can reliably separate genuine anomalies from everyday noise. And as you'll discover in this tutorial, evaluating that performance isn't as straightforward as trusting the accuracy score.

When and where is anomaly detection used?

Imagine this: Your phone suddenly vibrates with a text notification -
ALERT: A $1000 debit on your Amex card ending with XXXX-1004 was blocked due to suspicious activity. Reply YES if this was you, NO to report fraud, or call 1-800-XXX-XXXX.🚨
You did not do it, nor have you shared your credit card details with anyone. Your heart skips a beat, but then you realize your bank has already flagged it and blocked the transaction.
That's anomaly detection at work, and it's quietly protecting systems all around us every single day.

Scenarios where anomaly detection is used

1. Anomaly detection in fintech

  • Credit card fraud detection: Algorithms track a customer's typical spending patterns, such as location, amount, and merchant type, and quickly flag transactions that don’t fit. For instance, a sudden large purchase abroad or multiple small debits in rapid succession often indicate fraud.
  • Market surveillance: Stock exchanges use anomaly detection to scan for "flash crash" events, front-running, or illegal insider trading, identifying patterns in trading volume or price that deviate from historical norms.
  • Risky loan application spotting: By analysing applicant data, lenders can automatically identify unusual profiles or behaviour that might signal fake information or increased default risk.

2. Anomaly detection in manufacturing / industrial IoT

  • Predictive maintenance: Sensors measure vibration, sound, and temperature on equipment to detect potential issues. Anomaly detection algorithms alert teams when readings deviate from typical values, enabling them to forecast breakdowns before they halt production.
  • Energy monitoring: Factories can use outlier detection on power usage data to quickly identify spikes caused by faulty equipment or unauthorized access, thereby reducing accident risk and operational costs.
  • Robotic process monitoring: Sudden, unusual movements or unexpected pauses in robotic arms can indicate mechanical trouble. This early detection helps prevent accidents and prolongs equipment life.

3. Anomaly detection in cybersecurity

  • Login abuse: Suspicious login patterns, such as access from unfamiliar locations, repeated failed attempts, or logins at odd hours, are red flags for brute-force or credential stuffing attacks.
  • Data breach prevention: If a user downloads large volumes of sensitive data or accesses files outside their authorized privileges, algorithms flag these outliers before a potential breach occurs.
  • Malware/intrusion detection: If your server’s traffic or resource consumption spikes outside normal range, it could signal a botnet attack or unauthorized resource use.

4. Anomaly detection in healthcare

  • Remote patient monitoring: Wearable devices continuously stream patient data, including heart rate, blood pressure, and oxygen levels. Anomaly detection models catch sudden or rare readings that may indicate heart attacks, respiratory distress, or equipment errors, powering timely intervention.
  • Medical imaging anomaly detection: AI models can scan X-rays, MRIs, or blood test results for minute irregularity patterns, helping radiologists or pathologists spot illnesses earlier.
  • Hospital equipment monitoring: Large outliers in device calibration data or failure rates prompt instant repairs or replacements, maintaining patient safety.

5. Anomaly detection in retail and e-commerce

  • Fake user/bot detection: During high-traffic sales or launches, spikes in activity may signal bots creating fake accounts, sniping deals, or manipulating reviews. Outlier algorithms pick up on these rapidly.
  • Returns and quality control: Sudden increases in product returns or clustered negative reviews may indicate defective batches or shifts in consumer sentiment.
  • User behaviour analytics: Unusual cart abandonment rates, or spikes in time-to-purchase, help identify UI/UX bugs or payment gateway problems.

6. Anomaly detection in social media & content platforms

  • Spam and abuse detection: Algorithms analyse posting frequency, the presence of suspicious links, and shifts in user sentiment to flag potential spam or abusive content early. These outliers often signal automated bot activity, coordinated misuse, or waves of inappropriate posts, making it easier for moderators to step in before the platform’s reputation takes a hit.
  • Fake profile/promotion: Rapid creation of new accounts, sudden surges in likes or follows, and unnatural patterns in user activity can indicate bots or fraudulent promotion schemes. Anomaly detection methods spot these trends at scale, triggering automated takedowns or review. This helps platforms maintain authentic engagement, protect real users, and reduce manipulation.

7. Anomaly detection in telecom/network monitoring

  • Quality of Service (QoS): Anomaly detection identifies unusual patterns, such as call drop rates, latency spikes, and bandwidth drops, that fall outside normal ranges, allowing for preemptive network maintenance.
  • Fraud detection: Unusual activity in SMS, international call usage, or sudden surges in data consumption can be early signs of telecom fraud, such as SIM card cloning or grey route messaging. By automatically flagging these anomalies, fraud teams can investigate fast, reduce revenue loss, and safeguard customer trust.

8. Anomaly detection in energy & utilities

  • Smart meter analytics: With anomaly detection, algorithms sift through millions of readings to flag unusual spikes, drops, or patterns in usage. Identifying these outliers can quickly reveal problems such as equipment failure, tampering, or energy theft. Early detection enables teams to dispatch maintenance crews, investigate suspicious incidents, and enhance billing accuracy, saving money for both providers and consumers.
  • Grid monitoring: Anomaly detection helps operators spot abnormal readings, such as unexpected drops, surges, or load imbalances, which are often precursors to outages, faults, or even dangerous equipment failures. By catching these anomalies in real time, it boosts grid reliability, keeps costs down, and ensures customers have a steady supply, even during demanding conditions.

How does anomaly detection work?

Anomaly detection works by analysing data points within datasets to establish baseline patterns of normal behaviour. The system then compares new observations against these learned patterns, flagging deviations that exceed predetermined thresholds.
Anomalies fall into three primary categories:
  1. Point anomalies: Individual data points that deviate significantly from the rest of the dataset. E.g., someone trying to withdraw $10,000 when they normally take out $100.
  2. Contextual anomalies: Data points that are unusual only within a specific situation or environment. Unlike point anomalies, which are outliers on their own, these become anomalous only when considered in their context, such as seasonality, time of day, or location. E.g., a temperature reading of 35°C might be perfectly normal in summer but highly unusual in winter.
  3. Collective anomalies: Groups of data points that, viewed together, deviate from normal patterns even though the individual points may not seem unusual. These anomalies are often characterized by a sequence or pattern of behaviour over time. E.g., 10-12 small purchases occurring within 3-5 minutes across different cities or countries: individually each looks fine, but together they are suspicious. Likewise, a sustained drop in a patient's vital signs might indicate a serious health issue even if each reading alone is within range.
The magic happens when we bring machine learning into the picture. Instead of manually writing rules for every possible off-script scenario (which would take forever and miss new types of fraud), we train algorithms to learn what 'normal' looks like and flag anything that doesn't match. These models become smarter over time, adapting to new patterns and identifying anomalies that traditional rule-based systems would miss completely.
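As a concrete (if naive) illustration of the point-anomaly case, a plain z-score rule already flags the odd withdrawal; the learned approaches below generalize this idea. The numbers here are toy assumptions for illustration:

```python
import numpy as np

# Toy example: withdrawals are mostly around $100; one $10,000 withdrawal
amounts = np.array([95, 110, 100, 105, 90, 98, 10000], dtype=float)

# Flag point anomalies with a simple z-score rule (> 2 standard deviations)
z = (amounts - amounts.mean()) / amounts.std()
anomalies = amounts[np.abs(z) > 2]
print(anomalies)  # only the $10,000 withdrawal is flagged
```

A fixed rule like this breaks down as soon as "normal" shifts with context, which is exactly why the learned methods below exist.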

Key techniques in anomaly detection

One size doesn't fit all, and thankfully we have options. Picking the right approach depends on your data and what you're trying to capture. Let's walk through a few approaches -

Supervised anomaly detection:

  • Requires labelled datasets containing both normal and anomalous samples
  • Trains predictive models to classify future data points based on learned patterns from historical examples
  • Common supervised algorithms include Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) classifiers.
  • If sufficient labelled data exists, this method generally achieves high accuracy.

Unsupervised anomaly detection:

  • Operates without any labelled training data
  • Makes 2 fundamental assumptions -
    • Only a small percentage of data contains anomalies
    • Anomalies exhibit statistical differences from normal samples
  • Clusters data using similarity measures, identifying points distant from cluster centres as potential anomalies.
  • If labelled anomaly examples are scarce or if we need to explore unseen anomaly types, this technique proves particularly valuable.

Semi-supervised anomaly detection:

  • A hybrid approach that combines labelled and unlabelled data.
  • By training on partially labelled datasets, models receive initial guidance, then autonomously label larger datasets through pseudo-labelling techniques.
  • The result: the best of both worlds.
So which approach should you choose? If you have plenty of labelled data and computational power, go supervised. Working with rare events and mostly unlabelled data? Unsupervised. Want the best of both worlds? Go semi-supervised.
In practice, though, most fraud detection systems lean heavily on unsupervised methods, because fraud patterns change continuously and new ones must be picked up in time.

Anomaly detection algorithms

Let's take a brief look at the algorithms that enable us to identify such odd cases. Rather than going deep into each one, I'll just point out what it is and how it helps with anomaly detection -

Isolation Forest

It's an unsupervised algorithm that uses ensemble learning with decision trees to detect outliers. The fundamental insight behind Isolation Forest is that anomalies, lying far from the data clusters, require fewer random partitions to isolate than normal data points.
Here's how it works -
  • The algorithm builds a bunch of random decision trees by picking random features and random split points.
  • For each data point, it counts the number of splits required to isolate it.
  • Anomalies get isolated quickly (short path lengths) because they're hanging out far from everyone else.
  • Normal points take more splits because they're embedded in the crowd.
  • After building many such trees, the algorithm averages the path lengths; a short average path means the point is probably an anomaly.
The beauty of Isolation Forest is its speed, as it doesn't need to calculate distances between points or fit complex distributions. It just splits randomly and counts. This makes it ideal for large datasets where other algorithms would struggle. Additionally, it handles high-dimensional data quite well, which is exactly what you need when analyzing transactions with tens of features.
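A minimal sketch of this in practice, using scikit-learn's IsolationForest on toy 2D data (the cluster and outlier values are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # dense cluster
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])          # far from the cluster
X = np.vstack([normal, outliers])

# contamination is our guess at the outlier fraction (2 of 202, ~1%)
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

flagged = np.where(labels == -1)[0]
print(flagged)  # the injected outliers (indices 200 and 201) are flagged
```

Because the two injected points sit far from the cluster, random splits isolate them in very few partitions, giving them short average path lengths and the most anomalous scores.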

Local Outlier Factor (LOF)

LOF employs density-based anomaly detection, computing anomaly scores by measuring local density deviation relative to surrounding neighbourhoods. Unlike global outlier detection methods, LOF considers both local and global density patterns, proving effective for datasets with varying density regions.
Instead of looking at the entire dataset at once, LOF asks: "How does this point's density compare to its neighbors?"
  • The algorithm finds each point's nearest neighbors, calculates the local density around that point and its neighbors, and then compares them.
  • If a point sits in a much less dense area than its neighbors, it gets a high LOF score, which means it's probably an outlier.
  • This local perspective enables LOF to catch anomalies that global methods may miss.
The catch with LOF is tuning the n_neighbors parameter (k). Set it too low and you'll get false positives; set it too high and you'll miss real anomalies. Typical advice is to set k somewhere between the smallest cluster size you expect and the largest anomaly group you want to catch.
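Here is a minimal sketch of that local perspective with scikit-learn's LocalOutlierFactor; the two clusters and the odd point are toy assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.3, size=(100, 2))   # tight cluster around the origin
sparse = rng.normal(5.0, 2.0, size=(50, 2))   # much looser cluster
odd_point = np.array([[1.5, 1.5]])            # near the dense cluster, but far for its density
X = np.vstack([dense, sparse, odd_point])     # odd point ends up at index 150

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
print(np.where(labels == -1)[0])
```

A single global distance threshold would struggle with this point (it is closer to the data than many sparse-cluster points are to each other), but LOF flags it because its local density is far below that of its dense-cluster neighbours.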

One-Class SVM

One-Class SVM adapts support vector machine principles for anomaly detection, learning decision boundaries encompassing normal data points in feature space. Unlike traditional multi-class SVMs that separate different classes, One-Class SVM focuses on defining a region containing normal instances, classifying points outside this boundary as anomalies.
The nu parameter controls how tight or loose that boundary is - basically, how paranoid you want your detector to be. It acts as an upper bound on the fraction of training points allowed to fall outside the boundary, so a higher nu means a tighter boundary around the dense core and more things flagged as anomalies.
One-Class SVM shines especially when you have complex, non-linear patterns in your data. The kernel functions let it capture such relationships that simpler methods might miss. The trade-off is the computational cost. SVMs get slower as datasets grow, so they're better for moderate-sized problems or as part of an ensemble with faster first-pass filters.
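A quick sketch of the nu effect with scikit-learn's OneClassSVM, trained on toy "normal-only" data (values assumed for illustration):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(300, 2))  # "normal" data only

# nu upper-bounds the fraction of training points treated as outliers
counts = {}
for nu in (0.01, 0.10):
    clf = OneClassSVM(kernel="rbf", gamma="auto", nu=nu).fit(X_train)
    counts[nu] = int((clf.predict(X_train) == -1).sum())  # -1 = outside the boundary
    print(f"nu={nu}: {counts[nu]} of {len(X_train)} training points flagged")
```

Raising nu tightens the learned boundary around the dense core, so more points fall outside it and get flagged.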
Apart from these, we also have:
  1. Autoencoders are neural networks that compress data into a smaller representation and then reconstruct it. The idea is that normal data reconstructs cleanly, while anomalies produce large reconstruction errors.
  2. DBSCAN, which uses density-based clustering to identify outliers as points in low-density regions.
  3. Decision trees can be adapted for anomaly detection, offering valuable interpretability when explaining why something was flagged.
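Of these, DBSCAN is the quickest to try: points it cannot assign to any dense cluster get the label -1. A minimal sketch with toy data, assumed for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.2, size=(100, 2))   # one dense cluster
outliers = np.array([[3.0, 3.0], [-3.0, 2.5]])  # isolated points
X = np.vstack([cluster, outliers])

# Points with too few neighbours within eps end up as noise (-1)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])  # the isolated points at indices 100 and 101
```

Note that DBSCAN's notion of an outlier depends entirely on eps and min_samples, so these need tuning for real data.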

Evaluating anomaly detection models

If you have built an anomaly detection model, amazing! But it is only half the battle. You need to know if it's actually working well, and traditional accuracy metrics can be misleading.​
Here's the problem - in anomaly detection, your classes can be wildly imbalanced. If 99% of transactions are normal and only 1% are fraudulent, a simplistic model that labels everything as “normal” would achieve 99% accuracy but would fail to detect any fraud cases. This just doesn't work out. Therefore, we need more sophisticated metrics to address this imbalance.​
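You can see this trap directly with scikit-learn (toy numbers, assumed for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions, 1% fraud
y_true = np.array([0] * 990 + [1] * 10)

# A "detector" that labels everything as normal
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")  # accuracy=0.99, recall=0.00
```

99% accuracy, zero fraud caught: exactly why the metrics below matter.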
Fig. 1: A reference confusion matrix

Precision

Measures the proportion of flagged anomalies that represent true anomalies, calculated as TP/(TP+FP), where TP = true positives and FP = false positives.
In plain terms: "When the model flags something as an anomaly, how often is it actually an anomaly?" Higher precision means fewer false alarms, which is particularly important when investigation costs are high. Imagine a fraud detection team that has to manually review every flagged transaction among millions when 90% of the flags are false positives.

Recall

Recall or sensitivity calculates the fraction of actual anomalies successfully detected, and is computed as TP/(TP+FN), where FN = false negatives.
In simpler English - "Of all the actual anomalies, how many did the model catch?" A high recall value means you're missing few true fraud cases. In applications where missing an anomaly can be catastrophic, like detecting equipment failures in nuclear plants, you want recall to be extremely high, even if it means investigating more false positives.

F1-Score

The F1-Score balances precision and recall through its harmonic mean: 2×(Precision×Recall)/(Precision+Recall). This metric provides a single performance measure, particularly useful when optimizing models on imbalanced datasets. This is super useful when comparing different models or tuning hyperparameters.​

ROC-AUC and PR-AUC

The ROC-AUC (Receiver Operating Characteristic-Area Under Curve) visualizes the trade-off between TP and FP rates across classification thresholds, offering aggregate performance views.
However, for highly imbalanced datasets (which anomaly detection generally involves), the Precision-Recall AUC proves more informative by focusing specifically on the minority anomaly class.

Threshold optimisation

Threshold selection warrants special attention because it directly determines operational performance: the threshold sets the anomaly-score boundary separating normal from anomalous predictions. Common methods include Youden's Index (maximizing the difference between true positive and false positive rates), F1-score maximization (selecting the threshold that achieves the best precision-recall balance), and cost-sensitive thresholds (incorporating the business costs of false positives versus false negatives).
Every model outputs some kind of score, and you need to pick a threshold that separates normal from anomalous. Set it too conservatively and you'll miss real fraud; set it too aggressively and you'll drown investigators in false alarms. Smart approaches consider business costs, such as the financial impact of missing fraud versus investigating a false positive, and then optimise accordingly.
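One data-driven way to pick that threshold is to sweep the precision-recall curve and take the point that maximizes F1. A sketch with synthetic anomaly scores (the score distributions are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic scores: anomalies (label 1) tend to score higher than normals
y_true = np.array([0] * 950 + [1] * 50)
scores = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(4.0, 1.0, 50)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1))  # index of the threshold with the best F1

print(f"best threshold = {thresholds[best]:.3f}, F1 at that threshold = {f1[best]:.3f}")
```

In production you would replace the F1 objective with the actual business costs of a missed fraud versus a false alarm.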

Observability tools

This is where W&B Weave becomes incredibly valuable. Instead of just seeing aggregate metrics, Weave gives you complete visibility into individual predictions, model behavior, and performance trends.
You can trace exactly why a specific transaction was flagged, see how performance changes over time, compare different model versions side-by-side, and identify patterns in your errors. This observability transforms model evaluation from guesswork into data-driven decision-making.

Challenges

Fig. 2 Challenges in Implementing Anomaly Detection Systems

Issue #1: Data quality issues

Effects:
  • A model may learn from bad data, leading to unreliable predictions
  • Increased false positives/negatives
  • The model breaks when the data structure or scale changes unexpectedly
Combat Techniques:
  • Build strong data validation pipelines (catching missing values, duplicates, incorrect types, etc.)
  • Monitor data quality continuously, not just once before training
  • Use observability tools like W&B Weave to track and visualise dataset changes over time
  • Set up alerts for sudden shifts in data distribution so you can fix problems early

Issue #2: Class imbalance

Effects:
  • Models can become lazy, always predicting the majority class (usually 'normal')
  • Real anomalies get missed, lowering recall dramatically
  • The accuracy metric becomes misleading since anomalies usually are <1% of the data
Combat Techniques:
  • Use algorithms built for rare event detection, like Isolation Forest, One-Class SVM, etc.
  • Try resampling techniques like SMOTE to oversample the rare class, or under-sample the majority class
  • Weight your loss function to penalise missing anomalies more heavily (cost-sensitive learning)
  • Focus on precision, recall, and F1-score, not only on vanilla accuracy.
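Several of these fixes are one parameter away in scikit-learn. For instance, cost-sensitive learning in a supervised setting can be as simple as class_weight='balanced' (toy data, assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# Imbalanced toy data: 1,000 normal points, 20 anomalies (~2%)
X = np.vstack([rng.normal(0.0, 1.0, (1000, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 1000 + [1] * 20)

recalls = {}
for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw).fit(X, y)
    recalls[cw] = recall_score(y, clf.predict(X))
    print(f"class_weight={cw}: anomaly recall = {recalls[cw]:.2f}")
```

Weighting the loss makes each rare anomaly count as much as dozens of normal points, shifting the decision boundary toward the majority class and recovering recall.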

Issue #3: Model overfitting & underfitting

Effects:
  • Overfitting: Model memorises noise instead of useful patterns, failing on new data
  • Underfitting: Model too simple, missing important signals and anomalies
Combat Techniques:
  • Use regularization (L1/L2) to avoid overfitting
  • Apply cross-validation to check model strength on unseen data
  • Try ensemble methods (e.g., combining multiple detectors) to boost performance
  • Monitor training and validation loss/metrics to stay ahead of both problems
  • For certain anomaly problems, very mild overfitting may actually help find subtle outliers, but try not to go overboard with it.

Issue #4: Concept drift

Effects:
  • Model performance drops as real-world data changes
  • New fraud tactics, changing customer behaviour, or equipment wear-and-tear can confuse the model
  • Once-good models slowly become useless
Combat Techniques:
  • Continuously monitor for performance drops and data distribution shifts
  • Automate retraining pipelines so models stay up-to-date
  • Track metrics and data drift with monitoring tools so you get alerted before a disaster
  • Validate model on recent/rolling data as well as old test sets

Issue #5: Threshold tuning

Effects:
  • If the threshold is too high: Missed anomalies (false negatives)
  • If the threshold is too low: Flood of false positives, swamping investigators
Combat Techniques:
  • Set thresholds based on your requirements, i.e., how many alerts can your team handle?
  • Use ROC analysis, Youden’s Index, or precision-recall curve for data-based thresholding
  • Continuously monitor and adjust thresholds in production as data evolves
  • Use observability tools to analyse how predictions change at different thresholds

Building a financial fraud detection system in Python

Time to get hands-on!
We'll build a complete fraud detection system using Isolation Forest and log everything with W&B Weave for maximum observability.
# Install all the required dependencies
!pip install weave wandb scikit-learn pandas numpy matplotlib seaborn
Now we'll import all the necessary libraries -
import weave
import wandb
import pandas as pd
import numpy as np

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             classification_report, roc_auc_score, confusion_matrix)

import matplotlib.pyplot as plt
import seaborn as sns
Replace your-entity and fraud-detection with your W&B entity and project name. Here we initialise both Weave and wandb for the project -
# Initialise Weave with your project
weave.init("your-entity/fraud-detection")

# Also initialise standard wandb for model tracking
wandb.init(
    project="fraud-detection-tutorial",
    name="isolation-forest-experiment",
    config={
        "algorithm": "IsolationForest",
        "contamination": 0.01,
        "n_estimators": 100,
        "random_state": 100
    }
)
On executing this, you would see the following -



Generating the dataset

Now that WandB and Weave are initialized, it's time to get a dataset to train the model. To avoid depending on an external dataset that might be deleted, I've written a function that generates a randomized dataset instead of picking one from Kaggle or a similar source.
We’ll first create two sets of data: one set of normal transactions and another of fraudulent ones.
Each set will have columns for -
  • amount,
  • daily frequency,
  • average amount in the last 30 days,
  • time since the last transaction,
  • merchant type, and
  • whether it was fraudulent.
After creation, we’ll concatenate the two sets and shuffle them to ensure random distribution for splitting.
@weave.op()
def generate_financialdata(n_normal=9900, n_fraud=100, random_seed=100):
    np.random.seed(random_seed)

    # Normal transactions with typical patterns
    normal_data = pd.DataFrame({
        'amount': np.random.normal(100, 50, n_normal),
        'daily_frequency': np.random.poisson(3, n_normal),
        'avg_amount_30d': np.random.normal(95, 30, n_normal),
        'time_since_last': np.random.exponential(10, n_normal),
        'merchant_type': np.random.choice(['retail', 'food', 'gas', 'online'], n_normal),
        'is_fraud': 0
    })

    # Fraudulent transactions with unusual patterns
    fraud_data = pd.DataFrame({
        'amount': np.random.normal(500, 200, n_fraud),         # Much higher amounts
        'daily_frequency': np.random.poisson(8, n_fraud),      # Way more frequent
        'avg_amount_30d': np.random.normal(450, 150, n_fraud),
        'time_since_last': np.random.exponential(2, n_fraud),  # Rapid succession
        'merchant_type': np.random.choice(['retail', 'food', 'gas', 'online'], n_fraud),
        'is_fraud': 1
    })

    # Combine and shuffle to mix things up
    df = pd.concat([normal_data, fraud_data], ignore_index=True)
    df = df.sample(frac=1, random_state=random_seed).reset_index(drop=True)

    print(f"Generated {len(df)} transactions ({df['is_fraud'].sum()} fraudulent)")
    print(f"Fraud rate: {df['is_fraud'].mean():.2%}")
    return df

df = generate_financialdata()
Tracking TRACE log on Weave
Checking the first five entries of the generated dataframe using df.head() will show something like this -

Let us now understand what we need to do next -
  1. We'll convert the merchant_type categorical feature into a numerical format using one-hot encoding (OHE).
  2. Split the dataset into a standard 80-20 ratio.
  3. Standardize the numerical features to ensure that they all have a similar scale using StandardScaler().
  4. Log parameters in the WandB project.
  5. Execute the function
We’ll wrap all these steps inside a function to log the entire process in Weave. Alternatively, you can execute each block inside a separate cell.
@weave.op()
def preprocess_transactions(df):
    # One-hot encode the categorical merchant_type column
    df_ohe = pd.get_dummies(df, columns=['merchant_type'], drop_first=True)
    X = df_ohe.drop('is_fraud', axis=1)
    y = df_ohe['is_fraud']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=100, stratify=y)

    # Standardise features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Log split statistics to wandb
    wandb.log({
        "train_size": len(X_train),
        "test_size": len(X_test),
        "train_fraud_rate": y_train.mean(),
        "test_fraud_rate": y_test.mean(),
        "n_features": X_train.shape[1]
    })

    print("Preprocessing is complete!")
    print(f"Train set size: {len(X_train)}")
    print(f"Test set size: {len(X_test)}")
    print(f"Features: {list(X_train.columns)}")

    return X_train_scaled, X_test_scaled, y_train, y_test, scaler

X_train_scaled, X_test_scaled, y_train, y_test, scaler = preprocess_transactions(df)
Output:


Let's now train the model with Isolation Forest

@weave.op()
def train_isolation_forest(X_train, contamination=0.01,
                           n_estimators=100, random_state=100):
    print(f"Training isolation forest with {n_estimators} trees")
    model = IsolationForest(n_estimators=n_estimators,
                            contamination=contamination,
                            random_state=random_state,
                            max_samples='auto',
                            verbose=0)
    model.fit(X_train)
    print("Training complete!")
    return model

isoforest_model = train_isolation_forest(X_train_scaled, contamination=0.01)
Now that we have a trained Isolation Forest model, we can use it to identify potential fraudulent transactions in our test dataset.
The next steps involve:
  1. Using the trained model to predict anomalies in the test data.
  2. Obtaining anomaly scores for each transaction, which indicate how "unusual" a transaction is.
  3. Converting the model's raw output into a clear binary classification (fraud/normal).
  4. Counting and reporting the number of transactions flagged as potential fraud.
@weave.op()
def detect_fraud(model, X_test):
    raw_predictions = model.predict(X_test)       # -1 = anomaly, 1 = normal
    anomaly_scores = model.score_samples(X_test)  # lower = more anomalous
    predictions = np.where(raw_predictions == -1, 1, 0)
    print(f"Detected {predictions.sum()} potential fraud cases")
    return predictions, anomaly_scores

# Run detection with full tracing
y_pred, scores = detect_fraud(isoforest_model, X_test_scaled)
Note:
  • For Isolation Forest, the model.predict() method returns -1 for anomalies (potential fraud) and 1 for normal data points.
  • np.where(raw_predictions == -1, 1, 0) converts the raw predictions (-1 and 1) into a more intuitive binary format (1 for fraud, 0 for normal). You can also use a simple if-else condition to categorise this.
Output:

It appears that it has detected 21 potential fraud cases. Let's evaluate the model and look at the metrics:
@weave.op()
def evaluate_frauddetector(y_true, y_pred, scores):
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    # Negate the scores: lower score_samples output means more anomalous
    roc_auc = roc_auc_score(y_true, -scores)

    metrics = {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "roc_auc": roc_auc
    }

    # Logging to WandB
    wandb.log(metrics)

    print("\nClassification Report:")
    print(classification_report(y_true, y_pred,
                                target_names=['Normal', 'Fraud'],
                                digits=3))
    return metrics

metrics = evaluate_frauddetector(y_test, y_pred, scores)
Output:
Classification Report for Anomaly Detection using Isolation Forest
Function logging into Weave


Visualisation

On visualizing the anomaly score distribution, confusion matrix, ROC Curve, and the PR Curve, we get:
@weave.op()
def create_visualizations(y_true, y_pred, scores):
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Anomaly Score Distribution
    axes[0, 0].hist(scores[y_true == 0], bins=50, alpha=0.6,
                    label='Normal Transactions', color='#3498db')
    axes[0, 0].hist(scores[y_true == 1], bins=50, alpha=0.6,
                    label='Fraudulent Transactions', color='#e74c3c')
    axes[0, 0].set_xlabel('Anomaly Score', fontsize=11)
    axes[0, 0].set_ylabel('Count', fontsize=11)
    axes[0, 0].set_title('Anomaly Score Distribution', fontsize=12, fontweight='bold')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # 2. Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='RdYlGn_r', ax=axes[0, 1],
                xticklabels=['Normal', 'Fraud'],
                yticklabels=['Normal', 'Fraud'],
                cbar_kws={'label': 'Count'})
    axes[0, 1].set_title('Confusion Matrix', fontsize=12, fontweight='bold')
    axes[0, 1].set_ylabel('True Label', fontsize=11)
    axes[0, 1].set_xlabel('Predicted Label', fontsize=11)

    # 3. ROC Curve
    from sklearn.metrics import roc_curve
    fpr, tpr, _ = roc_curve(y_true, -scores)
    axes[1, 0].plot(fpr, tpr, linewidth=2.5, color='#2ecc71',
                    label=f'Model (AUC = {metrics["roc_auc"]:.3f})')
    axes[1, 0].plot([0, 1], [0, 1], 'k--', linewidth=1.5,
                    label='Random Classifier', alpha=0.5)
    axes[1, 0].set_xlabel('False Positive Rate', fontsize=11)
    axes[1, 0].set_ylabel('True Positive Rate', fontsize=11)
    axes[1, 0].set_title('ROC Curve', fontsize=12, fontweight='bold')
    axes[1, 0].legend(loc='lower right')
    axes[1, 0].grid(True, alpha=0.3)

    # 4. Precision-Recall Curve
    from sklearn.metrics import precision_recall_curve
    precision_vals, recall_vals, _ = precision_recall_curve(y_true, -scores)
    axes[1, 1].plot(recall_vals, precision_vals, linewidth=2.5, color='#9b59b6')
    axes[1, 1].set_xlabel('Recall', fontsize=11)
    axes[1, 1].set_ylabel('Precision', fontsize=11)
    axes[1, 1].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()

    # Logging to wandb
    wandb.log({"fraud_detection_analysis": wandb.Image(fig)})
    plt.show()

create_visualizations(y_test, y_pred, scores)

Note that the ROC and PR curves here look nearly ideal because the dataset is synthetic; on real-world data you should expect messier curves. Since this is a tutorial example, we'll take them at face value.
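One detail worth pausing on: `score_samples()` returns higher values for more normal points, but `roc_auc_score` expects higher scores for the positive (fraud) class, which is why the code negates the scores. A toy illustration with made-up labels and scores (purely to show the sign flip):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])            # 1 = anomaly
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1])  # higher = more normal (score_samples style)

auc_raw = roc_auc_score(y_true, scores)       # wrong: treats high scores as anomalous
auc_negated = roc_auc_score(y_true, -scores)  # right: negation ranks anomalies highest

print(auc_raw, auc_negated)  # → 0.0 1.0
```

Forgetting the negation silently inverts the metric (an AUC of 0.4 might really be 0.6), so it is worth double-checking in your own pipelines.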

Comparing Isolation Forest vs. Local Outlier Factor (LOF) vs. One-Class SVM

@weave.op()
def compare_algorithms(X_train, X_test, y_train, y_test):
    algorithms = {
        'Isolation Forest': IsolationForest(
            n_estimators=100, contamination=0.01, random_state=100),
        'Local Outlier Factor': LocalOutlierFactor(
            n_neighbors=20, contamination=0.01, novelty=True),
        'One-Class SVM': OneClassSVM(
            kernel='rbf', gamma='auto', nu=0.01)
    }

    results = []
    print("Comparing Algorithms...")
    for name, model in algorithms.items():
        print(f"Training {name}...")

        model.fit(X_train)  # Train

        # Predict
        raw_pred = model.predict(X_test)
        scores_test = model.score_samples(X_test)
        y_pred_algo = np.where(raw_pred == -1, 1, 0)

        # Evaluate
        precision = precision_score(y_test, y_pred_algo, zero_division=0)
        recall = recall_score(y_test, y_pred_algo, zero_division=0)
        f1 = f1_score(y_test, y_pred_algo, zero_division=0)
        roc_auc = roc_auc_score(y_test, -scores_test)

        results.append({
            'Algorithm': name,
            'Precision': f"{precision:.3f}",
            'Recall': f"{recall:.3f}",
            'F1-Score': f"{f1:.3f}",
            'ROC-AUC': f"{roc_auc:.3f}"
        })

        print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}\n")

    results_df = pd.DataFrame(results)

    # Log comparison table
    wandb.log({"algorithm_comparison": wandb.Table(dataframe=results_df)})
    print(results_df.to_string(index=False))
    return results_df

# Compare all algorithms
comparison = compare_algorithms(X_train_scaled, X_test_scaled, y_train, y_test)
Output:

Inference from the above output:
  • Isolation Forest performed extremely well, catching most anomalies while making very few mistakes. Its ROC-AUC of 1.0 indicates perfect separation between normal and anomalous data in the test set.
  • LOF struggled on this dataset. Most of the transactions it flagged were not actually anomalous, and it missed many real ones; its ROC-AUC of around 0.594 indicates weak discrimination.
  • One-Class SVM proved better than LOF, but still not as strong as Isolation Forest. It caught more anomalies than LOF, but its precision was lower than Isolation Forest's.
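LOF's weak showing may partly reflect its `n_neighbors` setting rather than the algorithm itself; LOF is notoriously sensitive to neighborhood size. A quick sketch of sweeping that parameter, on synthetic data (the numbers here are illustrative, not from the tutorial's dataset):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 2))                      # normal-only training data
X_test = np.vstack([rng.normal(size=(190, 2)),
                    rng.normal(loc=5.0, size=(10, 2))])  # 10 true anomalies
y_test = np.array([0] * 190 + [1] * 10)

aucs = {}
for k in (5, 20, 50):
    # novelty=True lets us score previously unseen points
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True)
    lof.fit(X_train)
    aucs[k] = roc_auc_score(y_test, -lof.score_samples(X_test))
    print(f"n_neighbors={k}: ROC-AUC={aucs[k]:.3f}")
```

On real fraud data the best `k` depends on how localized the fraud patterns are, so a sweep like this is worth running before writing LOF off.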
Once you open the Weave link generated in the output above, you'll be able to see and store something like this:
Logging compare_algorithms()
runs.summary["algorithm_comparison"] in Workspace

Production monitoring

The production_monitoring() function below demonstrates how to set up such a system by continuously analyzing the model's output on new data once it is launched into production.
If you are working on a hobby or student project, you may be eligible for free credits on GCP, AWS, or Azure. Use those credits to host your model in the cloud and get hands-on experience monitoring it with this setup.
💡
We compare each batch's anomaly rate against a threshold to decide whether to raise an alert; here we set alert_threshold to 0.05.
Based on the severity of the log message, you can trigger an automation job that sends an alert notification to the user or escalates to support staff to investigate and intervene.
@weave.op()
def production_monitoring(model, new_data, scaler, alert_threshold=0.05):
    # Preprocess (in production, the dummy columns must match training)
    X_new_processed = pd.get_dummies(
        new_data.drop('is_fraud', axis=1, errors='ignore'),
        drop_first=True
    )
    X_new_scaled = scaler.transform(X_new_processed)

    # Detect
    predictions = model.predict(X_new_scaled)
    scores = model.score_samples(X_new_scaled)

    # Calculate monitoring metrics
    anomaly_rate = np.sum(predictions == -1) / len(predictions)
    avg_score = np.mean(scores)
    flagged_count = np.sum(predictions == -1)

    monitoring_data = {
        "timestamp": pd.Timestamp.now(),
        "total_transactions": len(predictions),
        "flagged_fraud": int(flagged_count),       # Convert to standard int
        "anomaly_rate": float(anomaly_rate),       # Convert to standard float
        "avg_anomaly_score": float(avg_score),     # Convert to standard float
        "score_std_dev": float(np.std(scores))     # Convert to standard float
    }

    # Logging to wandb
    wandb.log(monitoring_data)
    print(f"Transactions processed: {len(predictions)}")
    print(f"Flagged as fraud: {flagged_count} ({anomaly_rate:.2%})")
    print(f"Average anomaly score: {avg_score:.4f}")

    # Alert if the anomaly rate is too high
    if anomaly_rate > alert_threshold:
        warning_msg = f"HIGH ANOMALY RATE: {anomaly_rate:.2%} exceeds threshold of {alert_threshold:.2%}!"
        print(f"\n{warning_msg}")

        wandb.alert(
            title="High Anomaly Rate Detected",
            text=warning_msg,
            level=wandb.AlertLevel.WARN
        )
    else:
        print("\nAnomaly rate within normal range")

    return monitoring_data

batch_results = production_monitoring(isoforest_model, df.sample(100), scaler)
batch_results
Output:
Hey! We got 1 fraud flag across 100 transactions.
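In a real deployment you would run this kind of check on a schedule, batch by batch. Below is a minimal, self-contained sketch of that batching pattern; the alert logic mirrors the function above, but the model, data, and drift simulation here are synthetic stand-ins, not the tutorial's pipeline:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
model = IsolationForest(contamination=0.01, random_state=7)
model.fit(rng.normal(size=(1000, 4)))  # train on "normal" historical data

alert_threshold = 0.05
alerts = []
for batch_id in range(3):
    # In production each batch would arrive from a stream or queue
    batch = rng.normal(size=(100, 4))
    if batch_id == 2:
        batch += 3.0  # simulate distribution drift in the final batch

    preds = model.predict(batch)            # +1 inlier, -1 outlier
    anomaly_rate = np.mean(preds == -1)
    if anomaly_rate > alert_threshold:
        alerts.append(batch_id)

print("batches that triggered alerts:", alerts)
```

The drifted batch trips the alert while the in-distribution batches stay quiet, which is exactly the behavior you want from a rate-based monitor.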
Wrapping up

Anomaly detection, a field at the crossroads of mathematics, machine learning, and real-world problem-solving, plays a crucial role in safeguarding your credit card, monitoring data centers, preventing costly manufacturing defects, and identifying cyber threats before they escalate into breaches. We explored three algorithms:
  • Isolation Forest with its clever path-length approach,
  • Local Outlier Factor with its neighborhood density analysis, and
  • One-Class SVM with its decision boundary learning
Each offers unique strengths suited to different scenarios. There's no single "best" algorithm; the choice depends on your data characteristics, computational constraints, and detection requirements.
Much of modern anomaly detection's power lies in observability. W&B Weave transforms opaque models into transparent systems, enabling you to trace predictions, understand decisions, and monitor drift. The @weave.op() decorator, simple as it looks, provides complete visibility into your machine learning pipeline without cluttering your code with logging boilerplate.
When building your own anomaly detection systems, remember that the algorithm is only part of the equation. Success hinges on clean data pipelines, thoughtful threshold tuning, continuous monitoring for concept drift, and observability tools that surface problems before they impact production. What distinguishes a research experiment from a production system is not just performance metrics, but reliability, maintainability, and visibility into live operations.
The financial fraud detection example we built illustrates these principles in action: from data generation and preprocessing to training, evaluation, and production monitoring, every step is traceable and debuggable. This observability accelerates development, simplifies troubleshooting, and builds stakeholder confidence in your AI systems. Regardless of algorithmic sophistication, the fundamental principles remain:
  • Understand your data
  • Select appropriate techniques
  • Rigorously evaluate and maintain visibility into model behaviour.
Tools like W&B Weave make this last principle achievable for individuals or teams of any size.
Iterate on AI agents and models faster. Try Weights & Biases today.