Reports
Evaluating o4-mini vs. Claude 3.7 vs. Gemini 2.5 Pro on code generation
A real-world head-to-head test of Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on competitive programming problems—built on a custom execution framework with Weave integration to track correctness, spot bugs, and cut through benchmark hype.
Building a GitHub repo summarizer with CrewAI
A hands-on guide to building a fully automated GitHub documentation system using CrewAI for multi-agent coordination and Weave for real-time debugging and observability.
How to fine-tune and evaluate Qwen3 with Unsloth
This article provides a comprehensive guide to fine-tuning, evaluating, and deploying the Qwen3 language model, emphasizing its flexibility, performance, and unique reasoning-toggle feature.
The Model Context Protocol (MCP): A guide for AI integration
This guide explores how MCP standardizes AI interactions with external tools and data sources, enabling more efficient AI context integrations.
Training GPT-4o to reason: Fine-tuning vs budget forcing
Can fine-tuning and budget forcing improve GPT-4o’s reasoning? We test structured datasets and inference-time techniques to boost multi-step problem-solving.
Budget forcing s1-32B: Waiting is all you need?
We test whether budget forcing - a simple test-time intervention - can significantly boost the reasoning accuracy of s1-32B, potentially enabling smaller models to rival closed-source giants like OpenAI's o1-preview.
DeepSeek-R1 vs OpenAI o1: A guide to reasoning model setup and evaluation
Discover the capabilities of DeepSeek-R1 and OpenAI o1 models for reasoning and decision-making. Includes setup guides, API usage, local deployment, and Weave-powered comparisons.
How to fine-tune a large language model (LLM)
Discover the process of fine-tuning large language models (LLMs) to enhance their performance for specific tasks or domains. Learn about methods, best practices, and challenges associated with LLM fine-tuning.
How to train and evaluate an LLM router
This tutorial explores LLM routers, inspired by the RouteLLM paper, covering training, evaluation, and practical use cases for managing LLMs effectively.
Grokking: Improved generalization through over-overfitting
One of the most mysterious phenomena in deep learning, grokking is the tendency of neural networks to improve generalization through sustained overfitting.
Training a KANFormer: KANs Are All You Need?
We will dive into a new experimental architecture, replacing the MLP layers in transformers with KAN layers!
Knowledge distillation: Teaching LLMs with synthetic data
Unlock the power of knowledge distillation by learning how to efficiently transfer complex model insights from teacher to student models, step by step.
How to fine-tune Phi-3 Vision on a custom dataset
Here's how to fine-tune a state-of-the-art multimodal LLM on a custom dataset.
Building a real-time answer engine with Llama 3.1 405B and W&B Weave
Infusing Llama 3.1 405B with internet search capabilities!
YOLOv9 object detection tutorial
How to use one of the world's fastest and most accurate object detectors to run inference, display results on your webcam using OpenCV, and track your results.
Fine-Tuning Llama-3 with LoRA: TorchTune vs HuggingFace
A battle between HuggingFace and TorchTune!
Building an LLM Python debugger agent with the new Claude 3.5 Sonnet
Building an AI-powered coding agent with Claude 3.5 Sonnet!
o3-mini vs. DeepSeek-R1: API setup, performance testing & model evaluation
Learn how to set up and run OpenAI o3-mini via the API, explore its flexible reasoning effort settings, and compare its performance against DeepSeek-R1 using W&B Weave Evaluations.
Building the world's fastest chatbot with the Cerebras Cloud API and W&B Weave
A guide to getting started using the Cerebras Cloud API with W&B Weave.
6 "gotchas" in machine learning—and how to avoid them
ML is hard and you can't plan for everything. Here are a few things I've learned and a few tips to avoid common missteps
Building a RAG-Based Digital Restaurant Menu with LlamaIndex and W&B Weave
Using RAG, we'll transform the traditional PDF restaurant menu into an AI-powered interactive menu!
Fine-Tuning Mistral 7B on Python Code With a Single GPU!
A tutorial for fine-tuning Mistral 7B on Python code using a single GPU!
How to Fine-Tune LLaVA on a Custom Dataset
A tutorial for fine-tuning LLaVA on your own data!
A Guide to DeepSpeed Zero With the HuggingFace Trainer
A guide for making the most out of your GPUs!
How to Run Mistral-7B on an M1 Mac With Ollama
Ever wanted to run Mistral 7B on your MacBook? In this tutorial I show you how!
New Method For LLM Quantization
A new quantization method that plays nicely with CUDA
AI Expert Speculates on GPT-4 Architecture
George Hotz offers his thoughts on the secretive architecture of OpenAI's GPT-4
Scaling Llama 2 to 32k Tokens With LongLoRA
The need for LLMs that can digest long content is becoming increasingly important. Go beyond 4096 tokens with LongLoRA!
Testing Mistral 7B vs. Zephyr 7B on HumanEval: Which Model Writes Better Code?
Putting some of the best 7B parameter models to the test on the HumanEval benchmark!
Testing Mixtral 8x7B with MMLU and W&B
There's a new LLM on the block, and it isn't from OpenAI! In this article, we run Mixtral 8x7B through its paces with the MMLU dataset and Weights & Biases.
A Gentle Introduction to Diffusion
Diffusion models power some of the world's most advanced image generation systems. We dive into the theory and code behind how they work, and train one of our own.
Fine-Tuning a Legal Copilot Using Azure OpenAI and W&B
Building a tool to demystify the intricacies of legal contracts with AI!
Building a Virtual Assistant with Google Gemini Function Calling
This article delves into the challenges and solutions for enabling Google Gemini to provide up-to-date information, such as sports schedules, by interfacing with external APIs.
Skin Lesion Classification on HAM10000 with HuggingFace using PyTorch and W&B
Explore the use of HuggingFace, PyTorch, and W&B for classifying skin lesions with the HAM10000 dataset. We will build, train, and evaluate models for medical diagnostics!
Creating predictive models to assess the risk of mortgage clients
My top tips for competing in Kaggle Challenges like the Home Credit Risk Model Stability Challenge.
Getting Started with Deep Q-Learning
Diving deep into the world of RL, we will unpack all of the details of deep Q-learning, and train a model on the OpenAI Gym CartPole environment!
Some Details on OpenAI's Sora and Diffusion Transformers
Combining some of the most promising architectures, OpenAI shows off its newest model!
I put GPT2-chatbot’s coding skills to the test
A new model has shown up on lmsys, and it looks a lot like GPT-4!
Leveraging synthetic data for tabular financial fraud detection
A guide on overcoming data scarcity for fraud detection
Mastering K-Fold Cross-Validation
A guide to K-Fold cross-validation, a fundamental evaluation method in machine learning.
Creating videos from static images with Stable Video Diffusion
The model known for generating images has been upgraded to handle video! We will cover the basics of the model, and also generate some sample videos!
A survey of financial datasets for machine learning
An overview of popular datasets used for ML in finance!
Introducing VisionLLM: A New Method for Multi-Modal LLMs
Humans are multimodal — so shouldn't AI be the same? Here, we discuss the future of vision models, and the recent progress that's been made in multi-modal LLMs.
Claude 3.5 Sonnet on Vertex AI: Python quickstart
Here's how to get up and running with the newest model from Anthropic
Self-Supervised Image Recognition with I-JEPA
How to handle training divergences with our new rewind feature
Easily handle common LLM training failures with Weights & Biases rewind!
What is Backpropagation?
A guide to one of the most fundamental tools in machine learning.
Getting started fine-tuning with the Mistral API
How to fine-tune Mistral-Small using the Mistral API and W&B Weave.
How to fine-tune YOLOv9 on a custom dataset with W&B
Fine-tuning one of the best detection models!
A guide to using the Azure AI model inference API
Using the Microsoft Azure AI model inference API for serverless inference!
Supercharging LLM summarization
A guide to making the most of LLMs for summarization tasks
Vision fine-tuning GPT-4o on a custom dataset
Learn to vision fine-tune GPT-4o on a custom dataset, with evaluation and tracking.
Building reliable apps with GPT-4o and structured outputs
Learn how to enforce consistency in GPT-4o outputs and build reliable generative AI apps.
How to evaluate a Langchain RAG system with RAGAs
A guide to evaluating a Langchain RAG system with RAGAs and Weights & Biases.
Getting started with Apple MLX
A guide to Apple's new deep learning framework. Is it faster than PyTorch on the M1?
PHI and PII for healthcare in the world of AI
A practical guide to working safely with health data, with multiple approaches for handling PHI.
Building and evaluating a RAG system with DSPy and W&B Weave
A guide to building a RAG system with DSPy, and evaluating it with W&B Weave.
Ensembling and ensemble learning methods
We'll explore how to combine multiple models to create a more powerful AI model with ensemble learning.
Working with Pixtral Large for visual chart understanding
A battle between the open-source Pixtral Large and closed-source foundation models like Claude 3.5 Sonnet and GPT-4o Vision.
Evaluating LLMs on Amazon Bedrock
Discover how to use Amazon Bedrock in combination with W&B Weave to evaluate and compare Large Language Models (LLMs) for summarization tasks, leveraging Bedrock’s managed infrastructure and Weave’s advanced evaluation features.
LLaVA-o1: Advancing structured reasoning in vision-language models
Discover how LLaVA-o1 tackles reasoning challenges in multimodal AI with structured problem-solving. Learn about its dataset, capabilities, and performance analysis using W&B Weave.
Comparing GPT Models on Azure AI Foundry with W&B Weave
Learn how to compare and evaluate OpenAI’s GPT models on Azure with W&B Weave on text summarization tasks, leveraging Azure’s managed infrastructure and Weave’s customizable evaluation tools.
Combining open-source PII redaction with closed-model analysis in healthcare using Llama 3.1, MedSpacy and GPT-4o
A guide to PII redaction with AI, covering open-source tools, proprietary models, HIPAA compliance, and how logging supports secure data handling.
Securing your LLM applications against prompt injection attacks
We will focus on understanding prompt injection attacks in AI systems and explore effective strategies to defend against them!
GraphRAG: Enhancing LLMs with knowledge graphs for superior retrieval
This article introduces GraphRAG, a novel approach that combines knowledge graphs and hierarchical community detection to enable scalable, query-focused summarization and global sensemaking over large datasets.
Build a reliable GenAI search system with Gemini Grounding and Vertex AI
Boost generative AI accuracy with Gemini Grounding in Vertex AI. Learn data integration, custom grounding, and W&B Weave logging in this tutorial
LLM Evaluation on Google Vertex AI
This guide explores large language model (LLM) evaluation with Google Vertex AI and W&B Weave, focusing on comparing different Gemini models for text summarization!
AI guardrails: Understanding PII detection
This article highlights the importance of PII detection, covering methods like regex, Presidio, and transformers, plus evaluation with Weave to ensure accurate and adaptable data protection.
AI scorers: Evaluating AI-generated text with ROUGE
This article explores the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, a powerful tool used for evaluating the quality of AI-generated text
AI scorers: Evaluating AI-generated text with BLEU
This article breaks down BLEU, a key metric for evaluating machine-generated text, covering its mechanics, practical applications with Python and Weave, and its role in improving text generation systems.
AI guardrails: Bias scorers
This article explores bias in AI systems, the need for bias guardrails, detection models, and strategies to mitigate, monitor, and evaluate bias effectively.
AI guardrails: Toxicity scorers
This article explores the challenges of detecting and managing toxicity in AI systems, providing actionable strategies and tools to foster safer and more inclusive digital interactions.
AI guardrails: Coherence scorers
Coherence, a measure of clarity and logical consistency in AI-generated responses, is effectively evaluated and refined using Weave's comprehensive tools and comparison insights.
Monitoring Amazon Bedrock Agents with W&B Weave
Learn to build and monitor powerful AI agents with Amazon Bedrock and W&B Weave for automated workflows.
s1: Achieving Test-Time Scaling with Just 1,000 Examples
Agentic workflows: Getting started with AI Agents
Explore AI agent workflows for automating tasks with multi-agent systems and generative AI, including a tutorial to build a research assistant for AI summaries.
Exploring multi-agent AI systems
This project explores multi-agent AI systems, examining how multiple specialized agents collaborate to enhance decision-making, problem-solving, and automation across various domains.
Tutorial: Building AI agents with CrewAI
This guide explores how AI agents, powered by CrewAI, automate complex tasks with minimal human input by integrating adaptive workflows, real-time data analysis, and iterative improvements.
Autonomous AI Agents: Capabilities, challenges, and future trends
Learn how autonomous AI agents automate tasks with minimal supervision, their architecture, applications, risks, and how to build a HackerNews AI news reporter.
AI agents in retail and e-commerce
This article explores how AI agents are transforming retail by automating customer interactions, optimizing decision-making, and enhancing product recommendations using LLM-driven vector search.
Agentic RAG: Enhancing retrieval-augmented generation with AI agents
This article explores how agentic RAG enhances retrieval-augmented generation by using AI agents to dynamically refine search strategies, coordinate multiple data sources, and improve response accuracy.
Evaluating Claude 3.7 Sonnet: Performance, reasoning, and cost optimization
Experimenting with Anthropic's new flagship LLM, Claude 3.7 Sonnet!
Evaluating the new Gemini 2.5 Pro Experimental model
Gemini 2.5 Pro Experimental is Google's most advanced AI model to date, featuring multimodal input support, a massive 1 million-token context window, and the ability to solve complex problems.
Getting Started with MCP using OpenAI Agents
A practical walkthrough for building OpenAI Agents that use the Model Context Protocol (MCP) to access tools, files, and trace data via Weave.
Quantization-Aware Training (QAT): A step-by-step guide with PyTorch
A practical deep dive into quantization-aware training, covering how it works, why it matters, and how to implement it end-to-end.
Running inference and evaluating Llama 4 in Python
Deploy Llama 4 locally or via API with Python scripts. We test multimodal performance against GPT-4o on ChartQA and show how to debug and compare results using Weave.
What is Retrieval Augmented Thinking (RAT) and how does it work?
Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, improving efficiency, interpretability, and customization by using one model for structured thought and another for the final output.
Recommendation systems with collaborative filtering to accelerate time to market
A hands-on guide to building and comparing memory-based and model-based collaborative filtering systems to quickly evaluate recommendation strategies.
Sentiment classification with the Reddit Praw API and GPT-4o-mini
Learn how to build a Reddit sentiment analysis pipeline that uses GPT-4o-mini to extract opinions from real discussions across subreddits—filtering, summarizing, and classifying posts and comments at scale.
How the Agent2Agent (A2A) protocol enables seamless AI agent collaboration
The Agent2Agent (A2A) protocol is an open standard that enables autonomous AI agents to securely discover, communicate, and collaborate across platforms. Learn how it works, its core components, and how to implement it.
Getting started with reinforcement learning (with a Python tutorial)
A hands-on introduction to reinforcement learning, explaining how agents learn from interaction to solve complex decision-making problems - plus practical implementations of Deep Q-learning and Actor-Critic methods in Python.
Evaluating your MCP and A2A agents with W&B Weave
Building and evaluating AI agents with Azure AI Foundry Agent Service and W&B Weave
A hands-on guide to building and evaluating real-time, tool-using AI agents with Azure AI Foundry Agent Service, SerpAPI, and W&B Weave.
Getting started with Claude Sonnet 4 and Claude Opus 4
Getting set up and running Anthropic's new Claude Sonnet 4 and Claude Opus 4 models in Python using the API.
How to evaluate a RAG system using synthetic data with LLM2Vec and W&B
Exploring multi-agent reinforcement learning (MARL)
This article provides a practical introduction to multi-agent reinforcement learning (MARL), explaining its theoretical foundations, key algorithms, and frameworks. It showcases a custom-coded multi-agent Pong environment with self-play DQN agents to illustrate the opportunities and challenges of training AI agents that interact, compete, or cooperate within shared environments.
Testing Claude 4 vs. Codex vs. Gemini 2.5 Pro on CodeContests
Putting Claude 4 Sonnet and Opus to the test against Codex and Gemini 2.5 Pro on CodeContests problems.
RAG vs. prompt stuffing: Do we still need vector retrieval?
An exploration of whether vector retrieval still makes sense now that LLMs can handle massive context windows.
Google Agent Development Kit (ADK): A hands-on tutorial
Google’s Agent Development Kit (ADK) is a modern, modular Python framework for building, orchestrating, and tracing sophisticated AI-powered agents and workflows - supporting diverse models and tools and designed for easy integration, extensibility, and production observability.
The Google GenAI SDK: A guide with a Python tutorial
Google’s GenAI SDK provides developers with a unified, flexible toolkit to seamlessly integrate advanced generative AI capabilities—including text, image, and video processing—into their applications using the latest Gemini models.
Reinforcement learning for reasoning: Enhancing AI capabilities
Explore how reinforcement learning with verifiable rewards (RLVR) and GRPO shape LLM reasoning gains, where true improvements lie, plus a practical GRPO training guide.
Types of reinforcement learning algorithms
Explore how reinforcement learning helps AI learn from trial and error, with key algorithms, methods like RLHF, and real-world applications.
Getting started with the Agent Reinforcement Trainer (ART)
This article introduces ART, an open-source framework that simplifies applying reinforcement learning to large language models. It enables developers to efficiently train and deploy LLM agents that continuously improve through experience, demonstrated with practical code examples including a tic-tac-toe game.
What is RLHF? Reinforcement learning from human feedback for AI alignment
This article explains how reinforcement learning from human feedback (RLHF) is used to train language models that better reflect human preferences, including practical steps, code examples, and evaluation techniques.
Defending against MCP prompt injection attacks
AI agents are revolutionizing software automation by leveraging protocols like MCP to directly access databases and developer tools. But this new power comes with significant security risks: unchecked user input combined with elevated permissions can lead to catastrophic data breaches.
Adding observability and tracing to your Bedrock AgentCore Agents
A practical guide to running production AI agents on AWS AgentCore and wiring them to Weave for step-by-step observability, with examples showing setup, instrumentation, and trace-driven debugging.
Evaluating Google ADK Agents with W&B Weave for reliable insurance workflows
This article provides a practical, end-to-end walkthrough of building, testing, and evaluating an AI insurance agent using the Agent Development Kit (ADK) and W&B Weave.
Tutorials: GPT-5 evaluation across multiple tasks
These tutorials cover how to evaluate GPT-5's image generation, coding, and automated debugging capabilities using W&B Weave.