
Brett Young

byyoung3
bdytx5.github.io/brettyoung/
bdytx5

Teams

Reports
Evaluating o4-mini vs. Claude 3.7 vs. Gemini 2.5 Pro on code generation
A real-world head-to-head test of Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on competitive programming problems—built on a custom execution framework with Weave integration to track correctness, spot bugs, and cut through benchmark hype.
3248 views
Last edit 3 months ago
Building a GitHub repo summarizer with CrewAI
A hands-on guide to building a fully automated GitHub documentation system using CrewAI for multi-agent coordination and Weave for real-time debugging and observability.
694 views
Last edit 3 months ago
How to fine-tune and evaluate Qwen3 with Unsloth
This article provides a comprehensive guide to fine-tuning, evaluating, and deploying the Qwen3 language model, emphasizing its flexibility, performance, and unique reasoning-toggle feature.
2748 views
Last edit 3 months ago
The Model Context Protocol (MCP): A guide for AI integration
This guide explores how MCP standardizes AI interactions with external tools and data sources, enabling more efficient AI context integrations.
4043 views
Last edit 4 months ago
Training GPT-4o to reason: Fine-tuning vs budget forcing
Can fine-tuning and budget forcing improve GPT-4o’s reasoning? We test structured datasets and inference-time techniques to boost multi-step problem-solving.
795 views
Last edit 6 months ago
Budget forcing s1-32B: Waiting is all you need?
We test whether budget forcing - a simple test-time intervention - can significantly boost the reasoning accuracy of s1-32B, potentially enabling smaller models to rival closed-source giants like OpenAI's o1-preview.
1474 views
Last edit 6 months ago
DeepSeek-R1 vs OpenAI o1: A guide to reasoning model setup and evaluation
Discover the capabilities of DeepSeek-R1 and OpenAI o1 models for reasoning and decision-making. Includes setup guides, API usage, local deployment, and Weave-powered comparisons.
2036 views
Last edit 6 months ago
How to fine-tune a large language model (LLM)
Discover the process of fine-tuning large language models (LLMs) to enhance their performance for specific tasks or domains. Learn about methods, best practices, and challenges associated with LLM fine-tuning.
4277 views
Last edit 8 months ago
How to train and evaluate an LLM router
This tutorial explores LLM routers, inspired by the RouteLLM paper, covering training, evaluation, and practical use cases for managing LLMs effectively.
1494 views
Last edit 11 months ago
Grokking: Improved generalization through over-overfitting
One of the most mysterious phenomena in deep learning, grokking is the tendency of neural networks to improve generalization through sustained overfitting.
1769 views
Last edit 1 year ago
Training a KANFormer: KANs Are All You Need?
We will dive into a new experimental architecture, replacing the MLP layers in transformers with KAN layers!
2640 views
Last edit 1 year ago
Knowledge distillation: Teaching LLMs with synthetic data
Unlock the power of knowledge distillation by learning how to efficiently transfer complex model insights from teacher to student models, step by step.
1533 views
Last edit 11 months ago
How to fine-tune Phi-3 Vision on a custom dataset
Here's how to fine-tune a state-of-the-art multimodal LLM on a custom dataset
8295 views
Last edit 10 months ago
Building a real-time answer engine with Llama 3.1 405B and W&B Weave
Infusing Llama 3.1 405B with internet search capabilities!
2404 views
Last edit 1 year ago
YOLOv9 object detection tutorial
How to use one of the world's fastest and most accurate object detectors to run inference, display the output from your webcam using OpenCV, and track your results.
1550 views
Last edit 11 months ago
Fine-Tuning Llama-3 with LoRA: TorchTune vs HuggingFace
A battle between HuggingFace and TorchTune!
14625 views
Last edit 9 months ago
Building an LLM Python debugger agent with the new Claude 3.5 Sonnet
Building an AI-powered coding agent with Claude 3.5 Sonnet!
605 views
Last edit 5 months ago
o3-mini vs. DeepSeek-R1: API setup, performance testing & model evaluation
Learn how to set up and run OpenAI o3-mini via the API, explore its flexible reasoning effort settings, and compare its performance against DeepSeek-R1 using W&B Weave Evaluations.
770 views
Last edit 6 months ago
Building the world's fastest chatbot with the Cerebras Cloud API and W&B Weave
A guide to getting started using the Cerebras Cloud API with W&B Weave.
375 views
Last edit 11 months ago
6 "gotchas" in machine learning—and how to avoid them
ML is hard and you can't plan for everything. Here are a few things I've learned and a few tips to avoid common missteps
367 views
Last edit 1 year ago
Building a RAG-Based Digital Restaurant Menu with LlamaIndex and W&B Weave
Powered by RAG, we will transform the traditional restaurant PDF menu into an AI-powered interactive menu!
15878 views
Last edit 5 months ago
Fine-Tuning Mistral7B on Python Code With A Single GPU!
A tutorial for fine-tuning Mistral7B on Python Code using a single GPU!
21416 views
Last edit 1 year ago
How to Fine-Tune LLaVA on a Custom Dataset
A tutorial for fine-tuning LLaVA on your own data!
28456 views
Last edit 1 year ago
A Guide to DeepSpeed Zero With the HuggingFace Trainer
A guide for making the most out of your GPUs!
8059 views
Last edit 1 year ago
How to Run Mistral-7B on an M1 Mac With Ollama
Ever wanted to run Mistral 7B on your MacBook? In this tutorial I show you how!
26924 views
Last edit 1 year ago
New Method For LLM Quantization
A new quantization method that plays nicely with CUDA
2734 views
Last edit 2 years ago
AI Expert Speculates on GPT-4 Architecture
George Hotz offers his thoughts on the secretive architecture of OpenAI's GPT-4
4251 views
Last edit 2 years ago
Scaling Llama 2 to 32k Tokens With LongLora
The need for LLMs that can digest long content is becoming increasingly important. Go beyond 4096 tokens with LongLora!
4083 views
Last edit 1 year ago
Testing Mistral 7B vs. Zephyr 7B on HumanEval: Which Model Writes Better Code?
Putting some of the best 7B parameter models to the test on the HumanEval benchmark!
3832 views
Last edit 1 year ago
Testing Mixtral 8x7B with MMLU and W&B
There's a new LLM on the block, and it isn't from OpenAI! In this article, we run Mixtral 8x7B through its paces with the MMLU dataset and Weights & Biases.
2632 views
Last edit 1 year ago
A Gentle Introduction to Diffusion
Diffusion powers some of the world's most advanced image generation models. We will dive into the theory and code behind how diffusion models work, and how to train one of our own.
1813 views
Last edit 1 year ago
Fine-Tuning a Legal Copilot Using Azure OpenAI and W&B
Building a tool to demystify the intricacies of legal contracts with AI!
2400 views
Last edit 1 year ago
Building a Virtual Assistant with Google Gemini Function Calling
This article delves into the challenges and solutions for enabling Google Gemini to provide up-to-date information, such as sports schedules, by interfacing with external APIs.
1927 views
Last edit 1 year ago
Skin Lesion Classification on HAM10000 with HuggingFace using PyTorch and W&B
Explore the use of HuggingFace, PyTorch, and W&B for classifying skin lesions with the HAM10000 dataset. We will build, train, and evaluate models for medical diagnostics!
1058 views
Last edit 1 year ago
Creating predictive models to assess the risk of mortgage clients
My top tips for competing in Kaggle Challenges like the Home Credit Risk Model Stability Challenge.
491 views
Last edit 1 year ago
Getting Started with Deep Q-Learning
Diving deep into the world of RL, we will unpack all of the details of deep Q-learning, and train a model on the OpenAI Gym Cartpole Environment!
441 views
Last edit 1 year ago
Some Details on OpenAI's Sora and Diffusion Transformers
Combining some of the most promising architectures, OpenAI shows off its newest model!
1440 views
Last edit 1 year ago
I put GPT2-chatbot’s coding skills to the test
A new model has shown up on lmsys, and it looks a lot like GPT-4!
2032 views
Last edit 1 year ago
Leveraging synthetic data for tabular financial fraud detection
A guide on overcoming data scarcity for fraud detection
773 views
Last edit 1 year ago
Mastering K-Fold Cross-Validation
A guide on K-Fold cross-validation, a fundamental evaluation method in machine learning.
1017 views
Last edit 1 year ago
Creating videos from static images with Stable Video Diffusion
The model known for generating images has been upgraded to handle video! We will cover the basics of the model, and also generate some sample videos!
1000 views
Last edit 1 year ago
A survey of financial datasets for machine learning
An overview of popular datasets used for ML in finance!
761 views
Last edit 1 year ago
Introducing VisionLLM: A New Method for Multi-Modal LLMs
Humans are multimodal — so shouldn't AI be the same? Here, we discuss the future of vision models, and the recent progress that's been made in multi-modal LLMs.
1165 views
Last edit 2 years ago
Claude 3.5 Sonnet on Vertex AI: Python quickstart
Here's how to get up and running with the newest model from Anthropic
1423 views
Last edit 9 months ago
Self-Supervised Image Recognition with IJEPA
626 views
Last edit 1 year ago
How to handle training divergences with our new rewind feature
Easily handle common LLM training failures with Weights & Biases rewind!
403 views
Last edit 1 year ago
What is Backpropagation?
A guide to one of the most fundamental tools in machine learning.
235 views
Last edit 1 year ago
Getting started fine-tuning with the Mistral API
How to fine-tune Mistral-Small using the Mistral API and W&B Weave.
629 views
Last edit 1 year ago
How to fine-tune YOLO V9 on a custom dataset with W&B
Fine tuning one of the best detection models!
2635 views
Last edit 11 months ago
A guide to using the Azure AI model inference API
Using Microsoft Azure AI model inference API for serverless inference!
1105 views
Last edit 11 months ago
Supercharging LLM summarization
A guide to making the most of LLMs for summarization tasks
511 views
Last edit 11 months ago
Vision fine-tuning GPT-4o on a custom dataset
Learn to vision fine-tune GPT-4o on a custom dataset, with evaluation and tracking.
693 views
Last edit 10 months ago
Building reliable apps with GPT-4o and structured outputs
Learn how to enforce consistency on GPT-4o outputs, and build reliable Gen-AI Apps.
737 views
Last edit 10 months ago
How to evaluate a LangChain RAG system with RAGAs
A guide to evaluating a LangChain RAG system with RAGAs and Weights & Biases.
2016 views
Last edit 9 months ago
Getting started with Apple MLX
A guide to Apple's new deep learning framework. Is it faster than Torch on the M1?
2029 views
Last edit 9 months ago
PHI and PII for healthcare in the world of AI
A practical guide on working with health data, safely, with multiple approaches for handling PHI
284 views
Last edit 10 months ago
Building and evaluating a RAG system with DSPy and W&B Weave
A guide to building a RAG system with DSPy, and evaluating it with W&B Weave.
1075 views
Last edit 9 months ago
Ensembling and ensemble learning methods
We'll explore how to combine multiple models together in order to create a more powerful AI model with ensemble learning.
200 views
Last edit 9 months ago
Working with Pixtral Large for visual chart understanding
A battle between open-source Pixtral Large and closed-source foundation models like Claude 3.5 Sonnet and GPT-4o Vision
443 views
Last edit 9 months ago
Evaluating LLMs on Amazon Bedrock
Discover how to use Amazon Bedrock in combination with W&B Weave to evaluate and compare Large Language Models (LLMs) for summarization tasks, leveraging Bedrock’s managed infrastructure and Weave’s advanced evaluation features.
847 views
Last edit 5 months ago
LLaVA-o1: Advancing structured reasoning in vision-language models
Discover how LLaVA-o1 tackles reasoning challenges in multimodal AI with structured problem-solving. Learn about its dataset, capabilities, and performance analysis using W&B Weave.
417 views
Last edit 8 months ago
Comparing GPT Models on Azure AI Foundry with W&B Weave
Learn how to compare and evaluate OpenAI’s GPT models on Azure with W&B Weave on text summarization tasks, leveraging Azure’s managed infrastructure and Weave’s customizable evaluation tools.
236 views
Last edit 8 months ago
Combining open-source PII redaction with closed-model analysis in healthcare using Llama 3.1, MedSpacy and GPT-4o
A guide to PII redaction with AI, covering open-source tools, proprietary models, HIPAA compliance, and how logging supports secure data handling.
430 views
Last edit 8 months ago
Securing your LLM applications against prompt injection attacks
We will focus on understanding prompt injection attacks in AI systems and explore effective strategies to defend against them!
679 views
Last edit 8 months ago
GraphRAG: Enhancing LLMs with knowledge graphs for superior retrieval
This article introduces GraphRAG, a novel approach that combines knowledge graphs and hierarchical community detection to enable scalable, query-focused summarization and global sensemaking over large datasets.
890 views
Last edit 8 months ago
Build a reliable GenAI search system with Gemini Grounding and Vertex AI
Boost generative AI accuracy with Gemini Grounding in Vertex AI. Learn data integration, custom grounding, and W&B Weave logging in this tutorial
666 views
Last edit 8 months ago
LLM Evaluation on Google Vertex AI
This guide explores large language model (LLM) evaluation with Google Vertex AI and W&B Weave, focusing on comparing different Gemini models for text summarization!
790 views
Last edit 5 months ago
AI guardrails: Understanding PII detection
This article highlights the importance of PII, detection methods like regex, Presidio, and transformers, and evaluation with Weave to ensure accurate and adaptable data protection.
554 views
Last edit 5 months ago
AI scorers: Evaluating AI-generated text with ROUGE
This article explores the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, a powerful tool used for evaluating the quality of AI-generated text
660 views
Last edit 5 months ago
AI scorers: Evaluating AI-generated text with BLEU
This article breaks down BLEU, a key metric for evaluating machine-generated text, covering its mechanics, practical applications with Python and Weave, and its role in improving text generation systems.
795 views
Last edit 5 months ago
AI guardrails: Bias scorers
This article explores bias in AI systems, the need for bias guardrails, detection models, and strategies to mitigate, monitor, and evaluate bias effectively.
268 views
Last edit 5 months ago
AI guardrails: Toxicity scorers
This article explores the challenges of detecting and managing toxicity in AI systems, providing actionable strategies and tools to foster safer and more inclusive digital interactions.
330 views
Last edit 5 months ago
AI Guardrails: Coherence scorers
Coherence, a measure of clarity and logical consistency in AI-generated responses, is effectively evaluated and refined using Weave's comprehensive tools and comparison insights.
471 views
Last edit 5 months ago
Monitoring Amazon Bedrock Agents with W&B Weave
Learn to build and monitor powerful AI agents with Amazon Bedrock and W&B Weave for automated workflows.
464 views
Last edit 5 months ago
S1: Achieving Test-Time Scaling with Just 1,000 Examples
335 views
Last edit 6 months ago
Agentic workflows: Getting started with AI Agents
Explore AI agent workflows for automating tasks with multi-agent systems and generative AI, including a tutorial to build a research assistant for AI summaries.
1431 views
Last edit 6 months ago
Exploring multi-agent AI systems
This project explores multi-agent AI systems, examining how multiple specialized agents collaborate to enhance decision-making, problem-solving, and automation across various domains.
601 views
Last edit 5 months ago
Tutorial: Building AI agents with CrewAI
This guide explores how AI agents, powered by CrewAI, automate complex tasks with minimal human input by integrating adaptive workflows, real-time data analysis, and iterative improvements.
731 views
Last edit 5 months ago
Autonomous AI Agents: Capabilities, challenges, and future trends
Learn how autonomous AI agents automate tasks with minimal supervision, their architecture, applications, risks, and how to build a HackerNews AI news reporter.
399 views
Last edit 5 months ago
AI agents in retail and e-commerce
This article explores how AI agents are transforming retail by automating customer interactions, optimizing decision-making, and enhancing product recommendations using LLM-driven vector search.
186 views
Last edit 4 months ago
Agentic RAG: Enhancing retrieval-augmented generation with AI agents
This article explores how agentic RAG enhances retrieval-augmented generation by using AI agents to dynamically refine search strategies, coordinate multiple data sources, and improve response accuracy.
1247 views
Last edit 4 months ago
Evaluating Claude 3.7 Sonnet: Performance, reasoning, and cost optimization
Experimenting with Anthropic's new flagship LLM, Claude 3.7 Sonnet!
3509 views
Last edit 4 months ago
Evaluating the new Gemini 2.5 Pro Experimental model
Gemini 2.5 Pro Experimental is Google's most advanced AI model to date, featuring multimodal input support, a massive 1 million-token context window, and the ability to solve complex problems.
1647 views
Last edit 4 months ago
Getting Started with MCP using OpenAI Agents
A practical walkthrough for building OpenAI Agents that use the Model Context Protocol (MCP) to access tools, files, and trace data via Weave.
6089 views
Last edit 4 months ago
Quantization-Aware Training (QAT): A step-by-step guide with PyTorch
A practical deep dive into quantization-aware training, covering how it works, why it matters, and how to implement it end-to-end.
3565 views
Last edit 4 months ago
Running inference and evaluating Llama 4 in Python
Deploy Llama 4 locally or via API with Python scripts. We test multimodal performance against GPT-4o on ChartQA and show how to debug and compare results using Weave.
612 views
Last edit 4 months ago
What is Retrieval Augmented Thinking (RAT) and how does it work?
Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, improving efficiency, interpretability, and customization by using one model for structured thought and another for the final output.
256 views
Last edit 4 months ago
Recommendation systems with collaborative filtering to accelerate time to market
A hands-on guide to building and comparing memory-based and model-based collaborative filtering systems to quickly evaluate recommendation strategies.
199 views
Last edit 3 months ago
Sentiment classification with the Reddit Praw API and GPT-4o-mini
Learn how to build a Reddit sentiment analysis pipeline that uses GPT-4o-mini to extract opinions from real discussions across subreddits—filtering, summarizing, and classifying posts and comments at scale.
242 views
Last edit 4 months ago
How the Agent2Agent (A2A) protocol enables seamless AI agent collaboration
The Agent2Agent (A2A) protocol is an open standard that enables autonomous AI agents to securely discover, communicate, and collaborate across platforms. Learn how it works, its core components, and how to implement it.
1481 views
Last edit 1 month ago
Getting started with reinforcement learning (with a Python tutorial)
A hands-on introduction to reinforcement learning, explaining how agents learn from interaction to solve complex decision-making problems - plus practical implementations of Deep Q-learning and Actor-Critic methods in Python.
339 views
Last edit 3 months ago
Evaluating your MCP and A2A agents with W&B Weave
508 views
Last edit 1 day ago
Building and evaluating AI agents with Azure AI Foundry Agent Service and W&B Weave
A hands-on guide to building and evaluating real-time, tool-using AI agents with Azure AI Foundry Agent Service, SerpAPI, and W&B Weave.
393 views
Last edit 3 months ago
Getting started with Claude Sonnet 4 and Claude Opus 4
Getting set up and running Anthropic's new Claude 4 Sonnet and Opus on your machine in Python using the API.
511 views
Last edit 2 months ago
How to evaluate a RAG system using synthetic data with LLM2Vec and W&B
250 views
Last edit 1 year ago
Exploring multi-agent reinforcement learning (MARL)
This article provides a practical introduction to multi-agent reinforcement learning (MARL), explaining its theoretical foundations, key algorithms, and frameworks, and showcasing a custom-coded multi-agent Pong environment with self-play DQN agents to illustrate the opportunities and challenges of training AI agents that interact, compete, or cooperate within shared environments.
264 views
Last edit 3 months ago
Testing Claude 4 vs. Codex vs. Gemini 2.5 Pro on CodeContests
Putting Claude 4 Sonnet and Opus to the test
952 views
Last edit 2 months ago
RAG vs. prompt stuffing: Do we still need vector retrieval?
An exploration of whether vector retrieval still makes sense now that LLMs can handle massive context windows.
257 views
Last edit 1 month ago
Google Agent Development Kit (ADK): A hands-on tutorial
Google’s Agent Development Kit (ADK) is a modern, modular Python framework for building, orchestrating, and tracing sophisticated AI-powered agents and workflows - supporting diverse models and tools and designed for easy integration, extensibility, and production observability.
1146 views
Last edit 1 month ago
The Google GenAI SDK: A guide with a Python tutorial
Google’s GenAI SDK provides developers with a unified, flexible toolkit to seamlessly integrate advanced generative AI capabilities—including text, image, and video processing—into their applications using the latest Gemini models.
207 views
Last edit 1 month ago
Reinforcement learning for reasoning: Enhancing AI capabilities
Explore how reinforcement learning with verifiable rewards (RLVR) and GRPO shape LLM reasoning gains, where true improvements lie, plus a practical GRPO training guide.
177 views
Last edit 1 month ago
Types of reinforcement learning algorithms
Explore how reinforcement learning helps AI learn from trial and error, with key algorithms, methods like RLHF, and real-world applications.
198 views
Last edit 1 month ago
Getting started with the Agent Reinforcement Trainer (ART)
This article introduces ART, an open-source framework that simplifies applying reinforcement learning to large language models, enabling developers to efficiently train and deploy LLM agents that continuously improve through experience, as demonstrated with practical code examples including a tic-tac-toe game.
135 views
Last edit 28 days ago
What is RLHF? Reinforcement learning from human feedback for AI alignment
This article explains how reinforcement learning from human feedback (RLHF) is used to train language models that better reflect human preferences, including practical steps, code examples, and evaluation techniques.
186 views
Last edit 23 days ago
Defending against MCP prompt injection attacks
AI agents are revolutionizing software automation by leveraging protocols like MCP to directly access databases and developer tools, but this new power comes with significant security risks, as unchecked user input combined with elevated permissions can lead to catastrophic data breaches.
33 views
Last edit 16 days ago
Adding observability and tracing to your Bedrock AgentCore Agents
A practical guide to running production AI agents on AWS AgentCore and wiring them to Weave for step-by-step observability, with examples showing setup, instrumentation, and trace-driven debugging.
74 views
Last edit 7 days ago
Evaluating Google ADK Agents with W&B Weave for reliable insurance workflows
This article provides a practical, end-to-end walkthrough of building, testing, and evaluating an AI insurance agent using the Agent Development Kit (ADK) and W&B Weave.
86 views
Last edit 1 month ago
Tutorials: GPT-5 evaluation across multiple tasks
These tutorials cover how to evaluate GPT-5’s image generation, coding evals, and automated debugging using W&B Weave.
100 views
Last edit 7 days ago