Reports
Evaluating o4-mini vs. Claude 3.7 vs. Gemini 2.5 Pro on code generation
A real-world head-to-head test of Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on competitive programming problems—built on a custom execution framework with Weave integration to track correctness, spot bugs, and cut through benchmark hype.
Building a GitHub repo summarizer with CrewAI
A hands-on guide to building a fully automated GitHub documentation system using CrewAI for multi-agent coordination and Weave for real-time debugging and observability.
How to fine-tune and evaluate Qwen3 with Unsloth
This article provides a comprehensive guide to fine-tuning, evaluating, and deploying the Qwen3 language model, emphasizing its flexibility, performance, and unique reasoning-toggle feature.
The Model Context Protocol (MCP): A guide for AI integration
This guide explores how MCP standardizes AI interactions with external tools and data sources, enabling more efficient AI context integrations.
Training GPT-4o to reason: Fine-tuning vs budget forcing
Can fine-tuning and budget forcing improve GPT-4o’s reasoning? We test structured datasets and inference-time techniques to boost multi-step problem-solving.
Budget forcing s1-32B: Waiting is all you need?
We test whether budget forcing - a simple test-time intervention - can significantly boost the reasoning accuracy of s1-32B, potentially enabling smaller models to rival closed-source giants like OpenAI's o1-preview.
DeepSeek-R1 vs OpenAI o1: A guide to reasoning model setup and evaluation
Discover the capabilities of DeepSeek-R1 and OpenAI o1 models for reasoning and decision-making. Includes setup guides, API usage, local deployment, and Weave-powered comparisons.
How to fine-tune a large language model (LLM)
Discover the process of fine-tuning large language models (LLMs) to enhance their performance for specific tasks or domains. Learn about methods, best practices, and challenges associated with LLM fine-tuning.
How to train and evaluate an LLM router
This tutorial explores LLM routers, inspired by the RouteLLM paper, covering training, evaluation, and practical use cases for managing LLMs effectively.
Grokking: Improved generalization through over-overfitting
One of the most mysterious phenomena in deep learning, grokking is the tendency of neural networks to improve generalization through sustained overfitting.
Training a KANFormer: KANs Are All You Need?
We will dive into a new experimental architecture, replacing the MLP layers in transformers with KAN layers!
Knowledge distillation: Teaching LLMs with synthetic data
Unlock the power of knowledge distillation by learning how to efficiently transfer complex model insights from teacher to student models, step by step.
How to fine-tune Phi-3 Vision on a custom dataset
Here's how to fine-tune a state-of-the-art multimodal LLM on a custom dataset.
Building a real-time answer engine with Llama 3.1 405B and W&B Weave
Infusing Llama 3.1 405B with internet search capabilities!
YOLOv9 object detection tutorial
How to use one of the world's fastest and most accurate object detectors to run inference, display results on your webcam using OpenCV, and track your results.
Fine-Tuning Llama-3 with LoRA: TorchTune vs HuggingFace
A battle between HuggingFace and TorchTune!
Building an LLM Python debugger agent with the new Claude 3.5 Sonnet
Building an AI-powered coding agent with Claude 3.5 Sonnet!
o3-mini vs. DeepSeek-R1: API setup, performance testing & model evaluation
Learn how to set up and run OpenAI o3-mini via the API, explore its flexible reasoning effort settings, and compare its performance against DeepSeek-R1 using W&B Weave Evaluations.
Building the world's fastest chatbot with the Cerebras Cloud API and W&B Weave
A guide to getting started using the Cerebras Cloud API with W&B Weave.
6 "gotchas" in machine learning—and how to avoid them
ML is hard and you can't plan for everything. Here are a few things I've learned and a few tips to avoid common missteps
Building a RAG-Based Digital Restaurant Menu with LlamaIndex and W&B Weave
Using RAG, we'll transform the traditional PDF restaurant menu into an AI-powered interactive menu!
Fine-Tuning Mistral 7B on Python Code With a Single GPU!
A tutorial for fine-tuning Mistral 7B on Python code using a single GPU!
How to Fine-Tune LLaVA on a Custom Dataset
A tutorial for fine-tuning LLaVA on your own data!
A Guide to DeepSpeed Zero With the HuggingFace Trainer
A guide for making the most out of your GPUs!
How to Run Mistral-7B on an M1 Mac With Ollama
Ever wanted to run Mistral 7B on your MacBook? In this tutorial I show you how!
New Method For LLM Quantization
A new quantization method that plays nicely with CUDA
AI Expert Speculates on GPT-4 Architecture
George Hotz offers his thoughts on the secretive architecture of OpenAI's GPT-4
Scaling Llama 2 to 32k Tokens With LongLoRA
The need for LLMs that can digest long content is becoming increasingly important. Go beyond 4096 tokens with LongLoRA!
Testing Mistral 7B vs. Zephyr 7B on HumanEval: Which Model Writes Better Code?
Putting some of the best 7B parameter models to the test on the HumanEval benchmark!
Testing Mixtral 8x7B with MMLU and W&B
There's a new LLM on the block, and it isn't from OpenAI! In this article, we run Mixtral 8x7B through its paces with the MMLU dataset and Weights & Biases.
A Gentle Introduction to Diffusion
Diffusion models power some of the world's most advanced image generation systems. We dive into the theory and code behind how they work, and train one of our own.
Fine-Tuning a Legal Copilot Using Azure OpenAI and W&B
Building a tool to demystify the intricacies of legal contracts with AI!
Building a Virtual Assistant with Google Gemini Function Calling
This article delves into the challenges and solutions for enabling Google Gemini to provide up-to-date information, such as sports schedules, by interfacing with external APIs.
Skin Lesion Classification on HAM10000 with HuggingFace using PyTorch and W&B
Explore the use of HuggingFace, PyTorch, and W&B for classifying skin lesions with the HAM10000 dataset. We will build, train, and evaluate models for medical diagnostics!
Creating predictive models to assess the risk of mortgage clients
My top tips for competing in Kaggle Challenges like the Home Credit Risk Model Stability Challenge.
Getting Started with Deep Q-Learning
Diving deep into the world of RL, we will unpack all of the details of deep Q-learning, and train a model on the OpenAI Gym CartPole environment!
Some Details on OpenAI's Sora and Diffusion Transformers
Combining some of the most promising architectures, OpenAI shows off its newest model!
I put GPT2-chatbot’s coding skills to the test
A new model has shown up on lmsys, and it looks a lot like GPT-4!
Leveraging synthetic data for tabular financial fraud detection
A guide on overcoming data scarcity for fraud detection
Mastering K-Fold Cross-Validation
A guide to K-Fold cross-validation, a fundamental evaluation method in machine learning.
Creating videos from static images with Stable Video Diffusion
The model known for generating images has been upgraded to handle video! We will cover the basics of the model, and also generate some sample videos!
A survey of financial datasets for machine learning
An overview of popular datasets used for ML in finance!
Introducing VisionLLM: A New Method for Multi-Modal LLMs
Humans are multimodal — so shouldn't AI be the same? Here, we discuss the future of vision models, and the recent progress that's been made in multi-modal LLMs.
Claude 3.5 Sonnet on Vertex AI: Python quickstart
Here's how to get up and running with the newest model from Anthropic
Self-Supervised Image Recognition with I-JEPA
How to handle training divergences with our new rewind feature
Easily handle common LLM training failures with Weights & Biases rewind!
What is Backpropagation?
A guide to one of the most fundamental tools in machine learning.
Getting started fine-tuning with the Mistral API
How to fine-tune Mistral-Small using the Mistral API and W&B Weave.
How to fine-tune YOLOv9 on a custom dataset with W&B
Fine-tuning one of the best detection models!
A guide to using the Azure AI model inference API
Using the Microsoft Azure AI model inference API for serverless inference!
Supercharging LLM summarization
A guide to making the most of LLMs for summarization tasks
Vision fine-tuning GPT-4o on a custom dataset
Learn to vision fine-tune GPT-4o on a custom dataset, with evaluation and tracking.
Building reliable apps with GPT-4o and structured outputs
Learn how to enforce consistency in GPT-4o outputs and build reliable generative AI apps.
How to evaluate a Langchain RAG system with RAGAs
A guide to evaluating a Langchain RAG system with RAGAs and Weights & Biases.
Getting started with Apple MLX
A guide to Apple's new deep learning framework. Is it faster than PyTorch on the M1?
PHI and PII for healthcare in the world of AI
A practical guide to working safely with health data, with multiple approaches for handling PHI.
Building and evaluating a RAG system with DSPy and W&B Weave
A guide to building a RAG system with DSPy, and evaluating it with W&B Weave.
Ensembling and ensemble learning methods
We'll explore how to combine multiple models to create a more powerful AI model with ensemble learning.
Working with Pixtral Large for visual chart understanding
A battle between the open-source Pixtral Large and closed-source foundation models like Claude 3.5 Sonnet and GPT-4o Vision.
Evaluating LLMs on Amazon Bedrock
Discover how to use Amazon Bedrock in combination with W&B Weave to evaluate and compare Large Language Models (LLMs) for summarization tasks, leveraging Bedrock’s managed infrastructure and Weave’s advanced evaluation features.
LLaVA-o1: Advancing structured reasoning in vision-language models
Discover how LLaVA-o1 tackles reasoning challenges in multimodal AI with structured problem-solving. Learn about its dataset, capabilities, and performance analysis using W&B Weave.
Comparing GPT Models on Azure AI Foundry with W&B Weave
Learn how to compare and evaluate OpenAI’s GPT models on Azure with W&B Weave on text summarization tasks, leveraging Azure’s managed infrastructure and Weave’s customizable evaluation tools.
Combining open-source PII redaction with closed-model analysis in healthcare using Llama 3.1, MedSpacy and GPT-4o
A guide to PII redaction with AI, covering open-source tools, proprietary models, HIPAA compliance, and how logging supports secure data handling.
Securing your LLM applications against prompt injection attacks
We will focus on understanding prompt injection attacks in AI systems and explore effective strategies to defend against them!
GraphRAG: Enhancing LLMs with knowledge graphs for superior retrieval
This article introduces GraphRAG, a novel approach that combines knowledge graphs and hierarchical community detection to enable scalable, query-focused summarization and global sensemaking over large datasets.
Build a reliable GenAI search system with Gemini Grounding and Vertex AI
Boost generative AI accuracy with Gemini Grounding in Vertex AI. Learn data integration, custom grounding, and W&B Weave logging in this tutorial
LLM Evaluation on Google Vertex AI
This guide explores large language model (LLM) evaluation with Google Vertex AI and W&B Weave, focusing on comparing different Gemini models for text summarization!
AI guardrails: Understanding PII detection
This article highlights the importance of PII detection, covering methods like regex, Presidio, and transformers, plus evaluation with Weave to ensure accurate and adaptable data protection.
AI scorers: Evaluating AI-generated text with ROUGE
This article explores the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, a powerful tool used for evaluating the quality of AI-generated text
AI scorers: Evaluating AI-generated text with BLEU
This article breaks down BLEU, a key metric for evaluating machine-generated text, covering its mechanics, practical applications with Python and Weave, and its role in improving text generation systems.
AI guardrails: Bias scorers
This article explores bias in AI systems, the need for bias guardrails, detection models, and strategies to mitigate, monitor, and evaluate bias effectively.
AI guardrails: Toxicity scorers
This article explores the challenges of detecting and managing toxicity in AI systems, providing actionable strategies and tools to foster safer and more inclusive digital interactions.
AI guardrails: Coherence scorers
Coherence, a measure of clarity and logical consistency in AI-generated responses, is effectively evaluated and refined using Weave's comprehensive tools and comparison insights.
Monitoring Amazon Bedrock Agents with W&B Weave
Learn to build and monitor powerful AI agents with Amazon Bedrock and W&B Weave for automated workflows.
s1: Achieving Test-Time Scaling with Just 1,000 Examples
Agentic workflows: Getting started with AI Agents
Explore AI agent workflows for automating tasks with multi-agent systems and generative AI, including a tutorial to build a research assistant for AI summaries.
Exploring multi-agent AI systems
This project explores multi-agent AI systems, examining how multiple specialized agents collaborate to enhance decision-making, problem-solving, and automation across various domains.
Tutorial: Building AI agents with CrewAI
This guide explores how AI agents, powered by CrewAI, automate complex tasks with minimal human input by integrating adaptive workflows, real-time data analysis, and iterative improvements.
Autonomous AI Agents: Capabilities, challenges, and future trends
Learn how autonomous AI agents automate tasks with minimal supervision, their architecture, applications, risks, and how to build a HackerNews AI news reporter.
AI agents in retail and e-commerce
This article explores how AI agents are transforming retail by automating customer interactions, optimizing decision-making, and enhancing product recommendations using LLM-driven vector search.
Agentic RAG: Enhancing retrieval-augmented generation with AI agents
This article explores how agentic RAG enhances retrieval-augmented generation by using AI agents to dynamically refine search strategies, coordinate multiple data sources, and improve response accuracy.
Evaluating Claude 3.7 Sonnet: Performance, reasoning, and cost optimization
Experimenting with Anthropic's new flagship LLM, Claude 3.7 Sonnet!
Evaluating the new Gemini 2.5 Pro Experimental model
Gemini 2.5 Pro Experimental is Google's most advanced AI model to date, featuring multimodal input support, a massive 1 million-token context window, and the ability to solve complex problems.
Getting Started with MCP using OpenAI Agents
A practical walkthrough for building OpenAI Agents that use the Model Context Protocol (MCP) to access tools, files, and trace data via Weave.
Quantization-Aware Training (QAT): A step-by-step guide with PyTorch
A practical deep dive into quantization-aware training, covering how it works, why it matters, and how to implement it end-to-end.
Running inference and evaluating Llama 4 in Python
Deploy Llama 4 locally or via API with Python scripts. We test multimodal performance against GPT-4o on ChartQA and show how to debug and compare results using Weave.
What is Retrieval Augmented Thinking (RAT) and how does it work?
Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, improving efficiency, interpretability, and customization by using one model for structured thought and another for the final output.
Recommendation systems with collaborative filtering to accelerate time to market
A hands-on guide to building and comparing memory-based and model-based collaborative filtering systems to quickly evaluate recommendation strategies.
Sentiment classification with the Reddit Praw API and GPT-4o-mini
Learn how to build a Reddit sentiment analysis pipeline that uses GPT-4o-mini to extract opinions from real discussions across subreddits—filtering, summarizing, and classifying posts and comments at scale.
How the Agent2Agent (A2A) protocol enables seamless AI agent collaboration
The Agent2Agent (A2A) protocol is an open standard that enables autonomous AI agents to securely discover, communicate, and collaborate across platforms. Learn how it works, its core components, and how to implement it.
Getting started with reinforcement learning (with a Python tutorial)
A hands-on introduction to reinforcement learning, explaining how agents learn from interaction to solve complex decision-making problems - plus practical implementations of Deep Q-learning and Actor-Critic methods in Python.
Evaluating your MCP and A2A agents with W&B Weave
Building and evaluating AI agents with Azure AI Foundry Agent Service and W&B Weave
A hands-on guide to building and evaluating real-time, tool-using AI agents with Azure AI Foundry Agent Service, SerpAPI, and W&B Weave.
Getting started with Claude Sonnet 4 and Claude Opus 4
Getting set up and running Anthropic's new Claude Sonnet 4 and Claude Opus 4 models in Python using the API.
How to evaluate a RAG system using synthetic data with LLM2Vec and W&B
Exploring multi-agent reinforcement learning (MARL)
This article provides a practical introduction to multi-agent reinforcement learning (MARL), explaining its theoretical foundations, key algorithms, and frameworks. It showcases a custom-coded multi-agent Pong environment with self-play DQN agents to illustrate the opportunities and challenges of training AI agents that interact, compete, or cooperate within shared environments.
Testing Claude 4 vs. Codex vs. Gemini 2.5 Pro on CodeContests
Putting Claude 4 Sonnet and Opus to the test against Codex and Gemini 2.5 Pro on CodeContests problems.
RAG vs. prompt stuffing: Do we still need vector retrieval?
An exploration of whether vector retrieval still makes sense now that LLMs can handle massive context windows.
Google Agent Development Kit (ADK): A hands-on tutorial
Google’s Agent Development Kit (ADK) is a modern, modular Python framework for building, orchestrating, and tracing sophisticated AI-powered agents and workflows - supporting diverse models and tools and designed for easy integration, extensibility, and production observability.
The Google GenAI SDK: A guide with a Python tutorial
Google’s GenAI SDK provides developers with a unified, flexible toolkit to seamlessly integrate advanced generative AI capabilities—including text, image, and video processing—into their applications using the latest Gemini models.
Reinforcement learning for reasoning: Enhancing AI capabilities
Explore how reinforcement learning with verifiable rewards (RLVR) and GRPO shape LLM reasoning gains, where true improvements lie, plus a practical GRPO training guide.
Types of reinforcement learning algorithms
Explore how reinforcement learning helps AI learn from trial and error, with key algorithms, methods like RLHF, and real-world applications.
Getting started with the Agent Reinforcement Trainer (ART)
This article introduces ART, an open-source framework that simplifies applying reinforcement learning to large language models. It enables developers to efficiently train and deploy LLM agents that continuously improve through experience, demonstrated with practical code examples including a tic-tac-toe game.
What is RLHF? Reinforcement learning from human feedback for AI alignment
This article explains how reinforcement learning from human feedback (RLHF) is used to train language models that better reflect human preferences, including practical steps, code examples, and evaluation techniques.
Defending against MCP prompt injection attacks
AI agents are revolutionizing software automation by leveraging protocols like MCP to directly access databases and developer tools. But this new power comes with significant security risks: unchecked user input combined with elevated permissions can lead to catastrophic data breaches.
Adding observability and tracing to your Bedrock AgentCore Agents
A practical guide to running production AI agents on AWS AgentCore and wiring them to Weave for step-by-step observability, with examples showing setup, instrumentation, and trace-driven debugging.
Evaluating Google ADK Agents with W&B Weave for reliable insurance workflows
This article provides a practical, end-to-end walkthrough of building, testing, and evaluating an AI insurance agent using the Agent Development Kit (ADK) and W&B Weave.
Tutorials: GPT-5 evaluation across multiple tasks
These tutorials cover how to evaluate GPT-5's image generation, coding, and automated debugging capabilities using W&B Weave.