YOLOv9 object detection tutorial

How to use one of the world's fastest and most accurate object detectors to run inference, visualize detections from your webcam with OpenCV, and track your results.
YOLO—short for "you only look once"—is a state-of-the-art object detection model recognized for its speed and accuracy. In this guide, we will explore the implementation of real-time object detection using the latest iteration, YOLOv9, combined with OpenCV for image processing.

What is YOLO?

YOLOv9 stands for You Only Look Once version 9, an exceptionally fast object detection framework built on a single convolutional neural network. Unlike traditional detection pipelines that scan an image region by region and classify each crop separately, YOLOv9 processes the entire image in a single pass, resulting in significantly faster detection speeds.
Released in February 2024, YOLOv9 introduces techniques such as Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN) to address information loss and computational efficiency issues in computer vision tasks. These innovations are why YOLOv9 achieves outstanding real-time object detection performance, setting new benchmarks for precision and speed.

These enhancements allow YOLOv9 to achieve a higher mean average precision than previous YOLO models like YOLOv8, YOLOv7, and YOLOv5 when benchmarked against the MS COCO dataset.

Getting started with YOLOv9

To get started, we'll need to install the following packages:
pip install opencv-python ultralytics pytubefix requests pillow
Next, it can be very helpful to have an NVIDIA GPU with CUDA installed on your system. That said, if you don't, that's OK too: YOLO runs just fine on a CPU, only more slowly.
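If you're not sure whether PyTorch can see your GPU, a quick check like the one below will tell you (ultralytics installs PyTorch as a dependency). Note that Ultralytics picks a device automatically, so this is purely diagnostic:

import torch

# Prints True if a CUDA-capable GPU is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())

# You can also pin inference to a device explicitly, e.g.:
# results = model.predict(img, device=0)      # first CUDA GPU
# results = model.predict(img, device="cpu")  # force CPU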
Here's a script that will run inference with YOLOv9:

from ultralytics import YOLO
import requests
from PIL import Image
from io import BytesIO

# Path to the YOLOv9 weights; the library will download them automatically
# if they aren't already at this location
model_path = "yolov9m.pt"

# Load the pretrained YOLOv9 model
model = YOLO(model_path)

# Download the image from the URL
image_url = "https://di-uploads-pod25.dealerinspire.com/koenigseggflorida/uploads/2019/08/Koenigsegg_TheSquad_3200x2000-UPDATED.jpg"
response = requests.get(image_url)
img = Image.open(BytesIO(response.content))

# Set the confidence and IoU thresholds
confidence_threshold = 0.5
iou_threshold = 0.4

# Predict with the model using the set thresholds
results = model.predict(img, conf=confidence_threshold, iou=iou_threshold)
results[0].show()
The simple inference script begins by importing the necessary libraries:
  • YOLO from the ultralytics library
  • requests
  • PIL for image processing
The YOLOv9 model is then loaded by specifying a model path. Importantly, this does not need to point to an existing file: the library will download the weights if they aren't already at the specified location. Once the model is loaded, it runs inference on a sample image, and the results, including detected objects and their bounding boxes, are displayed.
This script demonstrates the basic functionality of loading a pre-trained model and running it on a static image to detect objects.
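If you want the raw detections rather than a rendered image, the Results object exposes them directly. Here's a minimal sketch that reuses the results variable from the script above and prints each detection:

# Iterate over the detected boxes in the first (and only) result
for box in results[0].boxes:
    class_id = int(box.cls)                  # numeric class index
    class_name = results[0].names[class_id]  # human-readable label, e.g. "car"
    confidence = box.conf.item()             # confidence score in [0, 1]
    x1, y1, x2, y2 = box.xyxy.tolist()[0]    # corner coordinates in pixels
    print(f"{class_name}: {confidence:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")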


YOLOv9 inference hyperparameters

The confidence threshold and Intersection over Union (IoU) threshold are crucial parameters in YOLO object detection models.
The confidence threshold determines the minimum confidence score a detection must have to be considered valid. A higher confidence threshold reduces false positives, ensuring that only the most certain detections are shown. Conversely, lowering the threshold can increase the number of detections, including less certain ones, which might be useful in applications requiring high sensitivity.
The IoU threshold is used during non-maximum suppression (NMS) to eliminate redundant bounding boxes. IoU measures the overlap between two bounding boxes. A higher IoU threshold means the model tolerates more overlap before suppressing a box, so more closely spaced detections are kept. Lowering the IoU threshold makes NMS more aggressive, keeping only the most distinct bounding boxes.
The choice of these thresholds depends on the application's requirements. For instance, in security systems, high confidence and low IoU thresholds might be preferred to ensure only highly probable threats are detected with clear, non-overlapping bounding boxes. Conversely, in medical imaging, lower thresholds might be used to ensure no potential anomalies are missed.
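A quick way to build intuition for these two knobs is to run the same image through the model at a few different settings and compare how many detections survive. Here's a small sketch reusing the model and img from the inference script above:

# Sweep a few threshold combinations and count surviving detections
for conf in (0.25, 0.5, 0.75):
    for iou in (0.3, 0.5, 0.7):
        results = model.predict(img, conf=conf, iou=iou, verbose=False)
        print(f"conf={conf}, iou={iou}: {len(results[0].boxes)} detections")

You should see the detection count fall as the confidence threshold rises, while the IoU threshold mainly changes how many overlapping boxes are kept.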

Using YOLOv9 on your webcam

Additionally, you can use your webcam to visualize your model's performance in real time:
import cv2
from ultralytics import YOLO

# Path to the YOLOv9 weights; the library will download them if missing
model_path = "yolov9c.pt"

# Load the pretrained YOLOv9 model
model = YOLO(model_path)

# Initialize the webcam
# 0 is the default camera. If you have multiple cameras (e.g., an external
# one on your laptop), you may need to change this index.
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    print("Error: Could not open webcam.")
    exit()

while True:
    ret, frame = cap.read()
    if not ret:
        print("Error: Could not read frame.")
        break

    # Convert the frame to the format expected by YOLO
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Run the YOLO model on the frame
    results = model.predict(frame_rgb)

    # Draw bounding boxes on the frame
    for box in results[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy.tolist()[0])
        class_id = int(box.cls)
        class_name = results[0].names[class_id]
        confidence = box.conf.item()
        if confidence < 0.8:
            continue
        # Draw rectangle and label on the frame
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        label = f"{class_name}: {confidence:.2f}"
        cv2.putText(frame, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    # Display the frame with detections
    cv2.imshow("YOLO Webcam Detection", frame)

    # Break the loop if the user presses 'q'
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the webcam and close the window
cap.release()
cv2.destroyAllWindows()
The script initializes the webcam using OpenCV’s VideoCapture function, which prepares the webcam for capturing video frames. It then ensures the webcam is successfully opened; if not, the program exits with an error message. Once the webcam is initialized, the script enters a loop to continuously capture frames from the webcam. Each frame is converted from BGR to RGB format since YOLO expects RGB input. The converted frame is then passed to the YOLO model for object detection.
In the line cap = cv2.VideoCapture(0), the 0 selects the default camera. If you have multiple cameras (for example, an external one attached to your laptop), you may need to change this index.
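If you aren't sure which index your camera lives at, a quick probe over the first few indices will show which ones OpenCV can actually open. This is just a convenience sketch; the number of indices worth checking varies by system:

import cv2

# Try the first few camera indices and report which ones open
for index in range(4):
    cap = cv2.VideoCapture(index)
    if cap.isOpened():
        print(f"Camera found at index {index}")
        cap.release()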
💡
If you want to display the output in Zoom or similar, you'll need to set up a virtual camera. You can find some very simple instructions on how to do this with OBS (free) in the article YOLOv5 Object Detection Tutorial: Bounding Box Webcams For Zoom (the link will jump you to the OBS instructions).

Processing and displaying frames

The YOLO model processes the frame and returns results containing detected objects. For each detected object, the script extracts the bounding box coordinates, class ID, class name, and confidence score. It draws rectangles around the detected objects and labels them with the class name and confidence score. The processed frame with detected objects is displayed using OpenCV’s imshow function.
The loop continues capturing and processing frames until the user exits by pressing 'q'. Finally, the webcam is released and all OpenCV windows are closed to free up resources.
Running the webcam script on myself, the model detects me as a person with 0.94 confidence. I guess I'm only 94% human. Something to focus on from a personal perspective.
If you want to run YOLO on a YouTube video, you can do that too! Here's a script that will download a YouTube video and run the model on each frame while displaying the results. Note that I only recommend using this for educational purposes on a small number of videos, not for scraping large amounts of data.
import cv2
from ultralytics import YOLO
import os
import argparse
from pytubefix import YouTube

# Function to download a YouTube video using pytubefix
def download_youtube_video(url, output_path):
    try:
        yt = YouTube(url)
        stream = yt.streams.get_highest_resolution()
        stream.download(output_path=output_path, filename="downloaded_video.mp4")
        print("Video downloaded successfully.")
    except Exception as e:
        print(f"Error downloading video: {e}")

# Parse command-line arguments
parser = argparse.ArgumentParser(description="YOLOv9 Object Detection on YouTube Video")
parser.add_argument("url", type=str, help="URL of the YouTube video to process")
args = parser.parse_args()

# Path to the YOLOv9 weights; the library will download them if missing
model_path = "yolov9c.pt"

# Load the pretrained YOLOv9 model
model = YOLO(model_path)

# Download the YouTube video
youtube_url = args.url
download_youtube_video(youtube_url, os.getcwd())

# Open the downloaded video
video_path = "downloaded_video.mp4"
cap = cv2.VideoCapture(video_path)

if not cap.isOpened():
    print("Error: Could not open video.")
    exit()

while True:
    ret, frame = cap.read()
    if not ret:
        print("End of video.")
        break

    # Convert the frame to the format expected by YOLO
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Run the YOLO model on the frame
    results = model.predict(frame_rgb)

    # Draw bounding boxes on the frame
    for box in results[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy.tolist()[0])
        class_id = int(box.cls)
        class_name = results[0].names[class_id]
        confidence = box.conf.item()
        if confidence < 0.5:
            continue
        # Draw rectangle and label on the frame
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        label = f"{class_name}: {confidence:.2f}"
        cv2.putText(frame, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    # Display the frame with detections
    cv2.imshow("YOLO Video Detection", frame)

    # Break the loop if the user presses 'q'
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the video and close the window
cap.release()
cv2.destroyAllWindows()
This script begins by importing the required libraries and initializing the YOLOv9 model. It includes a function to download a YouTube video using the pytubefix library. The video is downloaded in the highest resolution available and saved to the specified output path.
Command-line arguments are parsed to get the URL of the YouTube video to be processed. The YOLOv9 model is then loaded, and the YouTube video is downloaded to the current working directory. The downloaded video is opened for processing using OpenCV’s VideoCapture function.
Detected objects are highlighted with bounding boxes and labels, similar to the webcam implementation. The processed frames are displayed using OpenCV’s imshow function. The loop continues until the end of the video or until the user exits by pressing 'q'. Finally, the video is released and all OpenCV windows are closed to free up resources.
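Because the script reads the URL with argparse, you run it from the command line and pass the video URL as the single positional argument. Assuming you saved the script as yolo_youtube.py (the filename is just an example):

python yolo_youtube.py "https://www.youtube.com/watch?v=VIDEO_ID"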


Keeping track of data with Weave

Weave offers a seamless way to log and debug model inputs and outputs, build evaluations, and organize information from experimentation to production with a single line of code. Here's how we can integrate Weave into a YOLOv9 inference pipeline.
from ultralytics import YOLO
import requests
from PIL import Image
from io import BytesIO
import os
import wandb
import weave

# Initialize Weave and wandb with the same project name
project_name = "yolo_training"
weave.init(project_name)
run = wandb.init(project=project_name)

# Use the specified artifact
artifact = run.use_artifact('byyoung3/model-registry/yolo:v0', type='model')
artifact_dir = artifact.download()

# Define the path to the downloaded model
model_path = os.path.join(artifact_dir, "best.pt")

# Load the pretrained YOLOv9 model
model = YOLO(model_path)

# Function to run inference on a single image
@weave.op
def run_inference(image: Image.Image) -> dict:
    try:
        # Save the image locally for prediction
        local_image_path = 'temp_image.jpg'
        image.save(local_image_path)

        # Run the YOLO model on the image with adjusted confidence and NMS thresholds
        results = model.predict(local_image_path, conf=0.7, iou=0.2)

        # Draw bounding boxes on the image and save the result
        results[0].save(local_image_path)
        result_image = Image.open(local_image_path)

        # Extract predictions
        predictions = []
        for box in results[0].boxes:
            class_id = int(box.cls)
            class_name = results[0].names[class_id]
            confidence = box.conf.item()
            coordinates = box.xyxy.tolist()
            predictions.append({
                'class': class_name,
                'confidence': confidence,
                'coordinates': coordinates
            })

        # Prepare the results
        result_data = {
            'result_image': result_image,
            'predictions': predictions
        }
        return result_data
    except Exception as e:
        return {'error': str(e)}

# Download the image from the URL
image_url = "https://i.ytimg.com/vi/7FmHydF9Gvg/hqdefault.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Run inference using the downloaded image
inference_result = run_inference(image)
print(inference_result)
In our script, we integrated Weave to track and evaluate the model's performance. We configured both Weave and Weights & Biases with the same project name to ensure seamless synchronization. We then created a function, decorated with @weave.op, that directly processes PIL images for inference. With the latest updates to Weave, there is no need to convert images into base64 data URLs, making the logging process more straightforward and efficient.
The inference function processes raw images, runs the model, and directly logs the images along with the predictions, including bounding boxes and detection details. This approach simplifies the workflow, allowing for clear and direct tracking of model outputs, such as class names, confidence scores, and coordinates.
Weave efficiently logs all relevant outputs, making it easier to monitor, debug, and understand the model's behavior throughout experimentation and production. Here's a preview of what the logged data looks like inside Weave:

Speed is everything

As we wrap up this guide on real-time object detection using YOLOv9 and OpenCV, it's clear that YOLOv9 offers a practical and efficient solution for various applications. From static image analysis to real-time webcam and YouTube video processing, the examples provided demonstrate the model's effectiveness and versatility.
Implementing YOLOv9 with OpenCV is straightforward, and the improvements in speed and accuracy make it a valuable tool for developers and researchers. Whether you're working on security systems, autonomous vehicles, or any other project requiring reliable object detection, YOLOv9 provides a solid foundation.
Also, if you're interested in fine-tuning the model, stay tuned: I'll be releasing a guide on fine-tuning YOLO as well!
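In the meantime, here's a taste of how compact the training call is with the Ultralytics API. This is only a minimal sketch: coco8.yaml is a tiny demo dataset that ships with the library, and the hyperparameters are illustrative rather than tuned:

from ultralytics import YOLO

# Start from pretrained weights and fine-tune on a small demo dataset
model = YOLO("yolov9c.pt")
model.train(data="coco8.yaml", epochs=10, imgsz=640)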

