Hemm: Holistic Evaluation of Multi-modal Generative Models


Introduction

Recent strides in text-to-image generation have shown that these models can create a wide range of high-fidelity images from natural language prompts. However, arranging objects with diverse attributes and relationships into a coherent scene remains challenging. To address this, we present Hemm, a library designed to comprehensively evaluate text-to-image generation models for prompt comprehension.
Hemm builds on the logging and tracing capabilities of Weave and Weights & Biases to perform comprehensive, apples-to-apples evaluations of text-to-image generation models on several state-of-the-art metrics for prompt comprehension. Hemm is based on the metrics proposed in the following projects:

Quickstart

Installation

First, we recommend installing PyTorch by following the instructions at pytorch.org/get-started/locally.
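For example, a default build can usually be installed with the command below; treat this as a sketch, since the exact command for your platform and CUDA version comes from the selector on that page.
pip install torch
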
Next, you can clone and install Hemm using the following commands.
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"


Publish a Weave Dataset for Evaluation

First, you need to publish your evaluation dataset to Weave. Check out this tutorial, which shows you how to publish a dataset to your project; a minimal sketch is also shown below.
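
As a minimal sketch, publishing a small dataset of prompts might look like the following. The project name, dataset name, and rows here are placeholder assumptions you would replace with your own.
import weave

# Initialize Weave against your project
weave.init(project_name="image-quality-leaderboard")

# Each row becomes one example the evaluation pipeline will run on
dataset = weave.Dataset(
    name="prompt-dataset",  # hypothetical dataset name
    rows=[
        {"prompt": "a photo of an astronaut riding a horse on mars"},
        {"prompt": "a red cube stacked on top of a blue sphere"},
    ],
)

# Publish the dataset so it can later be fetched with `weave.ref(...)`
weave.publish(dataset)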

[Screenshot: Exploring a Weave dataset in the UI]


Running the Evaluations

Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model against the supported metrics.
import wandb
import weave

from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric

# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")

# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`.
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your model is not diffusers-compatible.
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")

# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)

# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)

# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)

# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)

# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()

# Evaluate!
evaluation_pipeline(dataset=dataset)
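
As noted in the comments above, models that are not diffusers-compatible can be wrapped in a custom `weave.Model`. The sketch below is a hypothetical illustration, not part of Hemm's API: the `MyImageGenerationModel` class, its `model_name` attribute, and the placeholder generation call are all assumptions you would replace with your own inference code. Check Hemm's documentation for the exact output format the `EvaluationPipeline` expects.
import weave
from PIL import Image

class MyImageGenerationModel(weave.Model):
    # Hypothetical configuration attribute; replace with your model's settings
    model_name: str

    @weave.op()
    def predict(self, prompt: str) -> dict:
        # Placeholder: call your own inference backend here
        image = Image.new("RGB", (512, 512), color="gray")
        return {"image": image}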

[Screenshot: The Weave Evaluation UI]


Hemm Leaderboards