Hemm: Holistic Evaluation of Multi-modal Generative Models


Introduction

Recent strides in text-to-image generation have shown that these models can create a wide range of high-fidelity images from natural language prompts. However, arranging objects with diverse attributes and relationships into a coherent scene remains challenging. To address this, we present Hemm, a library designed to comprehensively evaluate text-to-image generation models for prompt comprehension.
Hemm builds on the logging and tracing capabilities of Weave and Weights & Biases to perform comprehensive, apples-to-apples evaluations of text-to-image generation models on several state-of-the-art metrics for prompt comprehension. Hemm is based on the metrics proposed in the following projects:

Quickstart

Installation

First, we recommend installing PyTorch by following the instructions at pytorch.org/get-started/locally.
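For example, a default build can usually be installed with the command below; treat this as a sketch, since the exact command for your platform and CUDA version comes from the selector on that page.
pip install torch
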
Next, you can clone and install Hemm using the following commands.
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"


Publish a Weave Dataset for Evaluation

First, you need to publish your evaluation dataset to Weave. Check out this tutorial, which shows you how to publish a dataset to your project; a minimal sketch is also shown below.
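
As a minimal sketch, publishing a small dataset of prompts might look like the following. The project name, dataset name, and rows here are placeholder assumptions you would replace with your own.
import weave

# Initialize Weave against your project
weave.init(project_name="image-quality-leaderboard")

# Each row becomes one example the evaluation pipeline will run on
dataset = weave.Dataset(
    name="prompt-dataset",  # hypothetical dataset name
    rows=[
        {"prompt": "a photo of an astronaut riding a horse on mars"},
        {"prompt": "a red cube stacked on top of a blue sphere"},
    ],
)

# Publish the dataset so it can later be fetched with `weave.ref(...)`
weave.publish(dataset)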

[Screenshot: Exploring a Weave dataset in the UI]


Running the Evaluations

Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model against the supported metrics.
import wandb
import weave

from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric

# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")

# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`.
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your model is not diffusers-compatible.
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")

# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)

# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)

# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)

# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)

# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()

# Evaluate!
evaluation_pipeline(dataset=dataset)
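
As noted in the comments above, models that are not diffusers-compatible can be wrapped in a custom `weave.Model`. The sketch below is a hypothetical illustration, not part of Hemm's API: the `MyImageGenerationModel` class, its `model_name` attribute, and the placeholder generation call are all assumptions you would replace with your own inference code. Check Hemm's documentation for the exact output format the `EvaluationPipeline` expects.
import weave
from PIL import Image

class MyImageGenerationModel(weave.Model):
    # Hypothetical configuration attribute; replace with your model's settings
    model_name: str

    @weave.op()
    def predict(self, prompt: str) -> dict:
        # Placeholder: call your own inference backend here
        image = Image.new("RGB", (512, 512), color="gray")
        return {"image": image}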

[Screenshot: The Weave Evaluation UI]


Hemm Leaderboards