Compare Methods for Converting and Optimizing HuggingFace Models for Deployment
In this article, we'll walk through how to convert trained HuggingFace models to slimmer, leaner models for deployment with code examples.
In this article, we're going to walk through how to convert trained PyTorch and Keras models to slimmer, leaner models for deployment. Let's start with the "why" and then move on to the code.
Most people train their models using Keras or PyTorch. The question we want to answer today is why you can't just load those models in the app and use them for inference.
Before we dive in, here's what we'll be covering:
Table of Contents
- Why Are We Converting and Optimizing Our PyTorch and Keras Models?
- Fine-Tuning a Transformer Model
- Converting the Model to a Suitable Format
- Optimizing the Model for Fast Inference
- Benchmarks
- Conclusion
Let's get going.
Why Are We Converting and Optimizing Our PyTorch and Keras Models?
The reason to convert and optimize PyTorch and Keras models is that these libraries are themselves extremely large and, in many cases, so are the models. This makes deploying them for inference slower, more memory intensive, and ultimately more expensive.
And if you're hoping to deploy the model on an IoT device, there's simply no practical way to run a full PyTorch model on such a small device.
The good news? There are a couple of ways to take your trained models, convert them to a format more suitable for inference, and then optimize those models for faster inference!
In this article, we'll walk you through how you can take a pre-trained Transformer from HuggingFace 🤗, fine-tune it on the task of your choice, convert it to ONNX or TensorRT and finally optimize it to make it ready for deployment.
If you'd like to try any of this on your own, all the code in this report can be copied and run on an environment of your choice. And, if you'd like to sign up and see how it all works on W&B, it takes just five lines of code, plus it's free to get started.
Fine-Tuning a Transformer Model
To get started with fine-tuning your transformer model, you first need a model to deploy! For our exercise today, we'll work with a sequence classification task and grab a pre-trained DistilBERT model:
import wandb
from transformers import AutoModelForSequenceClassification, AutoTokenizer

wandb.init(project="hf-end-to-end-deployment", job_type="model-upload")

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

model.save_pretrained('sequence-classification')
tokenizer.save_pretrained('sequence-classification')

artifact = wandb.Artifact("sequence-classification-model", type="model")
artifact.add_dir("sequence-classification")
wandb.log_artifact(artifact)
wandb.finish()
The next thing we want to do is fine-tune this model on a dataset. We chose an IMDB dataset. We then used the HuggingFace trainer and its integration with W&B to train the model, track metrics, and save model checkpoints:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import Trainer, TrainingArguments
import wandb

imdb = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("./sequence-classification")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model = AutoModelForSequenceClassification.from_pretrained("./sequence-classification", num_labels=2)

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True)

tokenizer_dataset = imdb.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

run = wandb.init(project="hf-end-to-end-deployment", entity="int_pb", job_type="finetuning")
wandb.use_artifact("int_pb/hf-end-to-end-deployment/sequence-classification-model:latest")

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenizer_dataset["train"],
    eval_dataset=tokenizer_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
The code above generated the training plots for this fine-tuning run, logged to W&B.
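If you also want accuracy (and not just loss) reported during evaluation, you can pass a compute_metrics function to the Trainer. Here's a minimal sketch, assuming scikit-learn is installed; the helper below is ours, not part of the original training script:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple for the evaluation set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Hook it into the Trainer above, e.g.:
# trainer = Trainer(..., compute_metrics=compute_metrics)

With report_to="wandb" already set, these evaluation metrics should show up in W&B alongside the loss curves.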
This fine-tuned model is used for all the following steps.
Converting the Model to a Suitable Format
PyTorch supports dynamic computation graphs, which means the computation graph is created at run time. In TensorFlow, by contrast, a static computation graph is created at compile time, before the model runs. Both approaches have their pros and cons, but at inference time we want a graph that is fast and optimal, and that's the static computation graph. The fixed nature of these graphs opens them up to faster computation and a wide variety of possible optimizations, whereas the dynamic approach is better suited to debugging and prototyping.
What if we could prototype the model with a dynamic graph and then convert it to a static graph? That is exactly what we will do using ONNX and TensorRT!
- ONNX (Open Neural Network Exchange) - ONNX is an open format built to represent machine learning models. You can convert any model in the framework of your choice into a static computation graph format which can then be used with a variety of tools, runtimes, and compilers.
- TensorRT - TensorRT by NVIDIA is a tool for high-performance deep learning inference, heavily optimized for NVIDIA GPUs. It is currently most likely the fastest way to perform inference.
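To make the static-graph idea concrete, here's a small sketch that loads an exported graph with the onnx package and lists its operators. It assumes you already have a model.onnx on disk (we'll produce one in the next section); every node and edge is fixed ahead of time, which is exactly what enables graph-level optimizations:

import onnx

# Load an exported ONNX model from disk (assumes model.onnx exists)
model = onnx.load("model.onnx")

# Verify the graph is structurally valid
onnx.checker.check_model(model)

# The computation graph is static: every operator and connection is known up front
for node in model.graph.node[:10]:
    print(node.op_type, node.name)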
We'll start with ONNX:
Converting to ONNX
In the previous section, we obtained a fine-tuned DistilBERT model for sequence classification and logged it to W&B as an Artifact. In this section, we'll go through how you can pull that model and convert it to ONNX in three different ways!
Each of these ways takes the same model as input and produces the same ONNX model as output (we found the final approach we outline below to be the easiest). You can see the lineage of these models in the W&B Artifacts panel for the project; try changing the panel style to complete lineage and zooming out to see the complete workflow.
Transformers ONNX Export
The transformers library has its own ONNX submodule specifically for converting transformers to the ONNX format. In the following snippet, we load our pretrained model and use the transformers.onnx.export utility to save the ONNX model:
from pathlib import Path
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.onnx import FeaturesManager
import wandb

# Download the fine-tuned model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", job_type="convert_to_onnx")
artifact = wandb.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()

model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(path)

feature = 'sequence-classification'
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
onnx_config = model_onnx_config(model.config)

onnx_inputs, onnx_outputs = transformers.onnx.export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=12,
    output=Path("model.onnx"),
)

# Log the generated model to W&B
onnx_artifact = wandb.Artifact("sequence-classification-onnx", type="onnx-model")
onnx_artifact.add_file("model.onnx")
wandb.log_artifact(onnx_artifact)
wandb.finish()
Torch Export
Another possible option (though a bit more involved) is the ONNX export utility built into PyTorch. In this case, you have to manually specify the inputs, outputs, and dynamic_axes so that the exported ONNX model accepts inputs of different shapes.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import Dict
import wandb

run = wandb.init(project="hf-end-to-end-deployment", job_type="convert_to_onnx")
artifact = wandb.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()

model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(path)

def convert_to_onnx(model, output_path: str, inputs: Dict[str, torch.Tensor]):
    torch.onnx.export(
        model,
        args=(inputs["input_ids"], inputs["attention_mask"]),
        f=output_path,
        opset_version=12,
        do_constant_folding=True,
        input_names=["input_ids", "attention_mask"],
        output_names=["model_output"],
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence"},
            "attention_mask": {0: "batch_size", 1: "sequence"},
            "model_output": {0: "batch_size"},
        },
        verbose=False,
    )

inputs = tokenizer("This is a test", return_tensors="pt")
convert_to_onnx(model, "model.onnx", inputs)

onnx_artifact = wandb.Artifact("sequence-classification-onnx", type="model")
onnx_artifact.add_file("model.onnx")  # attach the exported ONNX file to the artifact
wandb.log_artifact(onnx_artifact)
wandb.finish()
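Thanks to the dynamic_axes above, the exported graph accepts batch sizes and sequence lengths different from the dummy input used at export time. Here's a quick sanity check with ONNX Runtime; this is a sketch that assumes the tokenizer saved earlier and the input names from the export above:

import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./sequence-classification")
session = rt.InferenceSession("model.onnx")

# Two inputs of very different lengths both work because of dynamic_axes
for text in ["Great!", "A much longer review that rambles on about the plot for a while..."]:
    enc = tokenizer(text, return_tensors="np")
    logits = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]
    print(f"{len(enc['input_ids'][0])} tokens -> logits shape {logits.shape}")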
HuggingFace Optimum
The easiest way for us was using the Optimum library from HuggingFace. After all, it's built specifically to convert and optimize transformer models.
All you need to do is pass the path of your trained model to ORTModelForSequenceClassification.from_pretrained, set the from_transformers flag to True, and save the resulting model:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import wandb

# Download the fine-tuned model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", job_type="convert_to_onnx", entity="int_pb")
artifact = wandb.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()

model = ORTModelForSequenceClassification.from_pretrained(path, from_transformers=True)
model.save_pretrained('sequence-classfication-ort/')

# Log the model to W&B
artifact = wandb.Artifact("sequence-classification-optimum-onnx", type="model")
artifact.add_file('sequence-classfication-ort/model.onnx')
wandb.log_artifact(artifact)
wandb.finish()
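Because ORTModelForSequenceClassification mirrors the regular transformers API, you can drop the converted model straight into a pipeline to sanity-check predictions. A minimal sketch, assuming the directory saved above and the tokenizer from the fine-tuning step:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the ONNX model saved in the previous step (directory name as in the snippet above)
model = ORTModelForSequenceClassification.from_pretrained('sequence-classfication-ort/')
tokenizer = AutoTokenizer.from_pretrained('./sequence-classification')

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was surprisingly good!"))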
Optimizing the Model for Fast Inference
Now that we've converted our model to ONNX or TensorRT (i.e. a static graph representation), there are many ways it can be optimized for faster inference. These include:
- Constant Folding: Evaluating constant expressions once at graph-compile time so they don't have to be recomputed at run time.
- Redundant Node Elimination: Removing redundant nodes without changing the semantics of the graph.
- Operator Fusion: Merging several operator nodes into one so they can be executed together.
- Quantization: Converting the data and model weights from FP32 to the lower-precision INT8 format. This roughly doubles the speed, but it can also reduce the accuracy of the model (a toy sketch follows below).
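To give a rough idea of what INT8 quantization does under the hood, here's a toy sketch of affine quantization with a scale and zero point. This is an illustration only, not what onnxruntime literally executes:

import numpy as np

def quantize_int8(x: np.ndarray):
    # Affine quantization: map the FP32 range of x onto the INT8 range [-128, 127]
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original FP32 values
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print("max absolute error:", np.abs(weights - dequantize(q, scale, zp)).max())

The INT8 tensor takes a quarter of the memory of the FP32 original, and the rounding error you see here is what shows up as the small accuracy drop we measure later.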
Optimizing the ONNX Model with ONNXRuntime
ONNXRuntime provides an interface for optimizing computation graphs. It also has a utility for transformer-specific optimizations, but only for a handful of architectures (namely BERT, BART, and GPT-2).
In the following snippet, we start an ONNX inference session and set the graph optimization level to enable all optimizations. Then we quantize the model using the quantize_dynamic function built into the onnxruntime library:
import onnxruntime as rt
from onnxruntime.quantization import quantize_dynamic, QuantType
import wandb

# Download the fine-tuned model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", entity='int_pb', job_type='optimize_onnx')
artifact = run.use_artifact("int_pb/hf-end-to-end-deployment/sequence-classification-transformers-onnx:v0")
path = artifact.download()

sess_options = rt.SessionOptions()

# Set graph optimization level
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

# To enable model serialization after graph optimization, set this
sess_options.optimized_model_filepath = "optimized_model.onnx"

session = rt.InferenceSession(f"{path}/model.onnx", sess_options)

model_fp32 = 'optimized_model.onnx'
model_quant = './model.quant.onnx'
quantized_model = quantize_dynamic(model_fp32, model_quant)

# Log the model to W&B
artifact = wandb.Artifact("sequence-classification-onnx-quantized", type='model')
artifact.add_file('./model.quant.onnx')
wandb.log_artifact(artifact)
wandb.finish()
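A quick way to see the effect of dynamic quantization is to compare file sizes on disk. A small sketch using the file names from the snippet above; since weights go from 4-byte FP32 to 1-byte INT8, you should see roughly a 4x reduction for weight-dominated models:

import os

for name in ["optimized_model.onnx", "model.quant.onnx"]:
    size_mb = os.path.getsize(name) / (1024 * 1024)
    print(f"{name}: {size_mb:.1f} MB")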
Optimizing the ONNX Model With Optimum
Optimum comes with optimizations and quantization specific to transformers and hence provides the most straightforward way to optimize your model. The code panel below will walk you through that in detail.
Essentially: we'll download the ONNX model we exported with Optimum, load it, and initialize the ORTOptimizer object by passing the model as an argument so that it can load the model graph. Then we'll create an optimization config with optimization_level set to 99, which means all possible optimizations should be applied.
Now that the graph optimizations are done, the next step is to quantize the model. We'll instantiate the ORTQuantizer with the path to the optimized ONNX model.
For the config, we use dynamic quantization:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
from pathlib import Path
import wandb

# Download the ONNX model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", entity="int_pb", job_type="optimize_onnx")
artifact = run.use_artifact('int_pb/hf-end-to-end-deployment/sequence-classification-optimum-onnx:v0', type='model')
artifact_dir = artifact.download()

model = ORTModelForSequenceClassification.from_pretrained(artifact_dir)

optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=99)
optimized_path = optimizer.optimize(optimization_config, save_dir="./sequence-classification-ort-optimized")

dynamic_quantizer = ORTQuantizer.from_pretrained(optimized_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
path = dynamic_quantizer.quantize(dqconfig, save_dir="./sequence-classification-ort-quantized")

# Log the optimized ONNX model to W&B
artifact = wandb.Artifact('sequence-classification-onnx-optimum-quantized', type='model')
artifact.add_dir(path)
wandb.log_artifact(artifact)
wandb.finish()
Then we save the quantized model and log it as a W&B Artifact. You can see its complete lineage in the artifact panel below (feel free to zoom in and out).
Artifact: sequence-classification-onnx-optimum-quantized (type: model, created November 9th, 2022)
TensorRT
TensorRT is a machine learning framework published by NVIDIA for running inference on their hardware. It is highly optimized for NVIDIA GPUs and adds further optimizations on top of an ONNX model. It is likely the fastest way to run inference at the moment.
While writing this article, we looked for the easiest ways to convert Transformer models to the TensorRT format. If you want to install TensorRT from scratch, we strongly recommend using the Polygraphy module built on top of it. However, the easiest way we found was this amazing Transformer Deploy repository!
It has pre-built methods to convert your transformer model to ONNX and TensorRT. Moreover, it automatically creates config files for further inference needs. All you need is a Docker installation and the Transformer Deploy library!
Clone the Transformer Deploy library and pull its docker image:
git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy

# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.5.3
Pull the fine-tuned model:
# Download the fine-tuned model from W&B artifacts
import wandb

api = wandb.Api()
artifact = api.artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()
Use the downloaded Docker image to convert and optimize your model into the TensorRT format by running the following command in your working directory. (One caveat: this performs only FP16 quantization, not INT8, since INT8 requires calibrating the model on representative data.)
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.3 \
  bash -c "cd /project && \
    convert_model -m ./artifacts/model-1lv0fpmm:v0 \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
Upload the model as a W&B Artifact:
import wandb

run = wandb.init(project="hf-end-to-end-deployment", entity="int_pb", job_type="convert_to_tensorRT")
artifact = run.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")

triton_artifact = wandb.Artifact("sequence-classification-tensorrt", type="tensorrt-model")
triton_artifact.add_dir('triton_models')
wandb.log_artifact(triton_artifact)
wandb.finish()
Benchmarks
For benchmarking all the models created while writing this report, we'll use two metrics: Samples per second (for speed) and Accuracy. To calculate the accuracy, we'll use the test set of the IMDB dataset on which we had initially fine-tuned our transformer model.
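If you'd like to reproduce this kind of benchmark yourself, a rough sketch of the measurement loop is shown below. The file and input names are assumptions based on the exports above, not the exact scripts used for these panels, and we subsample the test set to keep it quick:

import time
import numpy as np
import onnxruntime as rt
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumes a converted/quantized ONNX model and the fine-tuned tokenizer exist locally
tokenizer = AutoTokenizer.from_pretrained("./sequence-classification")
session = rt.InferenceSession("model.quant.onnx")

test_set = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))

correct = 0
start = time.perf_counter()
for example in test_set:
    enc = tokenizer(example["text"], truncation=True, return_tensors="np")
    logits = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]
    correct += int(np.argmax(logits, axis=-1)[0] == example["label"])
elapsed = time.perf_counter() - start

print(f"accuracy: {correct / len(test_set):.4f}")
print(f"samples/sec: {len(test_set) / elapsed:.2f}")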
Does the Method for Converting an ONNX Model Make a Difference?
It turns out that you can use the Torch ONNX export, the Transformers ONNX export, or the Optimum ONNX export, and it makes no discernible difference in performance. The accuracy of all three models is identical, and the number of samples processed per second is almost identical as well.
How Does Optimization Help?
In this case, we're comparing the speed and accuracy of the ONNX models at different stages of optimization. The graph optimizations give a minor speed bump with the same accuracy, but after quantization we see a massive increase in speed with a slightly lower accuracy.
You might look at the speed comparison above and wonder why it is so much slower than what we saw in the previous section. The reason is that quantization optimizations are hardware-specific, and for GPUs they can currently only be done through TensorRT.
Comparison with TensorRT
Finally, we'll compare the difference in speed between all the methods we've seen so far. TensorRT, as expected, is the fastest thanks to how heavily it is optimized for inference, though that speed can come at some cost in accuracy.
Transformer Deploy
As we explored all the ways to make HuggingFace transformer models suitable for inference, we found that most of them are hard to work with and rather flaky, especially because of hardware-specific optimizations. The Transformer Deploy library turned out to be a saving grace, with great documentation and practically everything shipped out of the box.
Converting to ONNX and Optimization
The following snippet starts by downloading the model and tokenizer from W&B. The tokenizer is used to create a dummy input for tracing the static computation graph. The convert_to_onnx function then converts the model to the ONNX format and applies graph optimizations and FP16 quantization.
import wandb
from transformer_deploy.backends.pytorch_utils import convert_to_onnx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download the fine-tuned model and tokenizer from W&B artifacts
api = wandb.Api()
model_path = api.artifact('int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0').download()
tokenizer_path = api.artifact('int_pb/hf-end-to-end-deployment/sequence-classification').download()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Dummy input used to trace the static computation graph
data = "This is a test"
input_torch = tokenizer(data, return_tensors="pt")

convert_to_onnx(
    model_pytorch=model,
    output_path="model_qat.onnx",
    inputs_pytorch=input_torch,
    quantization=True,
    var_output_seq=False,
    output_names=["output"],
)
Converting to TensorRT
Adding the following snippet to the one above will convert your model into a TensorRT engine with FP16 quantization:
from transformer_deploy.backends.trt_utils import build_engine, save_engine
import tensorrt as trt
from tensorrt import Logger, Runtime

trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)

batch_size = 32
max_seq_len = 256

engine = build_engine(
    runtime=runtime,
    onnx_file_path="model_qat.onnx",
    logger=trt_logger,
    min_shape=(1, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=False,
)

save_engine(engine, "model.trt")
Conclusion
In conclusion, there are a variety of ways to get to the fastest and most accurate model for inference in production, but it does require a bit of experimentation, since some optimizations may lead to a loss of accuracy. The entire process boils down to a few steps:
- Convert your model to a static graph (ONNX)
- Make graph optimizations to the static graph
- Quantize the computation graph to FP16 or INT8
- Run Inference!
However, all the tools we mentioned in this report come with their pros and cons. For example, ONNX provides a common format for most ML frameworks and supports different types of hardware accelerators, but it doesn't currently support transformer-specific optimizations for most transformer architectures, and it supports INT8 quantization only on CPUs. Meanwhile, TensorRT is extremely fast and supports all the optimizations in ONNX and more, but it can be extremely hard to set up manually and works only on recent NVIDIA hardware, so it isn't compatible with older GPUs.