Compare Methods for Converting and Optimizing HuggingFace Models for Deployment
In this article, we'll walk through how to convert trained HuggingFace models to slimmer, leaner models for deployment with code examples.
In this article, we're going to walk through how to convert trained PyTorch and Keras models to slimmer, leaner models for deployment. Let's start with the "why" and then move on to the code.
Most people train their models using Keras or PyTorch. The question we want to answer today is why you can't just load those models in the app and use them for inference.
Before we dive in, here's what we'll be covering:
Table of Contents
- Why Are We Converting and Optimizing Our PyTorch and Keras Models?
- Fine-Tuning a Transformer Model
- Converting the Model to a Suitable Format
- Optimizing the Model for Fast Inference
- Benchmarks
- Conclusion
Let's get going.
Why Are We Converting and Optimizing Our PyTorch and Keras Models?
The reason to convert and optimize PyTorch and Keras models is that these libraries are themselves extremely large and, in many cases, so are the models. This makes deploying them for inference slower, more memory intensive, and ultimately more expensive.
And if you're hoping to deploy the model on an IoT device, there's simply no practical way to run a full PyTorch model on such a small device.
The good news? There are a couple of ways to take your trained models, convert them to a format more suitable for inference, and then optimize those models for faster inference!
In this article, we'll walk you through how you can take a pre-trained Transformer from HuggingFace 🤗, fine-tune it on the task of your choice, convert it to ONNX or TensorRT and finally optimize it to make it ready for deployment.
If you'd like to try any of this on your own, all the code in this report can be copied and run on an environment of your choice. And, if you'd like to sign up and see how it all works on W&B, it takes just five lines of code, plus it's free to get started.
Fine-Tuning a Transformer Model
To get started with fine-tuning your transformer model, you first need a model to deploy! For our exercise today, we'll work with a sequence classification task and grab a pre-trained DistilBERT model:
import wandb
from transformers import AutoModelForSequenceClassification, AutoTokenizer

wandb.init(project="hf-end-to-end-deployment", job_type="model-upload")

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

model.save_pretrained('sequence-classification')
tokenizer.save_pretrained('sequence-classification')

artifact = wandb.Artifact("sequence-classification-model", type="model")
artifact.add_dir("sequence-classification")
wandb.log_artifact(artifact)
wandb.finish()
The next thing we want to do is fine-tune this model on a dataset. We chose an IMDB dataset. We then used the HuggingFace trainer and its integration with W&B to train the model, track metrics, and save model checkpoints:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import Trainer, TrainingArguments
import wandb

imdb = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("./sequence-classification")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model = AutoModelForSequenceClassification.from_pretrained("./sequence-classification", num_labels=2)

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True)

tokenizer_dataset = imdb.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

run = wandb.init(project="hf-end-to-end-deployment", entity="int_pb", job_type="finetuning")
wandb.use_artifact("int_pb/hf-end-to-end-deployment/sequence-classification-model:latest")

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenizer_dataset["train"],
    eval_dataset=tokenizer_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
The code above generated the training plots for this fine-tuning run, logged to W&B.
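If you also want accuracy (and not just loss) reported during evaluation, you can pass a compute_metrics function to the Trainer. Here's a minimal sketch, assuming scikit-learn is installed; the helper below is ours, not part of the original training script:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple for the evaluation set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Hook it into the Trainer above, e.g.:
# trainer = Trainer(..., compute_metrics=compute_metrics)

With report_to="wandb" already set, these evaluation metrics should show up in W&B alongside the loss curves.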
This fine-tuned model is used for all the following steps.
Converting the Model to a Suitable Format
PyTorch supports dynamic computation graphs, which means the computation graph is created at run time. In TensorFlow, by contrast, a static computation graph is created at compile time, before the model runs. Both approaches have their pros and cons, but at inference time we want a graph that is fast and optimal, and that's the static computation graph. The fixed nature of these graphs opens them up to faster computation and a wide variety of possible optimizations, whereas the dynamic approach is better suited to debugging and prototyping.
What if we could prototype the model with a dynamic graph and then convert it to a static graph? That is exactly what we will do using ONNX and TensorRT!
- ONNX (Open Neural Network Exchange) - ONNX is an open format built to represent machine learning models. You can convert any model in the framework of your choice into a static computation graph format which can then be used with a variety of tools, runtimes, and compilers.
- TensorRT - TensorRT by NVIDIA is a tool for high-performance deep learning inference, heavily optimized for NVIDIA GPUs. It is currently most likely the fastest way to perform inference.
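To make the static-graph idea concrete, here's a small sketch that loads an exported graph with the onnx package and lists its operators. It assumes you already have a model.onnx on disk (we'll produce one in the next section); every node and edge is fixed ahead of time, which is exactly what enables graph-level optimizations:

import onnx

# Load an exported ONNX model from disk (assumes model.onnx exists)
model = onnx.load("model.onnx")

# Verify the graph is structurally valid
onnx.checker.check_model(model)

# The computation graph is static: every operator and connection is known up front
for node in model.graph.node[:10]:
    print(node.op_type, node.name)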
We'll start with ONNX:
Converting to ONNX
In the previous section, we obtained a fine-tuned DistilBERT model for sequence classification and logged it to W&B as an Artifact. In this section, we'll go through how you can pull that model and convert it to ONNX in three different ways!
Each of these ways takes the same model as input and produces the same ONNX model as output (we found the final approach we outline below to be the easiest). You can see the lineage of these models in the W&B Artifacts panel for the project; try changing the panel style to complete lineage and zooming out to see the complete workflow.
Transformers ONNX Export
The transformers library has its own ONNX submodule specifically for converting transformers to the ONNX format. In the following snippet, we load our pretrained model and use the transformers.onnx.export utility to save the ONNX model:
from pathlib import Path
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.onnx import FeaturesManager
import wandb

# Download the fine-tuned model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", job_type="convert_to_onnx")
artifact = wandb.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()

model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(path)

feature = 'sequence-classification'
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
onnx_config = model_onnx_config(model.config)

onnx_inputs, onnx_outputs = transformers.onnx.export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=12,
    output=Path("model.onnx"),
)

# Log the generated model to W&B
onnx_artifact = wandb.Artifact("sequence-classification-onnx", type="onnx-model")
onnx_artifact.add_file("model.onnx")
wandb.log_artifact(onnx_artifact)
wandb.finish()
Torch Export
Another possible option (though a bit more involved) is the ONNX export utility built into PyTorch. In this case, you have to manually specify the inputs, outputs, and dynamic_axes so that the exported ONNX model accepts inputs of different shapes.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import Dict
import wandb

run = wandb.init(project="hf-end-to-end-deployment", job_type="convert_to_onnx")
artifact = wandb.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()

model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(path)

def convert_to_onnx(model, output_path: str, inputs: Dict[str, torch.Tensor]):
    torch.onnx.export(
        model,
        args=(inputs["input_ids"], inputs["attention_mask"]),
        f=output_path,
        opset_version=12,
        do_constant_folding=True,
        input_names=["input_ids", "attention_mask"],
        output_names=["model_output"],
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence"},
            "attention_mask": {0: "batch_size", 1: "sequence"},
            "model_output": {0: "batch_size"},
        },
        verbose=False,
    )

inputs = tokenizer("This is a test", return_tensors="pt")
convert_to_onnx(model, "model.onnx", inputs)

onnx_artifact = wandb.Artifact("sequence-classification-onnx", type="model")
onnx_artifact.add_file("model.onnx")  # attach the exported ONNX file to the artifact
wandb.log_artifact(onnx_artifact)
wandb.finish()
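Thanks to the dynamic_axes above, the exported graph accepts batch sizes and sequence lengths different from the dummy input used at export time. Here's a quick sanity check with ONNX Runtime; this is a sketch that assumes the tokenizer saved earlier and the input names from the export above:

import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./sequence-classification")
session = rt.InferenceSession("model.onnx")

# Two inputs of very different lengths both work because of dynamic_axes
for text in ["Great!", "A much longer review that rambles on about the plot for a while..."]:
    enc = tokenizer(text, return_tensors="np")
    logits = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]
    print(f"{len(enc['input_ids'][0])} tokens -> logits shape {logits.shape}")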
HuggingFace Optimum
The easiest way for us was using the Optimum library from HuggingFace. After all, it's built specifically to convert and optimize transformer models.
All you need to do is pass the path of your trained model to ORTModelForSequenceClassification.from_pretrained, set the from_transformers flag to True, and save the resulting model:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import wandb

# Download the fine-tuned model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", job_type="convert_to_onnx", entity="int_pb")
artifact = wandb.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()

model = ORTModelForSequenceClassification.from_pretrained(path, from_transformers=True)
model.save_pretrained('sequence-classfication-ort/')

# Log the model to W&B
artifact = wandb.Artifact("sequence-classification-optimum-onnx", type="model")
artifact.add_file('sequence-classfication-ort/model.onnx')
wandb.log_artifact(artifact)
wandb.finish()
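Because ORTModelForSequenceClassification mirrors the regular transformers API, you can drop the converted model straight into a pipeline to sanity-check predictions. A minimal sketch, assuming the directory saved above and the tokenizer from the fine-tuning step:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the ONNX model saved in the previous step (directory name as in the snippet above)
model = ORTModelForSequenceClassification.from_pretrained('sequence-classfication-ort/')
tokenizer = AutoTokenizer.from_pretrained('./sequence-classification')

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was surprisingly good!"))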
Optimizing the Model for Fast Inference
Now that we've converted our model to ONNX or TensorRT (i.e. a static graph representation), there are many ways it can be optimized for faster inference. These include:
- Constant Folding: Evaluating constant expressions once at graph-compile time so they don't have to be recomputed at run time.
- Redundant Node Elimination: Removing redundant nodes without changing the semantics of the graph.
- Operator Fusion: Merging several operator nodes into one so they can be executed together.
- Quantization: Converting the data and model weights from FP32 to the lower-precision INT8 format. This roughly doubles the speed, but it can also reduce the accuracy of the model (a toy sketch follows below).
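To give a rough idea of what INT8 quantization does under the hood, here's a toy sketch of affine quantization with a scale and zero point. This is an illustration only, not what onnxruntime literally executes:

import numpy as np

def quantize_int8(x: np.ndarray):
    # Affine quantization: map the FP32 range of x onto the INT8 range [-128, 127]
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original FP32 values
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print("max absolute error:", np.abs(weights - dequantize(q, scale, zp)).max())

The INT8 tensor takes a quarter of the memory of the FP32 original, and the rounding error you see here is what shows up as the small accuracy drop we measure later.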
Optimizing the ONNX Model with ONNXRuntime
ONNXRuntime provides an interface for optimizing computation graphs. It also has a utility for transformer-specific optimizations, but only for a handful of architectures (namely BERT, BART, and GPT-2).
In the following snippet, we start an ONNX inference session and set the graph optimization level to enable all optimizations. Then we quantize the model using the quantize_dynamic function built into the onnxruntime library:
import onnxruntime as rt
from onnxruntime.quantization import quantize_dynamic, QuantType
import wandb

# Download the fine-tuned model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", entity='int_pb', job_type='optimize_onnx')
artifact = run.use_artifact("int_pb/hf-end-to-end-deployment/sequence-classification-transformers-onnx:v0")
path = artifact.download()

sess_options = rt.SessionOptions()

# Set graph optimization level
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

# To enable model serialization after graph optimization, set this
sess_options.optimized_model_filepath = "optimized_model.onnx"

session = rt.InferenceSession(f"{path}/model.onnx", sess_options)

model_fp32 = 'optimized_model.onnx'
model_quant = './model.quant.onnx'
quantized_model = quantize_dynamic(model_fp32, model_quant)

# Log the model to W&B
artifact = wandb.Artifact("sequence-classification-onnx-quantized", type='model')
artifact.add_file('./model.quant.onnx')
wandb.log_artifact(artifact)
wandb.finish()
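A quick way to see the effect of dynamic quantization is to compare file sizes on disk. A small sketch using the file names from the snippet above; since weights go from 4-byte FP32 to 1-byte INT8, you should see roughly a 4x reduction for weight-dominated models:

import os

for name in ["optimized_model.onnx", "model.quant.onnx"]:
    size_mb = os.path.getsize(name) / (1024 * 1024)
    print(f"{name}: {size_mb:.1f} MB")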
Optimizing the ONNX Model With Optimum
Optimum comes with optimizations and quantization specific to transformers and hence provides the most straightforward way to optimize your model. The code panel below will walk you through that in detail.
Essentially: we'll download the ONNX model we exported with Optimum, load it, and initialize the ORTOptimizer object by passing the model as an argument so that it can load the model graph. Then we'll create an optimization config with optimization_level set to 99, which means all possible optimizations should be applied.
Now that the graph optimizations are done, the next step is to quantize the model. We'll instantiate the ORTQuantizer with the path to the optimized ONNX model.
For the config, we use dynamic quantization:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
from pathlib import Path
import wandb

# Download the ONNX model from W&B artifacts
run = wandb.init(project="hf-end-to-end-deployment", entity="int_pb", job_type="optimize_onnx")
artifact = run.use_artifact('int_pb/hf-end-to-end-deployment/sequence-classification-optimum-onnx:v0', type='model')
artifact_dir = artifact.download()

model = ORTModelForSequenceClassification.from_pretrained(artifact_dir)

optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=99)
optimized_path = optimizer.optimize(optimization_config, save_dir="./sequence-classification-ort-optimized")

dynamic_quantizer = ORTQuantizer.from_pretrained(optimized_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
path = dynamic_quantizer.quantize(dqconfig, save_dir="./sequence-classification-ort-quantized")

# Log the optimized ONNX model to W&B
artifact = wandb.Artifact('sequence-classification-onnx-optimum-quantized', type='model')
artifact.add_dir(path)
wandb.log_artifact(artifact)
wandb.finish()
Then we save the quantized model and log it as a W&B Artifact. You can see its complete lineage in the artifact panel below (feel free to zoom in and out).
Artifact: sequence-classification-onnx-optimum-quantized (type: model, created November 9th, 2022)
TensorRT
TensorRT is a machine learning framework published by NVIDIA for running inference on their hardware. It is highly optimized for NVIDIA GPUs and adds further optimizations on top of an ONNX model. It is likely the fastest way to run inference at the moment.
While writing this article, we looked for the easiest ways to convert Transformer models to the TensorRT format. If you want to install TensorRT from scratch, we strongly recommend using the Polygraphy module built on top of it. However, the easiest way we found was this amazing Transformer Deploy repository!
It has pre-built methods to convert your transformer model to ONNX and TensorRT. Moreover, it automatically creates config files for further inference needs. All you need is a Docker installation and the Transformer Deploy library!
Clone the Transformer Deploy library and pull its docker image:
git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy

# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.5.3
Pull the fine-tuned model:
# Download the fine-tuned model from W&B artifacts
import wandb

api = wandb.Api()
artifact = api.artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")
path = artifact.download()
Use the downloaded Docker image to convert and optimize your model into the TensorRT format by running the following command in your working directory. (One caveat: this performs only FP16 quantization, not INT8, since INT8 requires calibrating the model on representative data.)
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.3 \
  bash -c "cd /project && \
    convert_model -m ./artifacts/model-1lv0fpmm:v0 \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
Upload the model as a W&B Artifact:
import wandb

run = wandb.init(project="hf-end-to-end-deployment", entity="int_pb", job_type="convert_to_tensorRT")
artifact = run.use_artifact("int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0")

triton_artifact = wandb.Artifact("sequence-classification-tensorrt", type="tensorrt-model")
triton_artifact.add_dir('triton_models')
wandb.log_artifact(triton_artifact)
wandb.finish()
Benchmarks
For benchmarking all the models created while writing this report, we'll use two metrics: Samples per second (for speed) and Accuracy. To calculate the accuracy, we'll use the test set of the IMDB dataset on which we had initially fine-tuned our transformer model.
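If you'd like to reproduce this kind of benchmark yourself, a rough sketch of the measurement loop is shown below. The file and input names are assumptions based on the exports above, not the exact scripts used for these panels, and we subsample the test set to keep it quick:

import time
import numpy as np
import onnxruntime as rt
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumes a converted/quantized ONNX model and the fine-tuned tokenizer exist locally
tokenizer = AutoTokenizer.from_pretrained("./sequence-classification")
session = rt.InferenceSession("model.quant.onnx")

test_set = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))

correct = 0
start = time.perf_counter()
for example in test_set:
    enc = tokenizer(example["text"], truncation=True, return_tensors="np")
    logits = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]
    correct += int(np.argmax(logits, axis=-1)[0] == example["label"])
elapsed = time.perf_counter() - start

print(f"accuracy: {correct / len(test_set):.4f}")
print(f"samples/sec: {len(test_set) / elapsed:.2f}")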
Does the Method for Converting an ONNX Model Make a Difference?
It turns out that you can use the Torch ONNX export, the Transformers ONNX export, or the Optimum ONNX export, and it makes no discernible difference in performance. The accuracy of all three models is identical, and the number of samples processed per second is almost identical as well.
How Does Optimization Help?
In this case, we're comparing the speed and accuracy of the ONNX models at different stages of optimization. The graph optimizations give a minor speed bump with the same accuracy, but after quantization we see a massive increase in speed with a slightly lower accuracy.
You might look at the speed comparison above and wonder why it is so much slower than what we saw in the previous section. The reason is that quantization optimizations are hardware-specific, and for GPUs they can currently only be done through TensorRT.
Comparison with TensorRT
Finally, we'll compare the difference in speed between all the methods we've seen so far. TensorRT, as expected, is the fastest thanks to how heavily it is optimized for inference, though that speed can come at some cost in accuracy.
Transformer Deploy
As we explored all the ways to make HuggingFace transformer models suitable for inference, we found that most of them are hard to work with and rather flaky, especially because of hardware-specific optimizations. The Transformer Deploy library turned out to be a saving grace, with great documentation and practically everything shipped out of the box.
Converting to ONNX and Optimization
The following snippet starts by downloading the model and tokenizer from W&B. The tokenizer is used to create a dummy input for tracing the static computation graph. The convert_to_onnx function then converts the model to the ONNX format and applies graph optimizations and FP16 quantization.
import wandb
from transformer_deploy.backends.pytorch_utils import convert_to_onnx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download the fine-tuned model and tokenizer from W&B artifacts
api = wandb.Api()
model_path = api.artifact('int_pb/hf-end-to-end-deployment/model-1lv0fpmm:v0').download()
tokenizer_path = api.artifact('int_pb/hf-end-to-end-deployment/sequence-classification').download()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Dummy input used to trace the static computation graph
data = "This is a test"
input_torch = tokenizer(data, return_tensors="pt")

convert_to_onnx(
    model_pytorch=model,
    output_path="model_qat.onnx",
    inputs_pytorch=input_torch,
    quantization=True,
    var_output_seq=False,
    output_names=["output"],
)
Converting to TensorRT
Adding the following snippet to the one above will convert your model into a TensorRT engine with FP16 quantization:
from transformer_deploy.backends.trt_utils import build_engine, save_engine
import tensorrt as trt
from tensorrt import Logger, Runtime

trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)

batch_size = 32
max_seq_len = 256

engine = build_engine(
    runtime=runtime,
    onnx_file_path="model_qat.onnx",
    logger=trt_logger,
    min_shape=(1, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=False,
)

save_engine(engine, "model.trt")
Conclusion
In conclusion, there are a variety of ways to get to the fastest and most accurate model for inference in production, but it does require a bit of experimentation, since some optimizations may lead to a loss of accuracy. The entire process boils down to a few steps:
- Convert your model to a static graph (ONNX)
- Make graph optimizations to the static graph
- Quantize the computation graph to FP16 or INT8
- Run Inference!
However, all the tools we mentioned in this report come with their pros and cons. For example, ONNX provides a common format for most ML frameworks and supports different types of hardware accelerators, but it doesn't currently support transformer-specific optimizations for most transformer architectures, and it supports INT8 quantization only on CPUs. Meanwhile, TensorRT is extremely fast and supports all the optimizations in ONNX and more, but it can be extremely hard to set up manually and works only on recent NVIDIA hardware, so it isn't compatible with older GPUs.