Hugging Face Optimum
Optimum is an extension of the [Transformers](https://maximofn.com/hugging-face-transformers/) library that provides a set of performance optimization tools for training and running inference on specific hardware with maximum efficiency.
The AI ecosystem evolves rapidly, and more and more specialized hardware appears every day, each with its own optimizations. Optimum therefore allows users to efficiently use any of this hardware with the same ease as Transformers.
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
Optimum allows optimization for the following hardware platforms:
- Nvidia
- AMD
- Intel
- AWS
- TPU
- Habana
- FuriosaAI
It also offers acceleration for the following open source integrations:
- ONNX Runtime
- Exporters: export PyTorch or TensorFlow models to different formats such as ONNX or TFLite (see the export sketch after this list)
- BetterTransformer
- Torch FX
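As an illustration of the Exporters and ONNX Runtime integrations, here is a minimal sketch of loading a Transformers checkpoint as an ONNX model with Optimum; the checkpoint name is just an example, and it assumes the optimum[onnxruntime] extra described below is installed.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
# Example checkpoint; any sequence classification model on the Hub should work similarly
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# export=True converts the PyTorch weights to ONNX on the fly and runs them with ONNX Runtime
model = ORTModelForSequenceClassification.from_pretrained(checkpoint, export=True)
input_tokens = tokenizer("I love learning about AI", return_tensors="pt")
output = model(**input_tokens)
output.logits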
Installation
To install Optimum, simply run:
pip install optimum
But if you want to install it with support for the different hardware platforms, you can do it as follows:
| Accelerator | Installation |
|---|---|
| ONNX Runtime | `pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]` |
| Intel Neural Compressor | `pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]` |
| OpenVINO | `pip install --upgrade --upgrade-strategy eager optimum[openvino]` |
| NVIDIA TensorRT-LLM | `docker run -it --gpus all --ipc host huggingface/optimum-nvidia` |
| AMD Instinct GPUs and Ryzen AI NPU | `pip install --upgrade --upgrade-strategy eager optimum[amd]` |
| AWS Trainium & Inferentia | `pip install --upgrade --upgrade-strategy eager optimum[neuronx]` |
| Habana Gaudi Processor (HPU) | `pip install --upgrade --upgrade-strategy eager optimum[habana]` |
| FuriosaAI | `pip install --upgrade --upgrade-strategy eager optimum[furiosa]` |
The `--upgrade --upgrade-strategy eager` flags are needed to ensure that the different packages are upgraded to the latest possible version.
Since most people use PyTorch on Nvidia GPUs, and especially since an Nvidia GPU is what I have, this post is only going to cover the use of Optimum with Nvidia GPUs and PyTorch.
BetterTransformer
BetterTransformer is a native PyTorch optimization that achieves a x1.25 to x4 speedup in inference with Transformer-based models.
BetterTransformer is an API that allows you to take advantage of modern hardware features to accelerate the training and inference of transformer models in PyTorch, using more efficient attention implementations and the fastpath execution of the native `nn.TransformerEncoderLayer`.
BetterTransformer uses two types of accelerations:
- `Flash Attention`: an implementation of attention that uses sparsity to reduce the computational complexity. Attention is one of the most expensive operations in transformer models, and `Flash Attention` makes it more efficient.
- `Memory-Efficient Attention`: another implementation of attention that uses PyTorch's `scaled_dot_product_attention` function, which is more memory-efficient than PyTorch's standard implementation of attention.
In addition, PyTorch version 2.0 includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`.
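As a quick illustration of that operator (plain PyTorch, not an Optimum-specific API), here is a minimal sketch of calling `scaled_dot_product_attention` directly; the tensor shapes are arbitrary examples.
import torch
import torch.nn.functional as F
# Random query, key and value tensors with shape (batch, heads, sequence_length, head_dim)
query = torch.rand(1, 8, 64, 32)
key = torch.rand(1, 8, 64, 32)
value = torch.rand(1, 8, 64, 32)
# PyTorch >= 2.0 dispatches to the most efficient backend available
# (Flash Attention, memory-efficient attention or the math fallback)
attention_output = F.scaled_dot_product_attention(query, key, value)
attention_output.shape  # torch.Size([1, 8, 64, 32])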
Optimum provides this functionality together with the `Transformers` library.
Inference with AutoModel
First, let's see what normal inference looks like with `Transformers` and `AutoModel`.
from transformers import AutoTokenizer, AutoModelForCausalLM
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
tokenizer.pad_token = tokenizer.eos_token
input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)
sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
Now let's see how it would be optimized with `BetterTransformer` and `Optimum`.
What we have to do is convert the model using the `transform` method of `BetterTransformer`.
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)
tokenizer.pad_token = tokenizer.eos_token
input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)
sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
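To get an idea of the actual speedup, here is a minimal, hedged timing sketch that compares the original model with the converted one; the `measure_latency` helper is made up for this example, it assumes a CUDA device, and the numbers will depend on your GPU, the model and the input length.
import time
import torch
def measure_latency(m, n_runs=10):
    # Warm-up run so one-off initialization does not distort the measurement
    m.generate(**input_tokens, max_length=50)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        m.generate(**input_tokens, max_length=50)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
print(f"Original model:          {measure_latency(model_hf):.3f} s per generation")
print(f"BetterTransformer model: {measure_latency(model):.3f} s per generation")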
Inference with Pipeline
As before, we first see what normal inference looks like with `Transformers` and `pipeline`.
from transformers import pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased")
pipe("I am a student at [MASK] University.")
[{'score': 0.05116177722811699, 'token': 8422, 'token_str': 'stanford', 'sequence': 'i am a student at stanford university.'},
 {'score': 0.04033993184566498, 'token': 5765, 'token_str': 'harvard', 'sequence': 'i am a student at harvard university.'},
 {'score': 0.03990468755364418, 'token': 7996, 'token_str': 'yale', 'sequence': 'i am a student at yale university.'},
 {'score': 0.0361952930688858, 'token': 10921, 'token_str': 'cornell', 'sequence': 'i am a student at cornell university.'},
 {'score': 0.03303057327866554, 'token': 9173, 'token_str': 'princeton', 'sequence': 'i am a student at princeton university.'}]
Now let's see how to optimize it. For this we use the `pipeline` of `Optimum` instead of the one from `Transformers`. In addition, we must indicate that we want to use `bettertransformer` as the accelerator.
from optimum.pipelines import pipeline
# Use the BetterTransformer pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased", accelerator="bettertransformer")
pipe("I am a student at [MASK] University.")
Training
For training with `Optimum` we do the same as with `AutoModel` inference: we convert the model using the `transform` method of `BetterTransformer`.
When we finish the training, we convert the model back using the `reverse` method of `BetterTransformer` to recover the original model, so we can save it and upload it to the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)
##############################################################################
# do your training here
##############################################################################
# Convert the model back to a Hugging Face model
model_hf = BetterTransformer.reverse(model)
model_hf.save_pretrained("fine_tuned_model")
model_hf.push_to_hub("fine_tuned_model")
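As a reference for the "do your training here" placeholder above, here is a minimal, hedged sketch of what a hand-written causal language modeling step could look like on the converted model; the example text and hyperparameters are made up, and note that BetterTransformer's fused kernels do not support attention masks coming from padding during training, so this sketch avoids padded batches.
import torch
# Hypothetical toy data: a single sentence so that no padding is needed
texts = ["Me encanta aprender de Hugging Face Optimum"]
batch = tokenizer(texts, return_tensors="pt").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    optimizer.zero_grad()
    # For causal language modeling the labels are the input ids themselves
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch} - loss {outputs.loss.item():.4f}")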