Hugging Face Optimum

Optimum is an extension of the [Transformers](https://maximofn.com/hugging-face-transformers/) library that provides a set of performance optimization tools for training and inference of models on specific hardware with maximum efficiency.

The AI ecosystem evolves quickly, and more and more specialized hardware appears every day, each with its own optimizations. Optimum therefore lets users exploit any of this hardware efficiently with the same ease of use as Transformers.

This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

Optimum allows optimization for the following hardware platforms:

  • Nvidia
  • AMD
  • Intel
  • AWS
  • TPU
  • Habana
  • FuriosaAI

It also offers acceleration for the following open-source integrations:

  • ONNX runtime
  • Exporters: export PyTorch or TensorFlow models to other formats such as ONNX or TFLite (see the sketch after this list)
  • BetterTransformer
  • Torch FX
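
For example, here is a minimal sketch of the exporters in action, assuming optimum[onnxruntime] is installed and reusing the GPT-2 checkpoint that appears later in this post; the model is exported to ONNX on the fly and run with ONNX Runtime:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForCausalLM.from_pretrained("openai-community/gpt2", export=True)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

input_tokens = tokenizer("Me encanta aprender de", return_tensors="pt")
output_tokens = ort_model.generate(**input_tokens, max_length=20)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))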

Installation

To install Optimum simply run:

pip install optimum

But if you want to install it with support for the different hardware platforms, you can do it as follows:

Accelerator                          Installation
ONNX Runtime                         pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]
Intel Neural Compressor              pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
OpenVINO                             pip install --upgrade --upgrade-strategy eager optimum[openvino]
NVIDIA TensorRT-LLM                  docker run -it --gpus all --ipc host huggingface/optimum-nvidia
AMD Instinct GPUs and Ryzen AI NPU   pip install --upgrade --upgrade-strategy eager optimum[amd]
AWS Trainium & Inferentia            pip install --upgrade --upgrade-strategy eager optimum[neuronx]
Habana Gaudi Processor (HPU)         pip install --upgrade --upgrade-strategy eager optimum[habana]
FuriosaAI                            pip install --upgrade --upgrade-strategy eager optimum[furiosa]

The --upgrade --upgrade-strategy eager flags are needed to ensure that the different packages are upgraded to their latest available versions.

Since most people use PyTorch on Nvidia GPUs, and especially since an Nvidia GPU is what I have, this post only covers the use of Optimum with Nvidia GPUs and PyTorch.

BetterTransformer

BetterTransformer is a native PyTorch optimization that provides a 1.25x to 4x speedup for inference with Transformer-based models.

BetterTransformer is an API that lets you take advantage of modern hardware features to accelerate the training and inference of transformer models in PyTorch, using the fastpath of the native nn.TransformerEncoderLayer together with faster and more memory-efficient attention implementations.

BetterTransformer uses two types of accelerations:

  1. Flash Attention: an attention implementation that processes the computation in blocks and avoids materializing the full attention matrix, which greatly reduces memory reads and writes. Attention is one of the most expensive operations in transformer models, and Flash Attention makes it much more efficient.
  2. Memory-Efficient Attention: another attention implementation, available through PyTorch's scaled_dot_product_attention function, that uses far less memory than the standard PyTorch attention implementation.

In addition, PyTorch 2.0 includes a native scaled dot product attention (SDPA) operator as part of torch.nn.functional.
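
To make the idea concrete, here is a minimal sketch of roughly what the fused operator replaces: a manual attention computation that materializes the full attention matrix, versus a single call to torch.nn.functional.scaled_dot_product_attention, which lets PyTorch dispatch to Flash Attention, memory-efficient attention or the math fallback:

import math
import torch
import torch.nn.functional as F

# Random query, key and value tensors with shape (batch, heads, sequence, head_dim)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# Manual attention: builds the full (sequence x sequence) score matrix in memory
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
manual = scores.softmax(dim=-1) @ v

# Fused operator: PyTorch picks the most efficient available backend
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(manual, fused, atol=1e-5))  # same result, computed more efficiently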

Optimum provides this functionality together with the Transformers library.

Inference with AutoModel

First, let's see what normal inference looks like with Transformers and AutoModel.

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)

sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      
Out[1]:
'Me encanta aprender de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de'

Now let's see how it would be optimized with BetterTransformer and Optimum.

What we have to do is convert the model using the transform method of BetterTransformer.

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)

tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)

sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
      Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      
Out[2]:
'Me encanta aprender de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de'
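
If you want to measure the speedup on your own GPU, a quick way is to time generate on both models. This is just a rough sketch: benchmark is a helper defined here only for illustration, it reuses model_hf, model and input_tokens from the cell above, and the numbers depend entirely on your hardware, the model and the sequence length.

import time
import torch

def benchmark(m, n_runs=10):
    m.generate(**input_tokens, max_length=50)  # warm-up so CUDA kernels are ready
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        m.generate(**input_tokens, max_length=50)
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs

print(f"vanilla: {benchmark(model_hf):.3f} s/run, BetterTransformer: {benchmark(model):.3f} s/run")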

Inference with Pipeline

As before, let's first see what normal inference looks like with Transformers and Pipeline.

from transformers import pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased")
pipe("I am a student at [MASK] University.")
[{'score': 0.05116177722811699,
'token': 8422,
'token_str': 'stanford',
'sequence': 'i am a student at stanford university.'},
{'score': 0.04033993184566498,
'token': 5765,
'token_str': 'harvard',
'sequence': 'i am a student at harvard university.'},
{'score': 0.03990468755364418,
'token': 7996,
'token_str': 'yale',
'sequence': 'i am a student at yale university.'},
{'score': 0.0361952930688858,
'token': 10921,
'token_str': 'cornell',
'sequence': 'i am a student at cornell university.'},
{'score': 0.03303057327866554,
'token': 9173,
'token_str': 'princeton',
'sequence': 'i am a student at princeton university.'}]

Now let's see how to optimize it. For this we use Optimum's pipeline instead of the one from Transformers. We must also indicate that we want to use bettertransformer as the accelerator.

from optimum.pipelines import pipeline

# Use the BetterTransformer pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased", accelerator="bettertransformer")
pipe("I am a student at [MASK] University.")
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
      /home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/optimum/bettertransformer/models/encoder_models.py:868: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845868/work/aten/src/ATen/NestedTensorImpl.cpp:177.)
        hidden_states = torch._nested_tensor_from_mask(hidden_states, attn_mask)
      
Out[4]:
[{'score': 0.05116180703043938,
        'token': 8422,
        'token_str': 'stanford',
        'sequence': 'i am a student at stanford university.'},
       {'score': 0.040340032428503036,
        'token': 5765,
        'token_str': 'harvard',
        'sequence': 'i am a student at harvard university.'},
       {'score': 0.039904672652482986,
        'token': 7996,
        'token_str': 'yale',
        'sequence': 'i am a student at yale university.'},
       {'score': 0.036195311695337296,
        'token': 10921,
        'token_str': 'cornell',
        'sequence': 'i am a student at cornell university.'},
       {'score': 0.03303062543272972,
        'token': 9173,
        'token_str': 'princeton',
        'sequence': 'i am a student at princeton university.'}]

Training

For training with Optimum we do the same as for inference with AutoModel: we convert the model using the transform method of BetterTransformer.

When we finish training, we convert the model back using the reverse method of BetterTransformer to recover the original model, so that we can save it and upload it to the Hugging Face Hub.

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)
##############################################################################
# do your training here
##############################################################################
# Convert the model back to a Hugging Face model
model_hf = BetterTransformer.reverse(model)
model_hf.save_pretrained("fine_tuned_model")
model_hf.push_to_hub("fine_tuned_model")
