Hugging Face Optimum

Optimum is an extension of the [Transformers](https://maximofn.com/hugging-face-transformers/) library that provides a set of performance optimization tools for training and inference of models on specific hardware with maximum efficiency.

The AI ecosystem evolves quickly, and more and more specialized hardware appears every day, each with its own optimizations. Optimum therefore lets users exploit any of this hardware efficiently with the same ease of use as Transformers.

This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

Optimum allows optimization for the following hardware platforms:

  • Nvidia
  • AMD
  • Intel
  • AWS
  • TPU
  • Habana
  • FuriosaAI

It also offers acceleration for the following open-source integrations:

  • ONNX runtime
  • Exporters: export PyTorch or TensorFlow models to other formats such as ONNX or TFLite (see the sketch after this list)
  • BetterTransformer
  • Torch FX
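
For example, here is a minimal sketch of the exporters in action, assuming optimum[onnxruntime] is installed and reusing the GPT-2 checkpoint that appears later in this post; the model is exported to ONNX on the fly and run with ONNX Runtime:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForCausalLM.from_pretrained("openai-community/gpt2", export=True)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

input_tokens = tokenizer("Me encanta aprender de", return_tensors="pt")
output_tokens = ort_model.generate(**input_tokens, max_length=20)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))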

Installation

To install Optimum simply run:

pip install optimum

But if you want to install it with support for the different hardware platforms, you can do it as follows:

Accelerator                          Installation
ONNX Runtime                         pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]
Intel Neural Compressor              pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
OpenVINO                             pip install --upgrade --upgrade-strategy eager optimum[openvino]
NVIDIA TensorRT-LLM                  docker run -it --gpus all --ipc host huggingface/optimum-nvidia
AMD Instinct GPUs and Ryzen AI NPU   pip install --upgrade --upgrade-strategy eager optimum[amd]
AWS Trainium & Inferentia            pip install --upgrade --upgrade-strategy eager optimum[neuronx]
Habana Gaudi Processor (HPU)         pip install --upgrade --upgrade-strategy eager optimum[habana]
FuriosaAI                            pip install --upgrade --upgrade-strategy eager optimum[furiosa]

The --upgrade --upgrade-strategy eager flags are needed to ensure that the different packages are upgraded to their latest available versions.

Since most people use PyTorch on Nvidia GPUs, and especially since an Nvidia GPU is what I have, this post only covers the use of Optimum with Nvidia GPUs and PyTorch.

BetterTransformer

BetterTransformer is a native PyTorch optimization that provides a 1.25x to 4x speedup for inference with Transformer-based models.

BetterTransformer is an API that lets you take advantage of modern hardware features to accelerate the training and inference of transformer models in PyTorch, using the fastpath of the native nn.TransformerEncoderLayer together with faster and more memory-efficient attention implementations.

BetterTransformer uses two types of accelerations:

  1. Flash Attention: an attention implementation that processes the computation in blocks and avoids materializing the full attention matrix, which greatly reduces memory reads and writes. Attention is one of the most expensive operations in transformer models, and Flash Attention makes it much more efficient.
  2. Memory-Efficient Attention: another attention implementation, available through PyTorch's scaled_dot_product_attention function, that uses far less memory than the standard PyTorch attention implementation.

In addition, PyTorch 2.0 includes a native scaled dot product attention (SDPA) operator as part of torch.nn.functional.
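
To make the idea concrete, here is a minimal sketch of roughly what the fused operator replaces: a manual attention computation that materializes the full attention matrix, versus a single call to torch.nn.functional.scaled_dot_product_attention, which lets PyTorch dispatch to Flash Attention, memory-efficient attention or the math fallback:

import math
import torch
import torch.nn.functional as F

# Random query, key and value tensors with shape (batch, heads, sequence, head_dim)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# Manual attention: builds the full (sequence x sequence) score matrix in memory
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
manual = scores.softmax(dim=-1) @ v

# Fused operator: PyTorch picks the most efficient available backend
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(manual, fused, atol=1e-5))  # same result, computed more efficiently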

Optimum provides this functionality together with the Transformers library.

Inference with AutoModel

First, let's see what normal inference looks like with Transformers and AutoModel.

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)

sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      
Out[1]:
'Me encanta aprender de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de'

Now let's see how it would be optimized with BetterTransformer and Optimum.

What we have to do is convert the model using the transform method of BetterTransformer.

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)

tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)

sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
      Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      
Out[2]:
'Me encanta aprender de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de'
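
If you want to measure the speedup on your own GPU, a quick way is to time generate on both models. This is just a rough sketch: benchmark is a helper defined here only for illustration, it reuses model_hf, model and input_tokens from the cell above, and the numbers depend entirely on your hardware, the model and the sequence length.

import time
import torch

def benchmark(m, n_runs=10):
    m.generate(**input_tokens, max_length=50)  # warm-up so CUDA kernels are ready
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        m.generate(**input_tokens, max_length=50)
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs

print(f"vanilla: {benchmark(model_hf):.3f} s/run, BetterTransformer: {benchmark(model):.3f} s/run")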

Inference with Pipeline

As before, let's first see what normal inference looks like with Transformers and Pipeline.

from transformers import pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased")
pipe("I am a student at [MASK] University.")
[{'score': 0.05116177722811699,
'token': 8422,
'token_str': 'stanford',
'sequence': 'i am a student at stanford university.'},
{'score': 0.04033993184566498,
'token': 5765,
'token_str': 'harvard',
'sequence': 'i am a student at harvard university.'},
{'score': 0.03990468755364418,
'token': 7996,
'token_str': 'yale',
'sequence': 'i am a student at yale university.'},
{'score': 0.0361952930688858,
'token': 10921,
'token_str': 'cornell',
'sequence': 'i am a student at cornell university.'},
{'score': 0.03303057327866554,
'token': 9173,
'token_str': 'princeton',
'sequence': 'i am a student at princeton university.'}]

Now let's see how to optimize it. For this we use Optimum's pipeline instead of the one from Transformers. We must also indicate that we want to use bettertransformer as the accelerator.

from optimum.pipelines import pipeline

# Use the BetterTransformer pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased", accelerator="bettertransformer")
pipe("I am a student at [MASK] University.")
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
      /home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/optimum/bettertransformer/models/encoder_models.py:868: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845868/work/aten/src/ATen/NestedTensorImpl.cpp:177.)
        hidden_states = torch._nested_tensor_from_mask(hidden_states, attn_mask)
      
Out[4]:
[{'score': 0.05116180703043938,
        'token': 8422,
        'token_str': 'stanford',
        'sequence': 'i am a student at stanford university.'},
       {'score': 0.040340032428503036,
        'token': 5765,
        'token_str': 'harvard',
        'sequence': 'i am a student at harvard university.'},
       {'score': 0.039904672652482986,
        'token': 7996,
        'token_str': 'yale',
        'sequence': 'i am a student at yale university.'},
       {'score': 0.036195311695337296,
        'token': 10921,
        'token_str': 'cornell',
        'sequence': 'i am a student at cornell university.'},
       {'score': 0.03303062543272972,
        'token': 9173,
        'token_str': 'princeton',
        'sequence': 'i am a student at princeton university.'}]

Training

For training with Optimum we do the same as for inference with AutoModel: we convert the model using the transform method of BetterTransformer.

When we finish training, we convert the model back using the reverse method of BetterTransformer to recover the original model, so that we can save it and upload it to the Hugging Face Hub.

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)
##############################################################################
# do your training here
##############################################################################
# Convert the model back to a Hugging Face model
model_hf = BetterTransformer.reverse(model)
model_hf.save_pretrained("fine_tuned_model")
model_hf.push_to_hub("fine_tuned_model")
