HuggingFace Accelerate

Hugging Face Accelerate

Accelerate is a Hugging Face library that allows you to run the same PyTorch code in any distributed configuration by adding only four lines of code.

This notebook has been automatically translated to make it accessible to more people. Please let me know if you see any typos.

Installation

To install accelerate with pip simply run:

pip install accelerate

And with conda:

conda install -c conda-forge accelerate
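
To confirm from Python that the installation worked, a small check like the following can help. This is only a minimal sketch that uses the standard library; the helper name `installed_version` is hypothetical and is not part of accelerate.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("accelerate")
if version is None:
    print("accelerate is not installed in this environment")
else:
    print(f"accelerate {version} is installed")
```

If the first message is printed, the pip or conda command above was probably run in a different environment from the one the notebook is using.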

Configuration

In every environment where accelerate is installed, the first step is to configure it. To do so, we run the following in a terminal:

accelerate config
!accelerate config
In which compute environment are you running?
This machine
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
Do you wish to use FP16 or BF16 (mixed precision)?
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

In my case the answers were:

  • In which compute environment are you running?
    • [x] "This machine"
    • [_] "AWS (Amazon SageMaker)"

I want to configure it on my own computer.

  • Which type of machine are you using?
    • [_] multi-CPU
    • [_] multi-XPU
    • [x] multi-GPU
    • [_] multi-NPU
    • [_] TPU

As I have 2 GPUs and I want to run distributed code on them, I choose multi-GPU.

  • How many different machines will you use (use more than 1 for multi-node training)? [1]:
    • 1

I choose 1 because I am only going to run on my computer.

  • Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
    • no

With this option you can have accelerate check for errors during execution, but that slows it down, so I choose no. If errors appear, I will change it to yes.

  • Do you wish to optimize your script with torch dynamo? [yes/NO]:
    • no

  • Do you want to use DeepSpeed? [yes/NO]:
    • no

  • Do you want to use FullyShardedDataParallel? [yes/NO]:
    • no

  • Do you want to use Megatron-LM ? [yes/NO]:
    • no

  • How many GPU(s) should be used for distributed training? [1]:
    • 2

I choose 2 because I have 2 GPUs.

  • What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
    • 0,1

I choose 0,1 because I want to use both GPUs.

  • Do you wish to use FP16 or BF16 (mixed precision)?
    • [x] no
    • [_] fp16
    • [_] bf16
    • [_] fp8

For now I choose no: to keep the code that does not use accelerate simple, we are going to train in fp32, although ideally we would use fp16.

The configuration is stored in ~/.cache/huggingface/accelerate/default_config.yaml and can be modified at any time. Let's see what's inside:

!cat ~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
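
Because this default config is a flat list of `key: value` pairs, it can be inspected with a few lines of plain Python. The sketch below is only an illustration under some assumptions: the `parse_flat_config` helper is hypothetical, it handles only the simple one-level layout shown above (not general YAML), and a few of the values are pasted inline instead of being read from disk.

```python
CONFIG_TEXT = """\
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 0,1
mixed_precision: fp16
num_machines: 1
num_processes: 2
"""


def parse_flat_config(text):
    """Parse one-level 'key: value' lines into a dict of strings."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("'")
    return config


config = parse_flat_config(CONFIG_TEXT)
gpu_ids = [int(i) for i in config["gpu_ids"].split(",")]
print(config["distributed_type"], int(config["num_processes"]), gpu_ids)
```

For anything beyond a quick look, a real YAML parser would be the safer choice, since this helper does not understand nested keys or lists.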

Another way to see the current configuration is to run the following in a terminal:

accelerate env
!accelerate env
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

Once we have configured accelerate, we can check that we have done it right by running the following in a terminal:

accelerate test
!accelerate test
Running: accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Mixed precision type: fp16
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Mixed precision type: fp16
stdout: **Test process execution**
stdout: **Test split between processes as a list**
stdout: **Test split between processes as a dict**
stdout: **Test split between processes as a tensor**
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout: **DataLoader integration test**
stdout: 0 1 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: FP16 training check.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: **Breakpoint trigger test**
Test is a success! You are ready for your distributed training!

We see that it ends with "Test is a success! You are ready for your distributed training!", so everything is correct.

Training

Training optimization

Base code

We will first write a base training code and then optimize it, to see how it is done and how much it improves.

First let's look for a dataset. In my case I will use tweet_eval, a tweet classification dataset. Specifically, I will download the emoji subset, which classifies tweets by emoji.

from datasets import load_dataset
dataset = load_dataset("tweet_eval", "emoji")
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)

Let's take a look at the classes

['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']

And the number of classes

num_classes = len(dataset["train"].info.features["label"].names)

We see that the dataset has 20 classes

Let's see the maximum text length, in characters, of each split

max_len_train = 0
max_len_val = 0
max_len_test = 0

split = "train"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_train:
        max_len_train = len_i

split = "validation"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_val:
        max_len_val = len_i

split = "test"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_test:
        max_len_test = len_i

max_len_train, max_len_val, max_len_test
(142, 139, 167)
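
The three loops above all do the same thing, so they could be collapsed into one small helper. The following is a minimal sketch: `max_text_length` is a hypothetical name, and the toy rows stand in for a real split. With the real dataset one would pass `dataset["train"]`, `dataset["validation"]` or `dataset["test"]` instead. Note that it measures characters, not tokens.

```python
def max_text_length(rows):
    """Return the length in characters of the longest 'text' field."""
    return max((len(row["text"]) for row in rows), default=0)


# Toy stand-in for one split of the dataset (made-up rows)
toy_split = [
    {"text": "hello", "label": 0},
    {"text": "a longer tweet", "label": 1},
    {"text": "", "label": 2},
]

print(max_text_length(toy_split))
```

The `default=0` argument keeps the helper from raising an error on an empty split.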

So we set the maximum sequence length for tokenization to 130

max_len = 130

We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer

from transformers import AutoTokenizer
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

We create a tokenization function

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
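
To make it clearer what `padding="max_length"` and `truncation=True` do, here is a plain Python sketch of the idea. It is not how the tokenizer is implemented: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also takes care of special tokens when it truncates. The `pad_id=1` value is an assumption for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]        # truncation: drop anything past max_length
    n_pad = max_length - len(kept)       # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=1)
print(ids)   # [0, 713, 16, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Since every sequence ends up with the same length, the samples can later be stacked into batches of a fixed shape.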

And now we tokenize the dataset

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
Map:   0%|          | 0/45000 [00:00<?, ? examples/s]
Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

As we see now we have the tokens (input_ids) and the attention masks (attention_mask), but let's see what kind of data we have

type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])
(list, list, int)
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])
(torch.Tensor, torch.Tensor, torch.Tensor)

We create a dataloader

import torch
from torch.utils.data import DataLoader

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

We load the model

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)

Let's see what the model looks like

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
)

Let's take a look at its last layer

Linear(in_features=768, out_features=2, bias=True)
model.classifier.out_proj.in_features, model.classifier.out_proj.out_features
(768, 2)

We have seen that our dataset has 20 classes, but this model is trained for 2 classes, so we have to modify the last layer

model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
Linear(in_features=768, out_features=20, bias=True)

The last layer now has 20 outputs, one per class.

Now we create a loss function

loss_function = torch.nn.CrossEntropyLoss()

An optimizer

from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=5e-4)

And finally a metric

import evaluate
metric = evaluate.load("accuracy")

Let's check that everything is all right with a sample

sample = next(iter(dataloader["train"]))
sample["input_ids"].shape, sample["attention_mask"].shape
(torch.Size([64, 130]), torch.Size([64, 130]))

Now we put that sample into the model

model.to("cuda")
outputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))
outputs.logits.shape
torch.Size([64, 20])

We see that the output has 64 rows, which is right because we set BS = 64, and that each row has 20 values, which is right because we changed the model to output 20 values.

We obtain the class with the highest value

predictions = torch.argmax(outputs.logits, axis=-1)

We obtain the loss

loss = loss_function(outputs.logits, sample["label"].to("cuda"))

And the accuracy

accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
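
To see what these three steps compute, here is the same idea in plain Python on a toy batch. This is only a sketch with made-up numbers: the `argmax` and `cross_entropy` helpers are hypothetical stand-ins for `torch.argmax` and `torch.nn.CrossEntropyLoss` (whose default is to average over the batch), not their real implementations.

```python
import math


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def cross_entropy(logits, label):
    """Negative log-softmax of the correct class for a single sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


# Toy batch: 2 samples, 3 classes (made-up numbers)
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 1]

predictions = [argmax(row) for row in batch_logits]
loss = sum(cross_entropy(row, y) for row, y in zip(batch_logits, batch_labels)) / len(batch_labels)
accuracy = sum(p == y for p, y in zip(predictions, batch_labels)) / len(batch_labels)
print(predictions, round(loss, 4), accuracy)
```

The first sample is classified correctly and the second is not, so the accuracy of this toy batch is 0.5.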

We can now create a small training loop

from fastprogress.fastprogress import master_bar, progress_bar

epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
    # Training loop
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # Backpropagate and update the weights
        loss.backward()
        optimizer.step()
    # Validation loop
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Accumulate this batch; the accuracy is computed once per epoch
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

Script with the code baselink image 34

Most of the accelerate documentation explains how to use accelerate with scripts, so for the moment we will do it that way, and at the end we will explain how to do it in a notebook.

First we are going to create a folder where we are going to save the scripts

!mkdir accelerate_scripts

Now we write the base code in a script

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Load the dataset and count its classes
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

# Tokenize every split to a fixed length
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

# Load the model and replace its 2-class head with one for our classes
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    # Training loop
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # Backpropagate and update the weights
        loss.backward()
        optimizer.step()
    # Validation loop
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/

And now we run it

!python accelerate_scripts/
Accuracy = 0.2112
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s

We can see that on my computer it took about 3.5 minutes.

Code with accelerate

Now we replace some things

  • First we import Accelerator and initialize it.
from accelerate import Accelerator
accelerator = Accelerator()
  • We no longer do the typical
torch.device("cuda" if torch.cuda.is_available() else "cpu")
  • Instead, we let accelerate choose the device by means of
device = accelerator.device
  • We pass the relevant elements for training through the prepare method
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
  • We no longer send the data and the model to the GPU with .to(device), since accelerate has taken care of it with the prepare method.

  • Instead of doing the backpropagation with loss.backward(), we let accelerate do it with accelerator.backward(loss).

  • When calculating the metric in the validation loop, we need to gather the predictions and labels from all processes, in case we are doing distributed training.
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# prepare places the model, optimizer and dataloaders on the right device(s)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    # Training loop: no .to(device) calls, prepare has taken care of it
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
    # Validation loop
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions and labels from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/

If you notice I have added these two lines print(f "End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}") and the line print(f "End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}"), I added them on purpose because they will reveal something very important

Now we run it. Accelerate scripts are run with the accelerate launch command.

accelerate launch
!accelerate launch accelerate_scripts/
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s

Before, it took about three and a half minutes, and now it takes about two and a half (2 min 37 s of wall time). Quite an improvement. Also, if we look at the prints, we can see that each one has been printed twice.

How can this be? Accelerate has parallelized the training across my two GPUs, running one process on each, so the epoch finishes much faster and every print is executed once per process.
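The shapes in the validation prints are worth a closer look. The logits are local to each process and keep the full per-device batch size of 64, while labels has already passed through gather_for_metrics and holds only 8 entries. A plausible reading, assuming Accelerate's default behaviour of padding the last batch so that every process receives a full one and of dropping those duplicated samples in gather_for_metrics, is that only 8 real samples were left for the final validation step. The sketch below checks that arithmetic for the 5,000 validation examples; the helper name last_step_samples is made up for illustration.

```python
import math


def last_step_samples(num_samples, batch_size, num_processes):
    """Return (steps per epoch, real samples left for the final step)."""
    # Every step consumes one batch per process
    global_batch = batch_size * num_processes
    steps = math.ceil(num_samples / global_batch)
    # Samples that are still unseen when the final step starts
    remaining = num_samples - (steps - 1) * global_batch
    return steps, remaining


steps, remaining = last_step_samples(num_samples=5000, batch_size=64, num_processes=2)
print(steps, remaining)  # 40 8
```

The 8 remaining samples match the labels.shape of torch.Size([8]) printed at the end of validation.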

Also, when I ran the first script, the one without Accelerate, the GPU was almost full, whereas with the second one, the one using Accelerate, both GPUs were barely used. So we can increase the batch size to try to fill both of them. Let's go for it!

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/

I have removed the extra prints, because we have already seen that the code is running on both GPUs, and I have increased the batch size from 64 to 128.

!accelerate launch accelerate_scripts/
Accuracy = 0.1052
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s

Increasing the batch size has reduced the execution time by about 15 seconds, from 2 min 37 s to 2 min 22 s.
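One way to see why the gain is modest: the batch_size passed to the DataLoader applies to each process, so (with Accelerate's default settings, as far as I understand them) two GPUs consume BS × 2 samples per step. Doubling BS halves the number of steps, but each step processes twice as much data. A rough sketch for the 45,000 training examples, using a hypothetical helper steps_per_epoch:

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, num_processes):
    # Effective (global) batch = per-device batch size x number of processes
    effective_batch = per_device_batch_size * num_processes
    return math.ceil(num_samples / effective_batch)


for bs in (64, 128):
    print(f"BS={bs}: {steps_per_epoch(45_000, bs, 2)} steps per epoch")
```

With BS = 128 this gives 176 steps, the same length as the tqdm training bar that appears in the later runs.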

Process execution

Execution of code in a single process

Earlier we saw that the prints appeared twice. This is because Accelerate creates one process per device on which the code runs; in my case it creates two processes because I have two GPUs.

However, not all code should be executed in every process. For example, prints slow the code down too much to be worth running several times, and checkpoints would be saved twice.

To execute part of the code in a single process, encapsulate it in a function and decorate it with accelerator.on_local_main_process. For example, in the following code you will see that I created this function:

      @accelerator.on_local_main_process
      def print_something(something):
          print(something)

Another option is to put the code inside an if accelerator.is_local_main_process block, as in the following code:

if accelerator.is_local_main_process:
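To make the two options concrete without needing several GPUs, here is a simplified stand-in for the decorator. It is not Accelerate's implementation: it assumes the launcher exports a LOCAL_RANK environment variable for each process (as torchrun-style launchers do), and the names local_rank, on_local_main_process and run_as are hypothetical helpers written for this sketch.

```python
import os
from functools import wraps


def local_rank():
    # Assumption: the launcher sets LOCAL_RANK for every process it starts
    return int(os.environ.get("LOCAL_RANK", "0"))


def on_local_main_process(func):
    """Hypothetical stand-in: run func only when LOCAL_RANK is 0."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if local_rank() == 0:
            return func(*args, **kwargs)
        return None  # every other process skips the call
    return wrapper


@on_local_main_process
def print_something(something):
    print(something)
    return something


def run_as(rank):
    # Simulate being the process with the given local rank
    os.environ["LOCAL_RANK"] = str(rank)
    return print_something(f"hello from local rank {rank}")


run_as(0)  # prints the message
run_as(1)  # prints nothing
```

The if accelerator.is_local_main_process guard is the same check written inline, which is handy when the code to protect is only a line or two.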
%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_something(something):
    print(something)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/

Let's run it and see

!accelerate launch accelerate_scripts/
Accuracy = 0.2098
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s

Now the message has been printed only once.

However, although it is hard to notice, the progress bars are still executed in every process.

I have not found a way to avoid this with fastprogress progress bars, but there is one with tqdm. So I will replace the fastprogress bars with tqdm bars and, to make them run in a single process, add the argument disable=not accelerator.is_local_main_process.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_something(something):
    print(something)

for i in range(EPOCHS):
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/
!accelerate launch accelerate_scripts/
100%|█████████████████████████████████████████| 176/176 [02:01<00:00, 1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s

Printing was just an example of how to run code in a single process. If all you want is to print from a single process, though, you can use Accelerate's own print method, accelerator.print. Let's repeat the previous example with it.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# print(f"Accuracy = {accuracy['accuracy']}")
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
Writing accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00, 1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s

Code execution in all processes

However, some code must be executed only once for all processes across all machines, for example uploading the checkpoints to the Hub. Here we again have two options. The first is to encapsulate the code in a function and decorate it with accelerator.on_main_process:

      @accelerator.on_main_process
      def do_my_thing():
          "Something done once across all servers"

or put the code inside an if accelerator.is_main_process.

if accelerator.is_main_process:

Since we are training only to demonstrate the Accelerate library and the resulting model is not good, it makes no sense to upload the checkpoints to the Hub right now, so I will illustrate this with prints instead.
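The difference between the local main process and the main process only shows up with more than one machine, which my two-GPU setup cannot demonstrate. The sketch below enumerates the processes of a made-up cluster to show which ones each flag would select. It assumes the usual torchrun-style numbering, where the global index is machine index × GPUs per machine + local index; describe_processes is a hypothetical helper, and the dictionary keys simply mirror the Accelerate attribute names.

```python
def describe_processes(num_machines, gpus_per_machine):
    """List every process with its global and local index (assumed numbering)."""
    processes = []
    for machine in range(num_machines):
        for local_index in range(gpus_per_machine):
            process_index = machine * gpus_per_machine + local_index
            processes.append({
                "process_index": process_index,
                "local_process_index": local_index,
                # One process in the whole job
                "is_main_process": process_index == 0,
                # One process on each machine
                "is_local_main_process": local_index == 0,
            })
    return processes


for p in describe_processes(num_machines=2, gpus_per_machine=2):
    print(p)
```

With two machines of two GPUs each, processes 0 and 2 are local main processes, but only process 0 is the main process. On a single machine both flags select the same process, which is why the prints in the run below appear once either way.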

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

for i in range(EPOCHS):
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/

We run it to see

!accelerate launch accelerate_scripts/
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00, 1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s

Execution of code in process X

Finally, we can specify which process should execute the code. To do so, create a function and decorate it with @accelerator.on_process(process_index=0):

      @accelerator.on_process(process_index=0)
      def do_my_thing():
          "Something done on process index 0"

or decorate it with @accelerator.on_local_process(local_process_index=0):

      @accelerator.on_local_process(local_process_index=0)
      def do_my_thing():
          "Something done on process index 0 on each server"

Here I have used process 0, but you can use any process index.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)

@accelerator.on_process(process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
Overwriting accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02<00:00, 1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s

Synchronize processes

When some code runs in every process, it is often useful to wait until all of them have finished it before moving on to another task. For this we use accelerator.wait_for_everyone().

To see it in action, we are going to add a delay to one of the print functions in one process.

I have also put a break in the training loop so that it does not spend too much time training, which is not what we are interested in right now.
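Conceptually, wait_for_everyone() acts as a barrier: no process continues until every process has reached it. The following sketch imitates that behaviour with threads and Python's threading.Barrier instead of Accelerate processes, so it is an analogy under that assumption rather than what the library does internally. The function run_workers and its parameters are hypothetical names.

```python
import threading
import time


def run_workers(num_workers=2, delay=0.2):
    """Run workers that meet at a barrier; return the order of their events."""
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    events = []

    def worker(index):
        if index == 0:
            time.sleep(delay)  # worker 0 is deliberately slow
        with lock:
            events.append(("before", index))
        barrier.wait()  # plays the role of accelerator.wait_for_everyone()
        with lock:
            events.append(("after", index))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


print(run_workers())
```

Even though worker 0 is delayed, every "before" event is recorded ahead of any "after" event, because the fast worker is held at the barrier until the slow one arrives.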

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
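@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)
@accelerator.on_process(process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)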
EPOCHS = 1
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
print_in_one_process("End of script")
Overwriting accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14666.25 examples/s]
0%| | 0/176 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script

As you can see, Process 1: End of process 1 is printed first and everything else comes afterwards. This is because the rest of the prints run either in process 0 or in all processes, so they are not executed until the 2-second delay we set has finished.
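To make the idea of "run this only in process N" more concrete, here is a minimal sketch of a rank-gated decorator. It is a simplified stand-in and not Accelerate's implementation: the `on_process` helper below is hypothetical, and it assumes the launcher exposes the process index through a `RANK` environment variable.

```python
import functools
import os


def on_process(process_index):
    """Run the decorated function only in the process with the given index.

    Hypothetical helper: the index is read from the RANK environment
    variable, which we assume the launcher sets for every process.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            current = int(os.environ.get("RANK", "0"))
            if current == process_index:
                return func(*args, **kwargs)
            return None  # every other process skips the call
        return wrapper
    return decorator


@on_process(0)
def report_0(message):
    return "Process 0: " + message


@on_process(1)
def report_1(message):
    return "Process 1: " + message


os.environ["RANK"] = "1"  # pretend we are the second process
print(report_0("hello"))  # skipped in this process
print(report_1("hello"))  # executed in this process
```

Every process calls both functions, but each call only does something in the process it was registered for, which is the behavior the decorators in the script rely on.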

Save and load the state dict

When we train, we sometimes save the state so that we can resume training later.

To save the state we use the save_state() method, and to restore it we use load_state().

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
def print_something(something):
    accelerator.print(something)
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
EPOCHS = 1
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    # Save the weights
    accelerator.save_state("accelerate_scripts/checkpoint")  # example output directory
print_something(f"Accuracy = {accuracy['accuracy']}")
# Load the weights
accelerator.load_state("accelerate_scripts/checkpoint")  # same example directory
Overwriting accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
100%|█████████████████████████████████████████| 176/176 [01:58<00:00, 1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.40it/s]
Accuracy = 0.2142

Save the model

When we called the prepare method, the model was wrapped so that it could be placed on the necessary devices. So to save it we have to use the save_model method, which first unwraps it and then saves it. In addition, if we pass safe_serialization=True, the model is saved in the safetensors format.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
def print_something(something):
    accelerator.print(something)
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
EPOCHS = 1
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    # Save the model
    accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)
print_something(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
100%|█████████████████████████████████████████| 176/176 [01:58<00:00, 1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.35it/s]
Accuracy = 0.214
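As a rough illustration of what "unwrapping" means in this section, the sketch below uses plain Python classes. PyTorch's DistributedDataParallel keeps the original model in a `.module` attribute; the `DistributedWrapper`, `TinyModel` and `unwrap` names here are hypothetical stand-ins for that pattern, not Accelerate's code.

```python
class TinyModel:
    """Hypothetical stand-in for the model we actually want to save."""
    def __init__(self):
        self.weights = {"classifier.weight": [0.1, 0.2]}


class DistributedWrapper:
    """Hypothetical stand-in for a distributed wrapper holding the model in `.module`."""
    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Peel off wrapper layers until the original model is reached."""
    while isinstance(model, DistributedWrapper):
        model = model.module
    return model


model = TinyModel()
prepared = DistributedWrapper(model)   # roughly what happens to the model after prepare
original = unwrap(prepared)            # what has to happen before saving
print(original is model)
print(original.weights)
```

Saving the wrapper instead of the original object would store the extra layer too, which is why the model is unwrapped before it is written to disk.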

Save the pretrained model

For models from the transformers library we must save the model with the save_pretrained method so that it can later be loaded with the from_pretrained method. Before saving it we must unwrap it with the unwrap_model method.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
def print_something(something):
    accelerator.print(something)
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
EPOCHS = 1
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    # Save the pretrained model
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained("accelerate_scripts/model_pretrained", is_main_process=accelerator.is_main_process, save_function=accelerator.save)
print_something(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15152.47 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15119.13 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12724.70 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12397.49 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15247.21 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15138.03 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:59<00:00, 1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.37it/s]
Accuracy = 0.21

Now we can load it

from transformers import AutoModel
checkpoints = "accelerate_scripts/model_pretrained"
model = AutoModel.from_pretrained(checkpoints)
Some weights of RobertaModel were not initialized from the model checkpoint at accelerate_scripts/model_pretrained and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training on notebooks

So far we have seen how to run scripts, but if we want to run the code in a notebook, we can write the same code as before, encapsulated in a function.

First we import the libraries

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# from accelerate import Accelerator

Now we create the function

def train_code(batch_size: int = 64):
    from accelerate import Accelerator
    accelerator = Accelerator()
    dataset = load_dataset("tweet_eval", "emoji")
    num_classes = len(dataset["train"].info.features["label"].names)
    max_len = 130
    checkpoints = "cardiffnlp/twitter-roberta-base-irony"
    tokenizer = AutoTokenizer.from_pretrained(checkpoints)
    def tokenize_function(dataset):
        return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    tokenized_dataset = {
        "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
    }
    tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
    tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
    tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
    BS = batch_size
    dataloader = {
        "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
        "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
        "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
    }
    model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
    model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=5e-4)
    metric = evaluate.load("accuracy")
    EPOCHS = 1
    # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = accelerator.device
    model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
    for i in range(EPOCHS):
        model.train()
        progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
        for batch in progress_bar_train:
            optimizer.zero_grad()
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            labels = batch["label"]
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_function(outputs['logits'], labels)
            # loss.backward()
            accelerator.backward(loss)
            optimizer.step()
        model.eval()
        progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
        for batch in progress_bar_validation:
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            labels = batch["label"]
            with torch.no_grad():
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = torch.argmax(outputs['logits'], axis=-1)
            # Gather the predictions from all devices
            predictions = accelerator.gather_for_metrics(predictions)
            labels = accelerator.gather_for_metrics(labels)
            metric.add_batch(predictions=predictions, references=labels)
        accuracy = metric.compute()
    accelerator.print(f"Accuracy = {accuracy['accuracy']}")

To run the training in the notebook we use the notebook_launcher function. We pass it the function we want to run, the arguments of that function, and the number of GPUs we are going to train on through the num_processes parameter.

from accelerate import notebook_launcher
args = (128,)
notebook_launcher(train_code, args, num_processes=2)
Launching training on 2 GPUs.
100%|██████████| 176/176 [02:01<00:00,  1.45it/s]
100%|██████████| 20/20 [00:06<00:00,  3.31it/s]
Accuracy = 0.2112

Training in FP16

When we first configured accelerate it asked us "Do you wish to use FP16 or BF16 (mixed precision)?" and we said no. Now we are going to say yes: we want to use FP16.

So far we have trained in FP32, which means that each number is stored as a 32-bit floating point value. With FP16 mixed precision, most of the forward and backward computation is done with 16-bit floating point numbers, which take half the space. Two things follow from this: we can use a larger batch size, and training is faster.
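To put a number on the space saving, here is a small back-of-the-envelope calculation. It is only a sketch: the `storage_megabytes` helper is hypothetical, and the count of 125 million values is an assumed, illustrative size for a base-sized transformer, not a measurement from this training.

```python
def storage_megabytes(num_values, bits_per_value):
    """Memory needed to store num_values numbers of the given bit width, in MB."""
    return num_values * bits_per_value / 8 / 1e6


NUM_VALUES = 125_000_000  # assumed, illustrative number of stored values

fp32_mb = storage_megabytes(NUM_VALUES, 32)
fp16_mb = storage_megabytes(NUM_VALUES, 16)
print(f"FP32: {fp32_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")
# FP32: 500 MB, FP16: 250 MB
```

Whatever the real count is, the same number of values takes half the memory in 16 bits, and that freed memory is what lets us increase the batch size.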

First we run accelerate config again and tell it that we want FP16

!accelerate config
In which compute environment are you running?
This machine
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
Do you wish to use FP16 or BF16 (mixed precision)?
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we create a script to train, with the same batch size as before, to see if it takes less time to train

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
EPOCHS = 1
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/

We run it and see how long it takes

!accelerate launch accelerate_scripts/
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14983.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14315.47 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:01<00:00, 2.88it/s]
100%|███████████████████████████████████████████| 20/20 [00:02<00:00, 6.84it/s]
Accuracy = 0.2094
CPU times: user 812 ms, sys: 163 ms, total: 976 ms
Wall time: 1min 27s

When we ran this training in FP32 it took about two and a half minutes, and now it takes about a minute and a half. Let's see what happens if, instead of training with a batch size of 128, we use a batch size of 256.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 256
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
EPOCHS = 1
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15390.30 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14990.92 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:54<00:00, 1.62it/s]
100%|███████████████████████████████████████████| 10/10 [00:02<00:00, 3.45it/s]
Accuracy = 0.2236
CPU times: user 670 ms, sys: 91.6 ms, total: 761 ms
Wall time: 1min 12s

The run time has dropped by only about 15 seconds.

BF16 Training

So far we have trained in FP16; now we are going to train in BF16. What is the difference?


As we can see in the picture, FP16 has fewer bits than FP32 in both the exponent (5 versus 8) and the mantissa (10 versus 23), which makes its range much smaller. BF16 keeps the same 8 exponent bits as FP32 but has only 7 mantissa bits, so it covers the same range of numbers as FP32, with less precision.

This is beneficial because some calculations can produce numbers that are too large to be represented in FP16. In addition, some hardware devices are optimized for the BF16 format.
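To put numbers on this comparison, here is a small sketch in plain Python that derives the largest finite value and the gap between 1.0 and the next representable number from the bit layout alone. It assumes the usual layouts (FP32: 8 exponent bits and 23 mantissa bits; FP16: 5 and 10; BF16: 8 and 7) and an IEEE-style encoding in which the highest exponent code is reserved for inf/NaN. The name `float_format_stats` is just an illustrative helper, not part of any library.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Return (largest finite value, machine epsilon) for an IEEE-style float layout."""
    # Exponent bias; the all-ones exponent code is reserved for inf/NaN
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest finite value: all mantissa bits set, at the largest usable exponent
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Gap between 1.0 and the next representable number
    epsilon = 2.0 ** -mantissa_bits
    return max_value, epsilon


# (exponent bits, mantissa bits) for each format
FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}

for name, (exponent_bits, mantissa_bits) in FORMATS.items():
    max_value, epsilon = float_format_stats(exponent_bits, mantissa_bits)
    print(f"{name}: max = {max_value:.4e}, epsilon = {epsilon:.4e}")
```

Under these assumptions FP16 tops out at 65504, while BF16 reaches roughly the same maximum as FP32; in exchange, the steps of BF16 near 1.0 are eight times coarser than those of FP16.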

As before, we execute accelerate config and indicate that we want BF16.

!accelerate config
In which compute environment are you running?
This machine
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
Do you wish to use FP16 or BF16 (mixed precision)?
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we run the last script we created, i.e. with a batch size of 256

!accelerate launch accelerate_scripts/
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14814.95 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14506.83 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:51<00:00, 1.70it/s]
100%|███████████████████████████████████████████| 10/10 [00:03<00:00, 3.21it/s]
Accuracy = 0.2112
CPU times: user 688 ms, sys: 144 ms, total: 832 ms
Wall time: 1min 17s

It took a similar time to what it took before, which is normal, since we have trained a model with 16-bit weights, just like before.

Training in FP8

Now we are going to train in FP8, which, as its name suggests, is a floating-point format in which each weight takes 8 bits. We run accelerate config and tell it that we want FP8.

!accelerate config
In which compute environment are you running?
This machine
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
Do you wish to use FP16 or BF16 (mixed precision)?
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we run the last script, the one with a batch size of 256.

!accelerate launch accelerate_scripts/
Traceback (most recent call last):
File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/", line 12, in <module>
accelerator = Accelerator()
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/", line 371, in __init__
self.state = AcceleratorState(
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/", line 790, in __init__
raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.
Traceback (most recent call last):
File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/", line 12, in <module>
accelerator = Accelerator()
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/", line 371, in __init__
self.state = AcceleratorState(
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/", line 790, in __init__
raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.
[2024-05-13 21:40:56,455] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 501480) of binary: /home/wallabot/miniconda3/envs/nlp/bin/python
Traceback (most recent call last):
File "/home/wallabot/miniconda3/envs/nlp/bin/accelerate", line 8, in <module>
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/", line 46, in main
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/", line 1048, in launch_command
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/", line 702, in multi_gpu_launcher
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/", line 803, in run
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/", line 268, in launch_agent
raise ChildFailedError(
accelerate_scripts/ FAILED
time : 2024-05-13_21:40:56
host : wallabot
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 501481)
error_file: <N/A>
traceback : To enable traceback see:
Root Cause (first observed failure):
time : 2024-05-13_21:40:56
host : wallabot
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 501480)
error_file: <N/A>
traceback : To enable traceback see:
CPU times: user 65.1 ms, sys: 14.5 ms, total: 79.6 ms
Wall time: 7.24 s
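
The traceback above says that FP8 needs one of two extra packages. A minimal sketch for checking whether either is present before launching could look like this. It assumes the import names are `transformer_engine` and `msamp` (the second one is an assumption about how MS-AMP is imported, so check its documentation), and `available_fp8_backends` is a hypothetical helper, not part of accelerate.

```python
import importlib.util


def available_fp8_backends(candidates=("transformer_engine", "msamp")):
    """Return the candidate module names that can be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


backends = available_fp8_backends()
if backends:
    print(f"FP8 backend(s) found: {', '.join(backends)}")
else:
    # With fp8 selected in the config, Accelerator() is expected to fail as in the traceback
    print("No FP8 backend found; install one before launching with fp8.")
```

The check only looks for the packages; it does not verify that the GPU itself supports FP8.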

As the weights are now 8 bits and occupy half of the memory, we will increase the batch size to 512.

%%writefile accelerate_scripts/
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 512
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
# Replace the classification head so it outputs one logit per emoji class
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
EPOCHS = 1  # assumed value: the earlier runs show a single pass over the training set
metric = evaluate.load("accuracy")

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# Let Accelerate wrap the model, optimizer and dataloaders for the configured setup
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)  # Accelerate's replacement for loss.backward()
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/

We run it

!accelerate launch accelerate_scripts/

Model inference

Using the Hugging Face Ecosystem

Let's see how to run inference on large models with Hugging Face's transformers library.

Inference with pipeline

If we use the Hugging Face ecosystem it is very simple, since everything happens under the hood without us having to do much. With pipeline, which is the easiest way to do inference with the transformers library, we simply have to tell it which model we want to use and, very importantly, pass device_map="auto". This makes accelerate distribute the model across the available GPUs, CPU RAM, or the hard disk if necessary.

There are more possible values for device_map, which we will see later, but for now we will stick with "auto".

We are going to use the Llama3 8B model, which, as its name indicates, has about 8 billion parameters. Each parameter is stored in FP32 by default, which takes 4 bytes (32 bits), so multiplying 8 billion parameters by 4 bytes means we would need a GPU with about 32 GB of VRAM.
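
The same arithmetic can be written as a small sketch. It counts only the weights (activations and any generation cache are ignored), uses the rounded figure of 8 billion parameters from the text, and takes 1 GB as 10^9 bytes; `weights_memory_gb` is an illustrative helper name.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 8e9  # rounded parameter count, as in the text
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weights_memory_gb(num_params, dtype):.0f} GB")
```

Treat the result as a lower bound: the real footprint during inference is somewhat higher than the weights alone.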

In my case I have 2 GPUs with 24 GB of VRAM each, so the model would not fit on a single GPU. But thanks to device_map="auto", accelerate will distribute the model between the two GPUs and I will be able to run inference.
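
To get an intuition for what splitting a model across devices means, here is a toy sketch that assigns layers, in order, to the first device that still has room. This is a deliberate simplification and not accelerate's actual placement logic; the layer sizes, the memory budgets and the function `toy_device_map` are all made up for illustration.

```python
def toy_device_map(layer_sizes_gb, budgets_gb):
    """Greedily assign layers, in order, to the first device that still has room."""
    device_map = {}
    devices = list(budgets_gb)
    remaining = dict(budgets_gb)
    current = 0
    for layer, size in layer_sizes_gb.items():
        # Move on to the next device once the current one cannot hold this layer
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"No device has room for {layer}")
        device = devices[current]
        device_map[layer] = device
        remaining[device] -= size
    return device_map


# 32 equally sized blocks of 1 GB each (made-up numbers, about 32 GB in total)
layers = {f"block_{i}": 1.0 for i in range(32)}
# Two 24 GB GPUs with some headroom kept free on each, then CPU RAM as a fallback
budgets = {"cuda:0": 20.0, "cuda:1": 20.0, "cpu": 64.0}

placement = toy_device_map(layers, budgets)
print(placement["block_0"], placement["block_19"], placement["block_20"], placement["block_31"])
```

With these numbers the first GPU fills up first and the remaining blocks land on the second one, so nothing needs to be offloaded to the CPU.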

%%writefile accelerate_scripts/
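from transformers import pipeline

checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
# device_map="auto" lets accelerate spread the model over the available devices
generator = pipeline(model=checkpoints, device_map="auto")

# Spanish for "Do you know Hugging Face's accelerate?"
prompt = "Conoces accelerate de hugging face?"
output = generator(prompt)
print(output)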
Overwriting accelerate_scripts/

Now we run it. Since pipeline uses accelerate under the hood, we don't need to run it with accelerate launch; plain python will do.

!python accelerate_scripts/
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00, 2.27s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
[{'generated_text': 'Conoces accelerate de hugging face? ¿Qué es el modelo de lenguaje de transformers y cómo se utiliza en el marco de hugging face? ¿Cómo puedo utilizar modelos de lenguaje de transformers en mi aplicación? ¿Qué son los tokenizers y cómo se utilizan en el marco de hugging face? ¿Cómo puedo crear un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los datasets y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar datasets para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? 
¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? 
¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? 
¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? 
¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? 
¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? 
¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? 
¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? 
¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? 
¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning'}]

As you can see, it did not answer, but kept asking questions. This is because Llama3 is a language model that predicts the next token, so for the prompt I passed it, it decided that the most likely next tokens were more questions. This makes sense, because people who have doubts about a topic often ask many questions in a row. To get an answer, we have to condition the model a little.

%%writefile accelerate_scripts/
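from transformers import pipeline

checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")

prompt = "Conoces accelerate de hugging face?"
messages = [
    {
        # System message (Spanish): "You are a friendly chatbot that always tries to resolve doubts"
        "role": "system",
        "content": "Eres un chatbot amigable que siempre intenta solucionar las dudas",
    },
    {"role": "user", "content": f"{prompt}"},
]
output = generator(messages)
# The last message of the returned conversation is the assistant's reply
print(output[0]["generated_text"][-1])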
Overwriting accelerate_scripts/

As you can see, we have built a list of messages with roles: a system message that conditions the model, followed by the user message with our prompt.

!python accelerate_scripts/
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00, 2.41s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '¡Hola! Sí, conozco Accelerate de Hugging Face. Accelerate es una biblioteca de Python desarrollada por Hugging Face que se enfoca en simplificar y acelerar el entrenamiento y la evaluación de modelos de lenguaje en diferentes dispositivos y entornos. Con Accelerate, puedes entrenar modelos de lenguaje en diferentes plataformas y dispositivos, como GPUs, TPUs, CPUs y servidores, sin necesidad de cambiar el código de tu modelo. Esto te permite aprovechar al máximo la potencia de cálculo de tus dispositivos y reducir el tiempo de entrenamiento. Accelerate también ofrece varias características adicionales, como: * Soporte para diferentes frameworks de machine learning, como TensorFlow, PyTorch y JAX. * Integración con diferentes sistemas de almacenamiento y procesamiento de datos, como Amazon S3 y Google Cloud Storage. * Soporte para diferentes protocolos de comunicación, como HTTP y gRPC. * Herramientas para monitorear y depurar tus modelos en tiempo real. En resumen, Accelerate es una herramienta muy útil para desarrolladores de modelos de lenguaje que buscan simplificar y acelerar el proceso de entrenamiento y evaluación de sus modelos. ¿Tienes alguna pregunta específica sobre Accelerate o necesitas ayuda para implementarlo en tu proyecto?'}

Now the model does respond to our prompt.
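
To see why the roles make a difference, it helps to look at the single string the model ends up completing. The sketch below renders the messages by hand in an approximation of the Llama 3 chat format; in practice the tokenizer's chat template does this for you (for example through `tokenizer.apply_chat_template`), and the exact template may differ from this hand-written version. `render_llama3_style_prompt` is an illustrative name.

```python
def render_llama3_style_prompt(messages):
    """Flatten a list of role/content messages into one prompt string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    # Open an assistant turn so the model continues with an answer, not another question
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "Eres un chatbot amigable que siempre intenta solucionar las dudas"},
    {"role": "user", "content": "Conoces accelerate de hugging face?"},
]
print(render_llama3_style_prompt(messages))
```

Because the string ends with an open assistant turn, the most likely continuation is a reply to the user rather than a list of further questions.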

Inference with AutoClass

Finally we will see how to do the inference only with AutoClass.

%%writefile accelerate_scripts/
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoints, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(checkpoints, device_map="auto")
streamer = TextStreamer(tokenizer)
prompt = "Conoces accelerate de hugging face?"
tokens_input = tokenizer([prompt], return_tensors="pt").to(model.device)
_ = model.generate(**tokens_input, streamer=streamer, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)
Overwriting accelerate_scripts/

As you can see, the streamer object has been created and then passed to the generate method of the model. This is useful so that each word is printed as it is generated and you don't have to wait for all the output to be generated before printing it.
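
To make the streaming idea concrete, here is a toy sketch in plain Python that does not use transformers at all. The names `ToyStreamer` and `toy_generate` are hypothetical stand-ins, and the `put`/`end` method names are an assumption chosen to resemble a streamer interface. The fake generation loop hands each piece of text to the streamer as soon as it is produced, which is the same pattern as passing `streamer` to `generate`.

```python
class ToyStreamer:
    """Hypothetical stand-in for a streamer: prints each piece as it arrives."""

    def __init__(self):
        self.pieces = []  # kept only so the result can be inspected afterwards

    def put(self, piece):
        self.pieces.append(piece)
        print(piece, end="", flush=True)

    def end(self):
        print()  # finish the line once generation is over


def toy_generate(words, streamer):
    """Fake generation loop: emits one word at a time instead of a full string."""
    for i, word in enumerate(words):
        streamer.put(word if i == 0 else " " + word)
    streamer.end()
    return "".join(streamer.pieces)


streamer = ToyStreamer()
text = toy_generate(["Accelerate", "streams", "tokens"], streamer)
```

With a real model the pieces would be decoded tokens rather than whole words, but the user-facing effect is the same: text appears while generation is still running.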

!python accelerate_scripts/
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00, 2.28s/it]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
<|begin_of_text|>Conoces accelerate de hugging face? Si es así, puedes utilizar la biblioteca `transformers` de Hugging Face para crear un modelo de lenguaje que pueda predecir la siguiente palabra en una secuencia de texto.
Aquí te muestro un ejemplo de cómo hacerlo:
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Cargar el modelo y el tokenizador
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Cargar el conjunto de datos
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Preprocesar los datos
train_texts = train_df["text"]
train_labels = train_df["label"]
test_texts = test_df["text"]
# Convertir los textos en entradas para el modelo
train_encodings = tokenizer.batch_encode_plus(train_texts,
test_encodings = tokenizer.batch_encode_plus(test_texts,
# Crear un dataloader para entrenar el modelo
train_dataset =["input_ids"],
train_loader =, batch_size=32, shuffle=True)
# Entrenar el modelo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(5):
total_loss = 0
for batch in train_loader:
input_ids = batch[0].to(device)
attention_mask = batch[1].to(device)
labels = batch[2].to(device)
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = criterion(outputs, labels)
total_loss += loss.item()
print(f"Epoch {epoch+1}, Loss: {total

Use PyTorch

Normally, inference with PyTorch is done by creating a model with randomly initialized weights and then loading a state dict with the weights of the pre-trained model. To get that state dict, we first use a small trick: download the pre-trained model and save its weights.

import torch
import torchvision.models as models
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)  # Download the pre-trained model
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')  # Save its state dict to disk
Downloading: "" to /home/maximo.fernandez/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|██████████| 230M/230M [02:48<00:00, 1.43MB/s]

Now that we have the state dict, let's run inference the way it is normally done in PyTorch.

import torch
import torchvision.models as models
device = "cuda" if torch.cuda.is_available() else "cpu" # Set device
resnet152 = models.resnet152().to(device) # Create model with random weights and move to device
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device) # Load pretrained weights into device memory
resnet152.load_state_dict(state_dict) # Load these weights into the model
input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1
output = resnet152(input)
output.shape
torch.Size([1, 1000])

Let us explain what happened

  • When we did resnet152 = models.resnet152().to(device) a resnet152 with random weights was loaded into GPU memory.
  • When we did state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device) a dictionary with the trained weights was loaded into GPU memory.
  • When we did resnet152.load_state_dict(state_dict) these pre-trained weights were assigned to the model.

In other words, the weights were held in GPU memory twice: once in the randomly initialized model and once in the state dict.

You may be wondering why we first did this:

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')

And then this:

resnet152 = models.resnet152().to(device)
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)

Why don't we directly use

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)

and skip saving the state dict to load it later? Because PyTorch does the same thing under the hood. To make the whole process visible, we have done in several lines what PyTorch does in one.

This way of working has worked well until now, as long as the models had a manageable size for user GPUs. But since the advent of LLMs this approach does not make sense.

For example, a 6B-parameter model stored in FP32 (4 bytes per parameter) would occupy 24 GB of memory, and since this approach loads the weights twice, a 48 GB GPU would be required.
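
The arithmetic behind these figures can be checked with a short sketch. The helper name `model_memory_gb` and its parameters are illustrative assumptions, not part of PyTorch or accelerate. It assumes 4 bytes per parameter by default (FP32), uses decimal gigabytes (1 GB = 10^9 bytes), and counts only the weights, ignoring activations and framework overhead.

```python
def model_memory_gb(n_params, bytes_per_param=4, copies=1):
    """Approximate memory (decimal GB) needed to hold `copies` of the weights."""
    return n_params * bytes_per_param * copies / 1e9


# 6B parameters in FP32 (4 bytes each), held once
single = model_memory_gb(6_000_000_000)
# Same model when the random-weight model and the state dict coexist
double = model_memory_gb(6_000_000_000, copies=2)
# Same model in FP16/BF16 (2 bytes each), held once
half = model_memory_gb(6_000_000_000, bytes_per_param=2)

print(single, double, half)
```

The three values come out as 24, 48 and 12 GB, which shows that avoiding the second copy saves as much memory as halving the precision.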

To fix this, a pre-trained PyTorch model can be loaded as follows:

  • Create an empty model with init_empty_weights, which will not occupy RAM.
  • Then load the weights with load_checkpoint_and_dispatch, which loads a checkpoint into the empty model and distributes the weights of each layer across all available devices (GPU, CPU RAM and hard disk), thanks to setting device_map="auto".

import torch
import torchvision.models as models
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
with init_empty_weights():
    resnet152 = models.resnet152()  # Skeleton only, no real weights allocated
resnet152 = load_checkpoint_and_dispatch(resnet152, checkpoint='accelerate_scripts/resnet152_pretrained.pth', device_map="auto")
device = "cuda" if torch.cuda.is_available() else "cpu"
input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1
output = resnet152(input)
output.shape
torch.Size([1, 1000])

How accelerate works under the hood

In this video you can see graphically how accelerate works under the hood.

Initialization of an empty model

Accelerate creates the skeleton of an empty model using init_empty_weights to occupy as little memory as possible.

For example, let's see how much RAM I now have available on my computer

import psutil
def get_ram_info():
    # Read total, available and used system RAM and print them in GB (1024-based)
    ram = dict(psutil.virtual_memory()._asdict())
    print(f"Total RAM: {(ram['total']/1024/1024/1024):.2f} GB, Available RAM: {(ram['available']/1024/1024/1024):.2f} GB, Used RAM: {(ram['used']/1024/1024/1024):.2f} GB")
get_ram_info()
Total RAM: 31.24 GB, Available RAM: 22.62 GB, Used RAM: 7.82 GB

I have about 22 GB of RAM available.

Now let's create a model made of 1000 linear layers with 5000x1000 weights each, i.e. about 5B parameters. With each parameter in FP32 (4 bytes), that means about 20 GB of RAM.

import torch
from torch import nn
model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])

If we look at RAM again

get_ram_info()
Total RAM: 31.24 GB, Available RAM: 3.77 GB, Used RAM: 26.70 GB

As we can see, we now have less than 4 GB of RAM available.
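
The observed drop can be cross-checked against the model's parameter count. This is a rough sketch: the helper name `linear_stack_params` is hypothetical, it counts the weight matrix and the bias of each `nn.Linear(5000, 1000)`, and it converts bytes with the same 1024-based division that `get_ram_info` uses.

```python
def linear_stack_params(in_features, out_features, n_layers):
    """Parameters in a stack of identical linear layers (weights + biases)."""
    per_layer = in_features * out_features + out_features
    return per_layer * n_layers


n_params = linear_stack_params(5000, 1000, 1000)
size_bytes = n_params * 4          # FP32: 4 bytes per parameter
size_gib = size_bytes / 1024**3    # same unit as get_ram_info()

print(f"{n_params:,} parameters, {size_gib:.2f} GB")
```

The estimate is about 18.6 GB, close to the difference between the two readings above (22.62 GB available before, 3.77 GB after). The remainder is plausibly due to other processes and allocation overhead.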

Now let's delete the model to free RAM

del model
get_ram_info()
Total RAM: 31.24 GB, Available RAM: 22.44 GB, Used RAM: 8.03 GB

We again have about 22 GB of RAM available.

Now let's create the same model inside init_empty_weights from accelerate and check the RAM again.

from accelerate import init_empty_weights
with init_empty_weights():
    model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])
get_ram_info()
Total RAM: 31.24 GB, Available RAM: 22.32 GB, Used RAM: 8.16 GB

Before we had exactly 22.44 GB free and after creating the model with init_empty_weights we have 22.32 GB. The saving in RAM is enormous! Almost no RAM has been used to create the model.

This relies on the meta device introduced in PyTorch 1.9, so accelerate needs PyTorch 1.9 or later for this to work.

Loading weights

Once we have initialized the model we have to load the weights using load_checkpoint_and_dispatch which, as its name indicates, loads the weights and sends them to the device or devices required.
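
As a rough illustration of what dispatching across devices means, the toy sketch below fills devices in priority order and spills the remaining layers onto the next one. Everything in it (the function `toy_device_map`, the layer names, the device names and the sizes) is hypothetical. It is not accelerate's actual placement algorithm, which is more elaborate.

```python
def toy_device_map(layer_sizes, device_capacities):
    """Greedily assign layers, in order, to devices listed by priority.

    layer_sizes: dict of layer name -> size in GB (insertion order is kept)
    device_capacities: dict of device name -> free space in GB, fastest first
    """
    devices = list(device_capacities.items())
    device_map = {}
    idx, free = 0, devices[0][1]
    for name, size in layer_sizes.items():
        # Move on to the next device until the layer fits
        while size > free:
            idx += 1
            if idx == len(devices):
                raise MemoryError(f"no device left for layer {name!r}")
            free = devices[idx][1]
        device_map[name] = devices[idx][0]
        free -= size
    return device_map


layers = {"layer0": 4, "layer1": 4, "layer2": 4, "layer3": 4}
capacities = {"gpu:0": 10, "cpu": 5, "disk": 100}
placement = toy_device_map(layers, capacities)
print(placement)
```

In this toy run the first two layers land on the GPU, the third on CPU RAM and the fourth on disk, which mirrors the GPU, then CPU RAM, then hard disk order described earlier.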
