Hugging Face Accelerate
Accelerate is a Hugging Face library that allows you to run the same PyTorch code in any distributed configuration by adding only four lines of code.
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
Installation
To install accelerate
with pip
simply run:
pip install accelerate
And with conda
:
conda install -c conda-forge accelerate
Configuration
In every environment in which accelerate
is installed, the first thing to do is to configure it, for that we execute in a terminal:
accelerate config
!accelerate config
--------------------------------------------------------------------------------In which compute environment are you running?This machine--------------------------------------------------------------------------------multi-GPUHow many different machines will you use (use more than 1 for multi-node training)? [1]: 1Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: noDo you wish to optimize your script with torch dynamo?[yes/NO]:noDo you want to use DeepSpeed? [yes/NO]: noDo you want to use FullyShardedDataParallel? [yes/NO]: noDo you want to use Megatron-LM ? [yes/NO]: noHow many GPU(s) should be used for distributed training? [1]:2What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1--------------------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?noaccelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml
In my case the answers have been
- In which compute environment are you running?
- "This machine"
- [_] "AWS (Amazon SageMaker)"
I want to configure it on my computer
- Which type of machine are you using?
- [_] multi-CPU
- [_] multi-XPU
- x] multi-GPU
- [_] multi-NPU
- [_] TPU
As I have 2 GPUs and I want to run distributed codes on them I choose
multi-GPU
.
- How many different machines will you use (use more than 1 for multi-node training)? [1]:
- 1
I choose
1
because I am only going to run on my computer.
- Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
- no
With this option, you can choose to have
accelerate
check for errors on execution, but it would slow it down, so I chooseno
, and in case there are errors I change it toyes
.
Do you wish to optimize your script with torch dynamo? [yes/NO]:
- no
Do you want to use FullyShardedDataParallel? [yes/NO]:
- no
Do you want to use Megatron-LM ? [yes/NO]:
- no
How many GPU(s) should be used for distributed training? [1]:
I choose
2
because I have 2 GPUs
- What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
- 0,1
I choose
0,1
because I want to use both GPUs.
- Do you wish to use FP16 or BF16 (mixed precision)?
- x] no
- [_] fp16
- [_] bf16
- [_] fp8
For the moment I choose
no
, because to simplify the code when not usingaccelerate
we are going to train on fp32, but ideally we should use fp16
The configuration will be stored in ~/.cache/huggingface/accelerate/default_config.yaml
and can be modified at any time. Let's see what's inside
!cat ~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINEdebug: falsedistributed_type: MULTI_GPUdowncast_bf16: 'no'gpu_ids: 0,1machine_rank: 0main_training_function: mainmixed_precision: fp16num_machines: 1num_processes: 2rdzv_backend: staticsame_network: truetpu_env: []tpu_use_cluster: falsetpu_use_sudo: falseuse_cpu: false
Another way to see the configuration we have is to run it in a terminal:
accelerate env
!accelerate env
Copy-and-paste the text below in your GitHub issue- `Accelerate` version: 0.28.0- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31- Python version: 3.11.8- Numpy version: 1.26.4- PyTorch version (GPU?): 2.2.1+cu121 (True)- PyTorch XPU available: False- PyTorch NPU available: False- System RAM: 31.24 GB- GPU type: NVIDIA GeForce RTX 3090- `Accelerate` default config:- compute_environment: LOCAL_MACHINE- distributed_type: MULTI_GPU- mixed_precision: fp16- use_cpu: False- debug: False- num_processes: 2- machine_rank: 0- num_machines: 1- gpu_ids: 0,1- rdzv_backend: static- same_network: True- main_training_function: main- downcast_bf16: no- tpu_use_cluster: False- tpu_use_sudo: False- tpu_env: []
Once we have configured accelerate
we can test if we have done it right by running it in a terminal:
accelerate test
!accelerate test
Running: accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.pystdout: **Initialization**stdout: Testing, testing. 1, 2, 3.stdout: Distributed environment: DistributedType.MULTI_GPU Backend: ncclstdout: Num processes: 2stdout: Process index: 0stdout: Local process index: 0stdout: Device: cuda:0stdout:stdout: Mixed precision type: fp16stdout:stdout: Distributed environment: DistributedType.MULTI_GPU Backend: ncclstdout: Num processes: 2stdout: Process index: 1stdout: Local process index: 1stdout: Device: cuda:1stdout:stdout: Mixed precision type: fp16stdout:stdout:stdout: **Test process execution**stdout:stdout: **Test split between processes as a list**stdout:stdout: **Test split between processes as a dict**stdout:stdout: **Test split between processes as a tensor**stdout:stdout: **Test random number generator synchronization**stdout: All rng are properly synched.stdout:stdout: **DataLoader integration test**stdout: 0 1 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'>stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>stdout: Non-shuffled dataloader passing.stdout: Shuffled dataloader passing.stdout: Non-shuffled central dataloader passing.stdout: Shuffled central dataloader passing.stdout:stdout: **Training integration test**stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Training yielded the same results on one CPU or distributed setup with no batch split.stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Training yielded the same results on one CPU or distributes setup with batch split.stdout: FP16 training check.stdout: FP16 training check.stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Keep fp32 wrapper check.stdout: Keep fp32 wrapper check.stdout: BF16 training check.stdout: BF16 training check.stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout:stdout: Training yielded the same results on one CPU or distributed setup with no batch split.stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: FP16 training check.stdout: Training yielded the same results on one CPU or distributes setup with batch split.stdout: FP16 training check.stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Keep fp32 wrapper check.stdout: Keep fp32 wrapper check.stdout: BF16 training check.stdout: BF16 training check.stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32stdout:stdout: **Breakpoint trigger test**Test is a success! You are ready for your distributed training!
We see that it ends saying Test is a success! You are ready for your distributed training!
so everything is correct.
Training
Training optimization
Base code
We will first make a base training code and then optimize it to see how it is done and how it improves.
First let's look for a dataset, in my case I will use the dataset tweet_eval, which is a tweet classification dataset, specifically I will download the subset emoji
which classifies tweets with emoticons.
from datasets import load_datasetdataset = load_dataset("tweet_eval", "emoji")dataset
DatasetDict({train: Dataset({features: ['text', 'label'],num_rows: 45000})test: Dataset({features: ['text', 'label'],num_rows: 50000})validation: Dataset({features: ['text', 'label'],num_rows: 5000})})
dataset["train"].info
DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)
Let's take a look at the classes
print(dataset["train"].info.features["label"].names)
['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']
And the number of classes
num_classes = len(dataset["train"].info.features["label"].names)num_classes
20
We see that the dataset has 20 classes
Let's see the maximum sequence of each split
max_len_train = 0max_len_val = 0max_len_test = 0split = "train"for i in range(len(dataset[split])):len_i = len(dataset[split][i]["text"])if len_i > max_len_train:max_len_train = len_isplit = "validation"for i in range(len(dataset[split])):len_i = len(dataset[split][i]["text"])if len_i > max_len_val:max_len_val = len_isplit = "test"for i in range(len(dataset[split])):len_i = len(dataset[split][i]["text"])if len_i > max_len_test:max_len_test = len_imax_len_train, max_len_val, max_len_test
(142, 139, 167)
So we define the maximum sequence in general as 130 for tokeniaztion
max_len = 130
We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer
max_len = 130from transformers import AutoTokenizercheckpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)
We create a tokenization function
max_len = 130from transformers import AutoTokenizercheckpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
And now we tokenize the dataset
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
As we see now we have the tokens (input_ids
) and the attention masks (attention_mask
), but let's see what kind of data we have
max_len = 130from transformers import AutoTokenizercheckpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])
Map: 0%| | 0/45000 [00:00<?, ? examples/s](list, list, int)
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])
(torch.Tensor, torch.Tensor, torch.Tensor)
We create a dataloader
import torchfrom torch.utils.data import DataLoaderBS = 64dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}
We load the model
import torchfrom torch.utils.data import DataLoaderBS = 64dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}from transformers import AutoModelForSequenceClassificationmodel = AutoModelForSequenceClassification.from_pretrained(checkpoints)
Let's see what the model looks like
import torchfrom torch.utils.data import DataLoaderBS = 64dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}from transformers import AutoModelForSequenceClassificationmodel = AutoModelForSequenceClassification.from_pretrained(checkpoints)model
RobertaForSequenceClassification((roberta): RobertaModel((embeddings): RobertaEmbeddings((word_embeddings): Embedding(50265, 768, padding_idx=1)(position_embeddings): Embedding(514, 768, padding_idx=1)(token_type_embeddings): Embedding(1, 768)(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))(encoder): RobertaEncoder((layer): ModuleList((0-11): 12 x RobertaLayer((attention): RobertaAttention((self): RobertaSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): RobertaSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): RobertaIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True)(intermediate_act_fn): GELUActivation())(output): RobertaOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))))))(classifier): RobertaClassificationHead((dense): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False)(out_proj): Linear(in_features=768, out_features=2, bias=True)))
Let's take a look at its last layer
model.classifier.out_proj
Linear(in_features=768, out_features=2, bias=True)
model.classifier.out_proj.in_features, model.classifier.out_proj.out_features
(768, 2)
We have seen that our dataset has 20 classes, but this model is trained for 2 classes, so we have to modify the last layer
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)model.classifier.out_proj
Linear(in_features=768, out_features=20, bias=True)
Now it is
Now we create a loss function
loss_function = torch.nn.CrossEntropyLoss()
An optimizer
loss_function = torch.nn.CrossEntropyLoss()from torch.optim import Adamoptimizer = Adam(model.parameters(), lr=5e-4)
And finally a metric
loss_function = torch.nn.CrossEntropyLoss()from torch.optim import Adamoptimizer = Adam(model.parameters(), lr=5e-4)import evaluatemetric = evaluate.load("accuracy")
Let's check that everything is all right with a sample
loss_function = torch.nn.CrossEntropyLoss()from torch.optim import Adamoptimizer = Adam(model.parameters(), lr=5e-4)import evaluatemetric = evaluate.load("accuracy")sample = next(iter(dataloader["train"]))
loss_function = torch.nn.CrossEntropyLoss()from torch.optim import Adamoptimizer = Adam(model.parameters(), lr=5e-4)import evaluatemetric = evaluate.load("accuracy")sample = next(iter(dataloader["train"]))sample["input_ids"].shape, sample["attention_mask"].shape
(torch.Size([64, 130]), torch.Size([64, 130]))
Now we put that sample into the model
model.to("cuda")ouputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))ouputs.logits.shape
torch.Size([64, 20])
We see that the model outputs 64 batches, which is fine, because we set BS = 20
and each with 20 outputs, which is fine because we changed the model to output 20 values.
We obtain the one with the highest value
predictions = torch.argmax(ouputs.logits, axis=-1)predictions.shape
torch.Size([64])
We obtain the loss
loss = loss_function(ouputs.logits, sample["label"].to("cuda"))loss.item()
2.9990389347076416
And the accuracy
accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]accuracy
0.015625
We can now create a small training loop
from fastprogress.fastprogress import master_bar, progress_bar
epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
model.train()
progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["label"].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
master_progress_bar.child.comment = f'loss: {loss}'
loss.backward()
optimizer.step()
model.eval()
progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
for batch in progress_bar_validation:
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["label"].to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
Script with the code base
In most of the accelerate
documentation it is explained how to use accelerate
with scripts, so for the moment we will do it like this and at the end we will explain how to do it with a notebook
First we are going to create a folder where we are going to save the scripts
from fastprogress.fastprogress import master_bar, progress_barepochs = 1device = torch.device("cuda" if torch.cuda.is_available() else "cpu")model.to(device)master_progress_bar = master_bar(range(epochs))for i in master_progress_bar:model.train()progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"].to(device)attention_mask = batch["attention_mask"].to(device)labels = batch["label"].to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)master_progress_bar.child.comment = f'loss: {loss}'loss.backward()optimizer.step()model.eval()progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)for batch in progress_bar_validation:input_ids = batch["input_ids"].to(device)attention_mask = batch["attention_mask"].to(device)labels = batch["label"].to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "!mkdir accelerate_scripts
Now we write the base code in a script
from fastprogress.fastprogress import master_bar, progress_barepochs = 1device = torch.device("cuda" if torch.cuda.is_available() else "cpu")model.to(device)master_progress_bar = master_bar(range(epochs))for i in master_progress_bar:model.train()progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"].to(device)attention_mask = batch["attention_mask"].to(device)labels = batch["label"].to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)master_progress_bar.child.comment = f'loss: {loss}'loss.backward()optimizer.step()model.eval()progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)for batch in progress_bar_validation:input_ids = batch["input_ids"].to(device)attention_mask = batch["attention_mask"].to(device)labels = batch["label"].to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"!mkdir accelerate_scripts%%writefile accelerate_scripts/01_code_base.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluatefrom fastprogress.fastprogress import master_bar, progress_bardataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 64dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1device = torch.device("cuda" if torch.cuda.is_available() else "cpu")model.to(device)master_progress_bar = master_bar(range(EPOCHS))for i in master_progress_bar:model.train()progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"].to(device)attention_mask = batch["attention_mask"].to(device)labels = batch["label"].to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)master_progress_bar.child.comment = f'loss: {loss}'loss.backward()optimizer.step()model.eval()progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)for batch in progress_bar_validation:input_ids = batch["input_ids"].to(device)attention_mask = batch["attention_mask"].to(device)labels = batch["label"].to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/01_code_base.py
And now we run it
%%time!python accelerate_scripts/01_code_base.py
Accuracy = 0.2112CPU times: user 2.12 s, sys: 391 ms, total: 2.51 sWall time: 3min 36s
We can see that on my computer it took about 3.5 minutes.
Code with accelerate
Now we replace some things
- First we import
Accelerator
and initialize it.
from accelerate import Accelerator
accelerator = Accelerator()
- We no longer do the typical
torch.device("cuda" if torch.cuda.is_available() else "cpu")
- Instead, we let
accelerate
choose the device by means of
device = accelerator.device
- We pass the relevant elements for training through the
prepare
method and no longer domodel.to(device)
.
model, optimizer, dataloader["train"], dataloader["validation"] = preprare(model, optimizer, dataloader["train"], dataloader["validation"])
We no longer send the data and the model to the GPU with
.to(device)
sinceaccelerate
has taken care of it with theprepare
method.Instead of doing the backpropagation with
loss.backward()
we letaccelerate
do it withloss.backward()
.
accelerator.backward(loss)
- When calculating the metric in the validation loop, we need to collect the values of all the points, in case we are doing a distributed training.
predictions = accelerator.gather_for_metrics(predictions)
%%writefile accelerate_scripts/02_accelerate_base_code.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluatefrom fastprogress.fastprogress import master_bar, progress_bar# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 64dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])master_progress_bar = master_bar(range(EPOCHS))for i in master_progress_bar:model.train()progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)master_progress_bar.child.comment = f'loss: {loss}'# loss.backward()accelerator.backward(loss)optimizer.step()print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")model.eval()progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/02_accelerate_base_code.py
If you notice I have added these two lines print(f "End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
and the line print(f "End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
, I added them on purpose because they will reveal something very important
Now we execute it, to execute the accelerate
scripts it is done with the command accelerate launch
.
accelerate launch script.py
%%time!accelerate launch accelerate_scripts/02_accelerate_base_code.py
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])Accuracy = 0.206End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])Accuracy = 0.206CPU times: user 1.6 s, sys: 272 ms, total: 1.88 sWall time: 2min 37s
We see that before it took about 3 and a half minutes and now it takes about 2 and a half minutes. Quite an improvement. Also if we look at the print
s we can see that they have been printed twice.
And how can this be? Because accelerate
has parallelized the training on the two GPUs I have, so it has been much faster.
Also, when I ran the first script, that is, when I did not use accelerate
, the GPU was almost full, while when I ran the second one, that is, the one using accelerate
, the two GPUs were very little used, so we can increase the batch size to try to fill both of them, let's go for it!
%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluatefrom fastprogress.fastprogress import master_bar, progress_bar# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])master_progress_bar = master_bar(range(EPOCHS))for i in master_progress_bar:model.train()progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)master_progress_bar.child.comment = f'loss: {loss}'# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py
I have removed the extra prints, because we have already seen that the code is running on both GPUs, and I have increased the batch size from 64 to 128.
%%time!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py
Accuracy = 0.1052Accuracy = 0.1052CPU times: user 1.41 s, sys: 180 ms, total: 1.59 sWall time: 2min 22s
Increasing the batch size has reduced the execution time by a few seconds.
Process execution
Execution of code in a single process
Before we have seen that the print
s were printed twice, this is because accelerate
creates as many processes as devices where the code is executed, in my case it creates two processes because I have two GPUs.
However, not all code should be executed in all processes, for example, the print
s slow down the code too much to execute it several times, if checkpoints are saved, they would be saved twice, etc.
In order to execute part of a code in a single process you have to encapsulate it in a function and decorate it with accelerator.on_local_main_process
, for example in the following code you will see that I created the following function
@accelerator.on_local_main_process
def print_something(something):
print(something)
Another option is to put the code inside an if accelerator.is_local_main_process
as in the following code
if accelerator.is_local_main_process:
print("Something")
%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluatefrom fastprogress.fastprogress import master_bar, progress_bar# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])@accelerator.on_local_main_processdef print_something(something):print(something)master_progress_bar = master_bar(range(EPOCHS))for i in master_progress_bar:model.train()progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)master_progress_bar.child.comment = f'loss: {loss}'# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"# print(f"Accuracy = {accuracy['accuracy']}")print_something(f"Accuracy = {accuracy['accuracy']}")if accelerator.is_local_main_process:print(f"End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
Let's run it and see
%%time!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
Accuracy = 0.2098End of script with 0.2098 accuracyCPU times: user 1.38 s, sys: 197 ms, total: 1.58 sWall time: 2min 22s
Now the print has only been printed once
However, although you don't see much, progress bars are executed in each process.
I have not found a way to avoid this with fastprogress
progress bars, but with tqdm
progress bars, so I will replace fastprogress
progress bars with tqdm
progress bars and to make them run in a single process add the argument disable=not accelerator.is_local_main_process
.
%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])@accelerator.on_local_main_processdef print_something(something):print(something)for i in range(EPOCHS):model.train()# progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# master_progress_bar.child.comment = f'loss: {loss}'# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()# progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()# print(f"Accuracy = {accuracy['accuracy']}")print_something(f"Accuracy = {accuracy['accuracy']}")if accelerator.is_local_main_process:print(f"End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
%%time!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
100%|█████████████████████████████████████████| 176/176 [02:01<00:00, 1.45it/s]100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.30it/s]Accuracy = 0.2166End of script with 0.2166 accuracyCPU times: user 1.33 s, sys: 195 ms, total: 1.52 sWall time: 2min 22s
We have shown an example of how to print in a single process, and this has been a way to execute processes in a single process. But if you just want to print in a single process you can use the print
method of accelerate
. Let's see the same example of before with this method
%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()# progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# master_progress_bar.child.comment = f'loss: {loss}'# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()# progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()# print(f"Accuracy = {accuracy['accuracy']}")accelerator.print(f"Accuracy = {accuracy['accuracy']}")if accelerator.is_local_main_process:print(f"End of script with {accuracy['accuracy']} accuracy")
Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py
We run it
%%time!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15433.52 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 11406.61 examples/s]Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15036.87 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14932.76 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14956.60 examples/s]100%|█████████████████████████████████████████| 176/176 [02:00<00:00, 1.46it/s]100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.33it/s]Accuracy = 0.2134End of script with 0.2134 accuracyCPU times: user 1.4 s, sys: 189 ms, total: 1.59 sWall time: 2min 27s
Code execution in all processes
However there is code that must be executed in all processes, for example if we upload the checkpoints to the hub, so here we have two options, encapsulate the code in a function and decorate it with accelerator.on_main_process
.
@accelerator.on_main_process
def do_my_thing():
"Something done once per server"
do_thing_once()
or put the code inside an if accelerator.is_main_process
.
if accelerator.is_main_process:
repo.push_to_hub()
As we are training just to show the accelerate
library and the model we are training is not good, there is no sense now to upload the checkpoints to the hub, so I am going to make an example with print
s
%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])@accelerator.on_local_main_processdef print_in_one_process(something):print(something)@accelerator.on_main_processdef print_in_all_processes(something):print(something)for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()print_in_one_process(f"Accuracy = {accuracy['accuracy']}")if accelerator.is_local_main_process:print(f"End of script with {accuracy['accuracy']} accuracy")print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")if accelerator.is_main_process:print(f"All process: End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/06_accelerate_base_code_some_code_in_all_process.py
We run it to see
%%time!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14518.44 examples/s]Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14368.77 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 16466.33 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14806.14 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14253.33 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14337.07 examples/s]100%|█████████████████████████████████████████| 176/176 [02:00<00:00, 1.46it/s]100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.34it/s]Accuracy = 0.2092End of script with 0.2092 accuracyAll process: Accuracy = 0.2092All process: End of script with 0.2092 accuracyCPU times: user 1.42 s, sys: 216 ms, total: 1.64 sWall time: 2min 27s
Execution of code in the process X
Finally we can specify in which process we want to execute code, for this we must create a function and decorate it with @accelerator.on_process(process_index=0)
.
@accelerator.on_process(process_index=0)
def do_my_thing():
"Something done on process index 0"
do_thing_on_index_zero()
or decorate it with @accelerator.on_local_process(local_process_idx=0)
.
@accelerator.on_local_process(local_process_index=0)
def do_my_thing():
"Something done on process index 0 on each server".
do_thing_on_index_zero_on_each_server()
Here I have put the process 0, but you can put any number
%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])@accelerator.on_local_main_processdef print_in_one_process(something):print(something)@accelerator.on_main_processdef print_in_all_processes(something):print(something)@accelerator.on_process(process_index=0)def print_in_process_0(something):print("Process 0: " + something)@accelerator.on_local_process(local_process_index=1)def print_in_process_1(something):print("Process 1: " + something)for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()print_in_one_process(f"Accuracy = {accuracy['accuracy']}")if accelerator.is_local_main_process:print(f"End of script with {accuracy['accuracy']} accuracy")print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")if accelerator.is_main_process:print(f"All process: End of script with {accuracy['accuracy']} accuracy")print_in_process_0("End of process 0")print_in_process_1("End of process 1")
Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_some_process.py
We run it
%%time!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15735.58 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14906.20 examples/s]100%|█████████████████████████████████████████| 176/176 [02:02<00:00, 1.44it/s]100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.27it/s]Process 1: End of process 1Accuracy = 0.2128End of script with 0.2128 accuracyAll process: Accuracy = 0.2128All process: End of script with 0.2128 accuracyProcess 0: End of process 0CPU times: user 1.42 s, sys: 295 ms, total: 1.71 sWall time: 2min 37s
Synchronize processes
If we have code that must be executed in all processes, it is interesting to wait for it to finish in all processes before doing another task, so for this we use accelerator.wait_for_everyone()
.
To see this we are going to put a delay in one of the print functions in a process
I've also put a break in the training loop so that he doesn't spend too much time training, which is not what we're interested in right now.
%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdmimport time# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])@accelerator.on_local_main_processdef print_in_one_process(something):print(something)@accelerator.on_main_processdef print_in_all_processes(something):print(something)@accelerator.on_process(process_index=0)def print_in_process_0(something):time.sleep(2)print("Process 0: " + something)@accelerator.on_local_process(local_process_index=1)def print_in_process_1(something):print("Process 1: " + something)for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()breakmodel.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()print_in_one_process(f"Accuracy = {accuracy['accuracy']}")if accelerator.is_local_main_process:print(f"End of script with {accuracy['accuracy']} accuracy")print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")if accelerator.is_main_process:print(f"All process: End of script with {accuracy['accuracy']} accuracy")print_in_one_process("Printing with delay in process 0")print_in_process_0("End of process 0")print_in_process_1("End of process 1")accelerator.wait_for_everyone()print_in_one_process("End of script")
Overwriting accelerate_scripts/08_accelerate_base_code_sync_all_process.py
We run it
!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14218.23 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14666.25 examples/s]0%| | 0/176 [00:00<?, ?it/s]100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.58it/s]Process 1: End of process 1Accuracy = 0.212End of script with 0.212 accuracyAll process: Accuracy = 0.212All process: End of script with 0.212 accuracyPrinting with delay in process 0Process 0: End of process 0End of script
As you can see first we have printed Process 1: End of process 1
and then the rest, this is because the rest of the prints are made either in process 0 or in all processes, so until the 2 seconds delay we have set is not finished the rest of the code is not executed.
Save and load the state dict
When we train, we sometimes save the state so that we can continue at a later time.
To save the state we will have to use the save_state()
and load_state()
methods.
%%writefile accelerate_scripts/10_accelerate_save_and_load_checkpoints.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")@accelerator.on_local_main_processdef print_something(something):print(something)EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()# Guardamos los pesosaccelerator.save_state("accelerate_scripts/checkpoints")print_something(f"Accuracy = {accuracy['accuracy']}")# Cargamos los pesosaccelerator.load_state("accelerate_scripts/checkpoints")
Overwriting accelerate_scripts/09_accelerate_save_and_load_checkpoints.py
We run it
!accelerate launch accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
100%|█████████████████████████████████████████| 176/176 [01:58<00:00, 1.48it/s]100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.40it/s]Accuracy = 0.2142
Save the model
When the prepare
method was used, the model was wrapped in order to save it to the necessary devices. So when saving it we have to use the save_model
method which first unwraps it and then saves it. Also if we use the safe_serialization=True
parameter the model will be saved as a safe tensor
.
%%writefile accelerate_scripts/11_accelerate_save_model.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")@accelerator.on_local_main_processdef print_something(something):print(something)EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()# Guardamos el modeloaccelerator.wait_for_everyone()accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)print_something(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/11_accelerate_save_model.py
We run it
!accelerate launch accelerate_scripts/11_accelerate_save_model.py
100%|█████████████████████████████████████████| 176/176 [01:58<00:00, 1.48it/s]100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.35it/s]Accuracy = 0.214
Save the pretrained
model
In models that use the transformers
library we must save the model with the save_pretrained
method to be able to load it with the from_pretrained
method. Before saving it we must unwrap it with the unwrap_model
method.
%%writefile accelerate_scripts/12_accelerate_save_pretrained.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")@accelerator.on_local_main_processdef print_something(something):print(something)EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()# Guardamos el modelo pretrainedunwrapped_model = accelerator.unwrap_model(model)unwrapped_model.save_pretrained("accelerate_scripts/model_pretrained",is_main_process=accelerator.is_main_process,save_function=accelerator.save,)print_something(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/11_accelerate_save_pretrained.py
We run it
!accelerate launch accelerate_scripts/12_accelerate_save_pretrained.py
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15152.47 examples/s]Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15119.13 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12724.70 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12397.49 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15247.21 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15138.03 examples/s]100%|█████████████████████████████████████████| 176/176 [01:59<00:00, 1.48it/s]100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.37it/s]Accuracy = 0.21
Now we could load it
from transformers import AutoModelcheckpoints = "accelerate_scripts/model_pretrained"tokenizer = AutoModel.from_pretrained(checkpoints)
Some weights of RobertaModel were not initialized from the model checkpoint at accelerate_scripts/model_pretrained and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training on notebooks
So far we have seen how to run scripts, but if you want to run the code on a notebook, we can write the same code as before, but encapsulated in a function
First we import the libraries
import torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdmimport time# from accelerate import Accelerator
Now we create the function
import torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdmimport time# from accelerate import Acceleratordef train_code(batch_size: int = 64):from accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = batch_sizedataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()accelerator.print(f"Accuracy = {accuracy['accuracy']}")
In order to run the training on the notebook we use the notebook_launcher
function, to which we pass the function we want to run, the arguments of that function and the number of GPUs on which we are going to train with the variable num_processes
.
from accelerate import notebook_launcher
args = (128,)
notebook_launcher(train_code, args, num_processes=2)
Training in FP16
When we first set up accelerate
it asked us Do you wish to use FP16 or BF16 (mixed precision)?
and we said no, so now we are going to say yes, we want to use FP16.
So far we have trained in FP32, which means that each weight of the model is a 32-bit floating point number, and now we are going to use a 16-bit floating point number, that is, the model will occupy less space. So two things will happen, we will be able to use a larger batch size and it will also be faster.
First we re-launch accelerate config
and we will tell it that we want FP16
import torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdmimport time# from accelerate import Acceleratordef train_code(batch_size: int = 64):from accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = batch_sizedataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()accelerator.print(f"Accuracy = {accuracy['accuracy']}")from accelerate import notebook_launcherargs = (128,)notebook_launcher(train_code, args, num_processes=2)!accelerate config
Launching training on 2 GPUs.--------------------------------------------------------------------------------In which compute environment are you running?This machine--------------------------------------------------------------------------------multi-GPUHow many different machines will you use (use more than 1 for multi-node training)? [1]: 1Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: noDo you wish to optimize your script with torch dynamo?[yes/NO]:noDo you want to use DeepSpeed? [yes/NO]: noDo you want to use FullyShardedDataParallel? [yes/NO]: noDo you want to use Megatron-LM ? [yes/NO]: noHow many GPU(s) should be used for distributed training? [1]:2What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1--------------------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?fp16accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml
Now we create a script to train, with the same batch size as before, to see if it takes less time to train
%%writefile accelerate_scripts/13_accelerate_base_code_fp16_bs128.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 128dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/12_accelerate_base_code_fp16_bs128.py
We run it and see how long it takes
%%time!accelerate launch accelerate_scripts/13_accelerate_base_code_fp16_bs128.py
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14983.76 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14315.47 examples/s]100%|█████████████████████████████████████████| 176/176 [01:01<00:00, 2.88it/s]100%|███████████████████████████████████████████| 20/20 [00:02<00:00, 6.84it/s]Accuracy = 0.2094CPU times: user 812 ms, sys: 163 ms, total: 976 msWall time: 1min 27s
When we ran this training in FP32 it took about 2 minutes and a half, and now it takes about 1 minute and a half. Let's see if now instead of training with a batch size of 128, we do it with a batch size of 256.
%%writefile accelerate_scripts/14_accelerate_base_code_fp16_bs256.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 256dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/13_accelerate_base_code_fp16_bs256.py
We run it
%%time!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15390.30 examples/s]Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14990.92 examples/s]100%|███████████████████████████████████████████| 88/88 [00:54<00:00, 1.62it/s]100%|███████████████████████████████████████████| 10/10 [00:02<00:00, 3.45it/s]Accuracy = 0.2236CPU times: user 670 ms, sys: 91.6 ms, total: 761 msWall time: 1min 12s
It has dropped only about 15 seconds
BF16 Training
Before we have trained in FP16 and now we are going to train in BF16, what is the difference?
As we can see in the picture, while FP16 compared to FP32 has fewer bits in the mantissa and exponent, which makes its range much smaller, BF16 compared to FP32 has the same number of bits in the exponent but fewer in the mantissa, which makes BF16 have the same range of numbers as FP32, but it is less accurate.
This is beneficial because in FP16 some calculations could give very high numbers, which in FP16 format could not be represented. In addition there are certain HW devices that are optimized for this format.
As before, we execute accelerate config
and indicate that we want BF16.
!accelerate config
--------------------------------------------------------------------------------In which compute environment are you running?This machine--------------------------------------------------------------------------------multi-GPUHow many different machines will you use (use more than 1 for multi-node training)? [1]: 1Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: noDo you wish to optimize your script with torch dynamo?[yes/NO]:noDo you want to use DeepSpeed? [yes/NO]: noDo you want to use FullyShardedDataParallel? [yes/NO]: noDo you want to use Megatron-LM ? [yes/NO]: noHow many GPU(s) should be used for distributed training? [1]:2What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1--------------------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?bf16accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml
Now we run the last script we created, i.e. with a batch size of 256
%%time!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14814.95 examples/s]Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14506.83 examples/s]100%|███████████████████████████████████████████| 88/88 [00:51<00:00, 1.70it/s]100%|███████████████████████████████████████████| 10/10 [00:03<00:00, 3.21it/s]Accuracy = 0.2112CPU times: user 688 ms, sys: 144 ms, total: 832 msWall time: 1min 17s
It took a similar time to what it took before, which is normal, since we have trained a model with 16-bit weights, just like before.
Training in FP8
Now we are going to train in FP8 format, which as its name suggests, is a floating point format, where each weight has 8 bits, so we run accelerate config
to tell it that we want FP8
!accelerate config
--------------------------------------------------------------------------------In which compute environment are you running?This machine--------------------------------------------------------------------------------multi-GPUHow many different machines will you use (use more than 1 for multi-node training)? [1]: 1Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: noDo you wish to optimize your script with torch dynamo?[yes/NO]:noDo you want to use DeepSpeed? [yes/NO]: noDo you want to use FullyShardedDataParallel? [yes/NO]: noDo you want to use Megatron-LM ? [yes/NO]: noHow many GPU(s) should be used for distributed training? [1]:2What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1--------------------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?fp8accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml
Now we execute the last script, the batch size of 256
%%time!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
Traceback (most recent call last):File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in <module>accelerator = Accelerator()^^^^^^^^^^^^^File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__self.state = AcceleratorState(^^^^^^^^^^^^^^^^^File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__raise ValueError(ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.Traceback (most recent call last):File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in <module>accelerator = Accelerator()^^^^^^^^^^^^^File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__self.state = AcceleratorState(^^^^^^^^^^^^^^^^^File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__raise ValueError(ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.[2024-05-13 21:40:56,455] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 501480) of binary: /home/wallabot/miniconda3/envs/nlp/bin/pythonTraceback (most recent call last):File "/home/wallabot/miniconda3/envs/nlp/bin/accelerate", line 8, in <module>sys.exit(main())^^^^^^File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in mainargs.func(args)File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_commandmulti_gpu_launcher(args)File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcherdistrib_run.run(args)File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in runelastic_launch(File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__return launch_agent(self._config, self._entrypoint, list(args))^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agentraise ChildFailedError(torch.distributed.elastic.multiprocessing.errors.ChildFailedError:============================================================accelerate_scripts/13_accelerate_base_code_fp16_bs256.py FAILED------------------------------------------------------------Failures:[1]:time : 2024-05-13_21:40:56host : wallabotrank : 1 (local_rank: 1)exitcode : 1 (pid: 501481)error_file: <N/A>traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html------------------------------------------------------------Root Cause (first observed failure):[0]:time : 2024-05-13_21:40:56host : wallabotrank : 0 (local_rank: 0)exitcode : 1 (pid: 501480)error_file: <N/A>traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html============================================================CPU times: user 65.1 ms, sys: 14.5 ms, total: 79.6 msWall time: 7.24 s
As the weights are now 8 bits and occupy half of the memory, we will increase the batch size to 512.
%%writefile accelerate_scripts/15_accelerate_base_code_fp8_bs512.pyimport torchfrom torch.utils.data import DataLoaderfrom torch.optim import Adamfrom datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport evaluateimport tqdm# Importamos e inicializamos Acceleratorfrom accelerate import Acceleratoraccelerator = Accelerator()dataset = load_dataset("tweet_eval", "emoji")num_classes = len(dataset["train"].info.features["label"].names)max_len = 130checkpoints = "cardiffnlp/twitter-roberta-base-irony"tokenizer = AutoTokenizer.from_pretrained(checkpoints)def tokenize_function(dataset):return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")tokenized_dataset = {"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),}tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])BS = 512dataloader = {"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),}model = AutoModelForSequenceClassification.from_pretrained(checkpoints)model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)loss_function = torch.nn.CrossEntropyLoss()optimizer = Adam(model.parameters(), lr=5e-4)metric = evaluate.load("accuracy")EPOCHS = 1# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")device = accelerator.device# model.to(device)model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])for i in range(EPOCHS):model.train()progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)for batch in progress_bar_train:optimizer.zero_grad()input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)outputs = model(input_ids=input_ids, attention_mask=attention_mask)loss = loss_function(outputs['logits'], labels)# loss.backward()accelerator.backward(loss)optimizer.step()model.eval()progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)for batch in progress_bar_validation:input_ids = batch["input_ids"]#.to(device)attention_mask = batch["attention_mask"]#.to(device)labels = batch["label"]#.to(device)with torch.no_grad():outputs = model(input_ids=input_ids, attention_mask=attention_mask)predictions = torch.argmax(outputs['logits'], axis=-1)# Recopilamos las predicciones de todos los dispositivospredictions = accelerator.gather_for_metrics(predictions)labels = accelerator.gather_for_metrics(labels)accuracy = metric.add_batch(predictions=predictions, references=labels)accuracy = metric.compute()accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/15_accelerate_base_code_fp8_bs512.py
We run it
%%time!accelerate launch accelerate_scripts/15_accelerate_base_code_fp8_bs512.py
Model inference
Using the Hugging Face Ecosystem
Let's see how to do large model inference with the transformers
library of hugging face.
Inference with pipeline
.
If we use the Hugging Face ecosystem it is very simple, since everything is produced underneath without us having to do much. In the case of using pipeline
, which is the easiest way to do inference with the transformers
library, we simply have to tell it the model we want to use and very important, pass device_map="auto"
. This will cause accelerate
to distribute the model among the different GPUs, CPU RAM or hard disk if necessary.
There are more possible values for device_map
, which we will see later, but for now stay with "auto"
.
We are going to use the Llama3 8B
model, which as its name indicates is a model of about 8 billion parameters, as each parameter by default is in FP32 format, which corresponds to 4 bytes (32 bits), that means that if we multiply 8 billion parameters by 4 bytes, we would need a GPU with about 32 GB of VRAM.
In my case I have 2 GPUs with 24 GB of VRAM, so it would not fit on a single GPU. But thanks to put device_map="auto"
, accelerate will distribute the model between the two GPUs and I will be able to make the inference.
%%time!accelerate launch accelerate_scripts/15_accelerate_base_code_fp8_bs512.py%%writefile accelerate_scripts/16_inference_with_pipeline.pyfrom transformers import pipelinecheckpoints = "meta-llama/Meta-Llama-3-8B-Instruct"generator = pipeline(model=checkpoints, device_map="auto")prompt = "Conoces accelerate de hugging face?"output = generator(prompt)print(output)
Overwriting accelerate_scripts/09_inference_with_pipeline.py
Now we run it, but since pipeline uses accelerate below, we don't need to run it with accelerate launch script.py
but with python script.py
will do.
!python accelerate_scripts/16_inference_with_pipeline.py
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00, 2.27s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.[{'generated_text': 'Conoces accelerate de hugging face? ¿Qué es el modelo de lenguaje de transformers y cómo se utiliza en el marco de hugging face? ¿Cómo puedo utilizar modelos de lenguaje de transformers en mi aplicación? ¿Qué son los tokenizers y cómo se utilizan en el marco de hugging face? ¿Cómo puedo crear un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los datasets y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar datasets para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning'}]
As you can see, it did not answer, but kept asking questions. This is because Llama3 is a language model that predicts the next token, so with the prompt that I have passed it, it has considered that the next best tokens are those that correspond to more questions. Which makes sense, because there are times when people have doubts about a topic and generates many questions, so to answer the question we have to condition it a little bit
%%writefile accelerate_scripts/17_inference_with_pipeline_condition.pyfrom transformers import pipelinecheckpoints = "meta-llama/Meta-Llama-3-8B-Instruct"generator = pipeline(model=checkpoints, device_map="auto")prompt = "Conoces accelerate de hugging face?"messages = [{"role": "system","content": "Eres un chatbot amigable que siempre intenta solucionar las dudas",},{"role": "user", "content": f"{prompt}"},]output = generator(messages)print(output[0]['generated_text'][-1])
Overwriting accelerate_scripts/10_inference_with_pipeline_condition.py
As you can see, a message has been generated with roles, conditioning the model and with the following prompt
!python accelerate_scripts/17_inference_with_pipeline_condition.py
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00, 2.41s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.{'role': 'assistant', 'content': '¡Hola! Sí, conozco Accelerate de Hugging Face. Accelerate es una biblioteca de Python desarrollada por Hugging Face que se enfoca en simplificar y acelerar el entrenamiento y la evaluación de modelos de lenguaje en diferentes dispositivos y entornos. Con Accelerate, puedes entrenar modelos de lenguaje en diferentes plataformas y dispositivos, como GPUs, TPUs, CPUs y servidores, sin necesidad de cambiar el código de tu modelo. Esto te permite aprovechar al máximo la potencia de cálculo de tus dispositivos y reducir el tiempo de entrenamiento. Accelerate también ofrece varias características adicionales, como: * Soporte para diferentes frameworks de machine learning, como TensorFlow, PyTorch y JAX. * Integración con diferentes sistemas de almacenamiento y procesamiento de datos, como Amazon S3 y Google Cloud Storage. * Soporte para diferentes protocolos de comunicación, como HTTP y gRPC. * Herramientas para monitorear y depurar tus modelos en tiempo real. En resumen, Accelerate es una herramienta muy útil para desarrolladores de modelos de lenguaje que buscan simplificar y acelerar el proceso de entrenamiento y evaluación de sus modelos. ¿Tienes alguna pregunta específica sobre Accelerate o necesitas ayuda para implementarlo en tu proyecto?'}
Now the answer if it responds to our prompt
Inference with AutoClass
Finally we will see how to do the inference only with AutoClass
.
%%writefile accelerate_scripts/18_inference_with_autoclass.pyfrom transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamercheckpoints = "meta-llama/Meta-Llama-3-8B-Instruct"tokenizer = AutoTokenizer.from_pretrained(checkpoints, device_map="auto")model = AutoModelForCausalLM.from_pretrained(checkpoints, device_map="auto")streamer = TextStreamer(tokenizer)prompt = "Conoces accelerate de hugging face?"tokens_input = tokenizer([prompt], return_tensors="pt").to(model.device)_ = model.generate(**tokens_input, streamer=streamer, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)
Overwriting accelerate_scripts/11_inference_with_autoclass.py
As you can see, the streamer
object has been created and then passed to the generate
method of the model. This is useful so that each word is printed as it is generated and you don't have to wait for all the output to be generated before printing it.
!python accelerate_scripts/18_inference_with_autoclass.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00, 2.28s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.<|begin_of_text|>Conoces accelerate de hugging face? Si es así, puedes utilizar la biblioteca `transformers` de Hugging Face para crear un modelo de lenguaje que pueda predecir la siguiente palabra en una secuencia de texto.Aquí te muestro un ejemplo de cómo hacerlo:```import pandas as pdimport torchfrom transformers import AutoModelForSequenceClassification, AutoTokenizer# Cargar el modelo y el tokenizadormodel_name = "bert-base-uncased"model = AutoModelForSequenceClassification.from_pretrained(model_name)tokenizer = AutoTokenizer.from_pretrained(model_name)# Cargar el conjunto de datostrain_df = pd.read_csv("train.csv")test_df = pd.read_csv("test.csv")# Preprocesar los datostrain_texts = train_df["text"]train_labels = train_df["label"]test_texts = test_df["text"]# Convertir los textos en entradas para el modelotrain_encodings = tokenizer.batch_encode_plus(train_texts,add_special_tokens=True,max_length=512,return_attention_mask=True,return_tensors='pt')test_encodings = tokenizer.batch_encode_plus(test_texts,add_special_tokens=True,max_length=512,return_attention_mask=True,return_tensors='pt')# Crear un dataloader para entrenar el modelotrain_dataset = torch.utils.data.TensorDataset(train_encodings["input_ids"],train_encodings["attention_mask"],torch.tensor(train_labels))train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)# Entrenar el modelodevice = torch.device("cuda" if torch.cuda.is_available() else "cpu")model.to(device)criterion = torch.nn.CrossEntropyLoss()optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)for epoch in range(5):model.train()total_loss = 0for batch in train_loader:input_ids = batch[0].to(device)attention_mask = batch[1].to(device)labels = batch[2].to(device)optimizer.zero_grad()outputs = model(input_ids, attention_mask=attention_mask, labels=labels)loss = criterion(outputs, labels)loss.backward()optimizer.step()total_loss += loss.item()print(f"Epoch {epoch+1}, Loss: {total
Use pytorch
Normally the way to make inferences with pytorch is to create a model with the weights initialized randomly and then load a state dict
with the weights of the pre-trained model, so to get that state dict
we are going to make a little trick first and download them
import torchimport torchvision.models as modelsmodel = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')
Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/maximo.fernandez/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth100%|██████████| 230M/230M [02:48<00:00, 1.43MB/s]
Now that we have the state dict
let's do inference as it is normally done in pytorch
import torchimport torchvision.models as modelsdevice = "cuda" if torch.cuda.is_available() else "cpu" # Set deviceresnet152 = models.resnet152().to(device) # Create model with random weights and move to devicestate_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device) # Load pretrained weights into device memoryresnet152.load_state_dict(state_dict) # Load this weights into the modelinput = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1output = resnet152(input)output.shape
torch.Size([1, 1000])
Let us explain what happened
- When we did
resnet152 = models.resnet152().to(device)
a resnet152 with random weights was loaded into GPU memory. - When we did
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)
a dictionary with the trained weights was loaded into GPU memory. - When we have done
resnet152.load_state_dict(state_dict)
these pre-trained weights have been assigned to the model.
In other words, the model has been loaded twice in the GPU memory.
You may be wondering why we have done this first.
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')
To then make
resnet152 = models.resnet152().to(device)
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)
resnet152.load_state_dict(state_dict)
And why don't we use directly
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
And we stop saving the state dict
to load it later. Well, because Pytorch, by edbajo does the same thing that we have done. So to be able to see the whole process we have done in several lines what Pytorch does in one line
This way of working has worked well until now, as long as the models had a manageable size for user GPUs. But since the advent of LLMs this approach does not make sense.
For example, a 6B model of parameters would occupy 24 GB of memory, and since it is loaded twice with this way of working, a 48 GB GPU would be required.
So to fix this, the way to load a pre-trained Pytorch model is:
- Create an empty model with
init_empty_weights
that will not occupy RAM. - Then load the weights with
load_checkpoint_and_dispatch
which will load a checkpoint inside the empty model and distribute the weights for each layer on all available devices (GPU, CPU RAM and hard disk), thanks to settingdevice_map="auto"
.
import torchimport torchvision.models as modelsfrom accelerate import init_empty_weights, load_checkpoint_and_dispatchwith init_empty_weights():resnet152 = models.resnet152()resnet152 = load_checkpoint_and_dispatch(resnet152, checkpoint='accelerate_scripts/resnet152_pretrained.pth', device_map="auto")device = "cuda" if torch.cuda.is_available() else "cpu"input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1output = resnet152(input)output.shape
torch.Size([1, 1000])
How accelerate works below
In this video you can see graphically how accelerate works below
Initialization of an empty model
Accelerate creates the skeleton of an empty model using init_empty_weights
to occupy as little memory as possible.
For example, let's see how much RAM I now have available on my computer
import psutildef get_ram_info():ram = dict(psutil.virtual_memory()._asdict())print(f"Total RAM: {(ram['total']/1024/1024/1024):.2f} GB, Available RAM: {(ram['available']/1024/1024/1024):.2f} GB, Used RAM: {(ram['used']/1024/1024/1024):.2f} GB")get_ram_info()
Total RAM: 31.24 GB, Available RAM: 22.62 GB, Used RAM: 7.82 GB
I have about 22 GB of RAM available
Now let's try to create a model 5000x1000x1000 parameters, i.e. 5B parameters, if each parameter is in FP32, it means 20 GB of RAM.
import torchfrom torch import nnmodel = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])
If we look at RAM again
import torchfrom torch import nnmodel = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])get_ram_info()
Total RAM: 31.24 GB, Available RAM: 3.77 GB, Used RAM: 26.70 GB
As we can see now we only have 3 GB of RAM available.
Now let's delete the model to free RAM
del modelget_ram_info()
Total RAM: 31.24 GB, Available RAM: 22.44 GB, Used RAM: 8.03 GB
We again have about 22 GB of RAM available.
Let's now use init_empty_weights
from accelerate
and then we see the RAM
from accelerate import init_empty_weightswith init_empty_weights():model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])get_ram_info()
Total RAM: 31.24 GB, Available RAM: 22.32 GB, Used RAM: 8.16 GB
Before we had exactly 22.44 GB free and after creating the model with init_empty_weights
we have 22.32 GB. The saving in RAM is enormous! Almost no RAM has been used to create the model.
This is based on the metadevice introduced in PyTorch 1.9, so it is important that to use accelerate
we have a later version of Pytorch.
Loading weights
Once we have initialized the model we have to load the weights using load_checkpoint_and_dispatch
which, as its name indicates, loads the weights and sends them to the device or devices required.