LoRA - Low-Rank Adaptation of Large Language Models
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
As language models grow, training them becomes more and more expensive, because more and more VRAM is needed to store all their parameters and the gradients computed during training.
In the paper LoRA: Low-Rank Adaptation of Large Language Models, the authors propose freezing the model weights and training two small matrices, called A and B, which greatly reduces the number of parameters to be trained.
Let's see how this is done.
LoRA explanation
Updating the weights of a neural network
To understand how LoRA works, we first have to remember what happens when we train a model. Let's go back to the most basic part of deep learning: a dense layer of a neural network, defined as
$$ y = Wx + b $$
where $W$ is the weights matrix and $b$ is the bias vector.
For the sake of simplicity we will assume that there is no bias, so it looks like this:
$$ y = Wx $$
Suppose that for an input $x$ we want the output to be $\hat{y}$.
- First, we calculate the output we get with the current value of the weights $W$, i.e. we get the value $y$.
- Next, we calculate the error between the value $y$ that we obtained and the value $\hat{y}$ that we wanted to obtain. We call this error $loss$ and compute it with some mathematical function; which one does not matter for now.
- We compute the gradient (the derivative) of the error $loss$ with respect to the weights matrix $W$, i.e. $\Delta W = \frac{d\,loss}{dW}$.
- We update the weights $W$ by subtracting from each of their values the value of the gradient multiplied by a learning factor $\alpha$, i.e. $W = W - \alpha \Delta W$.
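The four steps above can be sketched numerically. This is only an illustration under assumptions that are not in the original post: the loss is taken to be half the squared error, the sizes and the learning factor are arbitrary, and NumPy is used to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense layer without bias: y = W x
W = rng.normal(size=(3, 4))        # weights matrix (3 outputs, 4 inputs)
x = rng.normal(size=(4,))          # input vector
y_target = rng.normal(size=(3,))   # desired output (y-hat in the text)

def loss_fn(weights):
    """Half the squared error between the layer output and the target (assumed loss)."""
    y = weights @ x
    return 0.5 * np.sum((y - y_target) ** 2)

lr = 0.01  # learning factor (alpha in the text), arbitrary value

loss_before = loss_fn(W)

# Steps 1 and 2: forward pass and error
y = W @ x

# Step 3: gradient of the loss with respect to W, which for this loss is (y - y_target) x^T
delta_W = np.outer(y - y_target, x)

# Step 4: weight update W = W - alpha * delta_W
W = W - lr * delta_W

loss_after = loss_fn(W)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

Note that `delta_W` has exactly the same shape as `W`, which is the point the next section starts from.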
LoRA
The authors of LoRA propose that the fine-tuned weights can be expressed as the original weights matrix $W$ plus an update:
$$ W \sim W + \Delta W $$
So, by freezing the $W$ matrix and training only the $\Delta W$ matrix, it is possible to obtain a model that fits new data without having to retrain the whole model.
You may think that $\Delta W$ is a matrix of the same size as $W$, so nothing has been gained. Here the authors rely on Aghajanyan et al. (2020), a paper showing that although language models are large and their parameters are matrices with very large dimensions, adapting them to new tasks does not require changing all the values of those matrices freely: the change lives in a much lower-dimensional space. In technical terms, the update $\Delta W$ can be approximated by a low-rank matrix, hence the name LoRA (Low-Rank Adaptation).
We have frozen the model and now we want to train the $\Delta W$ matrix. Let's assume that both $W$ and $\Delta W$ are matrices of size $20 \times 10$, so we have 200 trainable parameters.
Now suppose that the matrix $\Delta W$ can be decomposed into the product of two matrices $A$ and $B$, i.e.
$$ \Delta W = A \cdot B $$
For this multiplication to be possible, the sizes of the matrices $A$ and $B$ have to be $20 \times n$ and $n \times 10$ respectively. Suppose $n = 5$: $A$ would be of size $20 \times 5$, i.e. 100 parameters, and $B$ of size $5 \times 10$, i.e. 50 parameters, so we would have 100 + 50 = 150 trainable parameters. We already have fewer trainable parameters than before.
Now let's suppose that $W$ is actually a matrix of size $10{,}000 \times 10{,}000$, so we would have 100,000,000 trainable parameters. If we decompose $\Delta W$ into $A$ and $B$ with $n = 5$, we get one matrix of size $10{,}000 \times 5$ and another of size $5 \times 10{,}000$, i.e. 50,000 parameters each and 100,000 trainable parameters in total. We have reduced the number of trainable parameters by a factor of 1,000.
This is where the power of LoRA shows: for very large models, the number of trainable parameters can be greatly reduced.
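The arithmetic of the two examples can be reproduced with a few lines of plain Python. The helper names `full_params` and `lora_params` are made up for this sketch.

```python
def full_params(rows: int, cols: int) -> int:
    """Trainable parameters when the whole rows x cols update matrix is trained."""
    return rows * cols

def lora_params(rows: int, cols: int, n: int) -> int:
    """Trainable parameters when the update is factored as A (rows x n) times B (n x cols)."""
    return rows * n + n * cols

# The two cases discussed in the text: 20 x 10 and 10,000 x 10,000, both with n = 5
for rows, cols, n in [(20, 10, 5), (10_000, 10_000, 5)]:
    full = full_params(rows, cols)
    lora = lora_params(rows, cols, n)
    print(f"{rows} x {cols}, n={n}: full={full:,}  A+B={lora:,}  reduction={full / lora:.1f}x")
```

The reduction grows with the size of the matrix: for a fixed $n$, the full count grows with the product of the dimensions while the factored count grows only with their sum.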
If we look again at the image of the LoRA architecture, we will understand it better.
The savings in the number of trainable parameters are even clearer in this image.
Implementation of LoRA in transformers
Since language models are built on the transformer architecture, let's see how LoRA is applied to transformers. In the transformer architecture there are linear layers in the $Q$, $K$ and $V$ attention projections and in the feedforward layers, so LoRA can be applied to all of these linear layers. In the paper, the authors say that for simplicity they apply it only to the linear layers of the $Q$, $K$ and $V$ attention matrices.
These layers have a size $d_{model} \times d_{model}$, where $d_{model}$ is the embedding dimension of the model.
Rank size r
To obtain these benefits, the rank $r$ has to be smaller than the size of the linear layers. Since we have said that it is applied only to the linear layers of attention, which have a size $d_{model} \times d_{model}$, the rank $r$ has to be smaller than $d_{model}$ (in practice much smaller, because $A$ and $B$ together hold $2 \cdot r \cdot d_{model}$ parameters).
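How small the rank has to be can be made concrete. Factoring a matrix of size rows x cols with rank $r$ costs $r \cdot (rows + cols)$ parameters, so it only pays off while that is below $rows \cdot cols$. The sketch below uses a hypothetical helper, `max_useful_rank`, and takes $d_{model} = 768$ as an example value (the embedding size of GPT-2 small, the model used later in this post).

```python
def lora_params(rows: int, cols: int, rank: int) -> int:
    """Parameters held by A (rows x rank) and B (rank x cols) together."""
    return rank * (rows + cols)

def max_useful_rank(rows: int, cols: int) -> int:
    """Largest rank for which A and B together are smaller than the full matrix."""
    # We need rank * (rows + cols) < rows * cols
    return (rows * cols - 1) // (rows + cols)

d_model = 768  # example value
full = d_model * d_model

for rank in (1, 4, 16, 64, 383, 384):
    lora = lora_params(d_model, d_model, rank)
    print(f"rank={rank:>3}: A+B={lora:>7,} vs full={full:,}  saves parameters: {lora < full}")

print("largest useful rank for a square layer:", max_useful_rank(d_model, d_model))
print("largest useful rank for the 20 x 10 example:", max_useful_rank(20, 10))
```

For a square layer the break-even point is $r = d_{model} / 2$; typical LoRA ranks are far below it.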
Initialization of matrices A and B
The matrix $A$ is initialized with a random Gaussian distribution and the matrix $B$ with zeros, so the product of both matrices is zero at the beginning, i.e.
$$ \Delta W = A \cdot B = 0 $$
Influence of LoRA through the parameter $\alpha$
Finally, the LoRA implementation adds a parameter $\alpha$ that sets the degree of influence of LoRA on training. It plays a role similar to the learning rate in normal fine tuning, but here it scales the contribution of the LoRA update. Thus the LoRA formula looks like this:
$$ W = W + \alpha \Delta W = W + \alpha A \cdot B $$
Advantages of LoRA
Now that we understand how it works, let's take a look at the advantages of this approach:
- Reduction of the number of trainable parameters. As we have seen, the number of trainable parameters is drastically reduced, which makes training much faster and requires less VRAM, thus saving a lot of cost.
- Adapters in production. We can keep a single language model in production together with several adapters, one per task, instead of several fully trained models, which saves storage and computational cost. Moreover, this method does not have to add inference latency, because we can merge the adapter into the original weight matrix: since $W \sim W + \Delta W = W + A \cdot B$, the merged matrix has the same shape as $W$ and inference takes the same time as with the original language model.
- Buying and sharing adapters. If we train an adapter, we can share only the adapter. Everyone can keep the original model, and every time we train an adapter we share only its matrices, which are much smaller, so the files to be shared are much smaller too.
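The merging argument from the list above can be checked numerically. The following is a small NumPy sketch with arbitrary sizes and an arbitrary $\alpha$; it follows the convention of the formulas in this post ($y = Wx$, $\Delta W = A \cdot B$) and is not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 6, 8, 2, 0.5   # arbitrary illustrative values

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weights
A = rng.normal(size=(d_out, rank))   # adapter factor
B = rng.normal(size=(rank, d_in))    # adapter factor (non-zero, as after training)
x = rng.normal(size=(d_in,))

# Adapter kept separate: the base product plus two small extra products
y_separate = W @ x + alpha * (A @ (B @ x))

# Adapter merged once into the weights: a single product, like the base model
W_merged = W + alpha * (A @ B)
y_merged = W_merged @ x

print("same output after merging:", np.allclose(y_separate, y_merged))
print("merged matrix keeps the shape of W:", W_merged.shape == W.shape)

# At initialization B is all zeros, so the adapter does not change the output
B_init = np.zeros((rank, d_in))
y_at_init = W @ x + alpha * (A @ (B_init @ x))
print("output unchanged at initialization:", np.allclose(y_at_init, W @ x))
```

Because the merge is just an addition, it can be undone by subtracting $\alpha A \cdot B$ again, which is what makes swapping adapters on a single base model practical.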
Implementation of LoRA in an LLM
We are going to repeat the training code of the post Fine tuning SLMs, specifically the training for text classification with the Hugging Face libraries, but this time with LoRA. In the previous post we used a batch size of 28 for the training loop and 40 for the evaluation loop. Now that we are not going to train all the weights of the model, but only the LoRA matrices, we will be able to use a bigger batch size.
Login to the Hub
We log in so that we can upload the model to the Hub.

    from huggingface_hub import notebook_login

    notebook_login()
Dataset
We download the dataset we are going to use, which is a dataset of Amazon reviews.

    from datasets import load_dataset

    dataset = load_dataset("mteb/amazon_reviews_multi", "en")
    dataset

    DatasetDict({
        train: Dataset({
            features: ['id', 'text', 'label', 'label_text'],
            num_rows: 200000
        })
        validation: Dataset({
            features: ['id', 'text', 'label', 'label_text'],
            num_rows: 5000
        })
        test: Dataset({
            features: ['id', 'text', 'label', 'label_text'],
            num_rows: 5000
        })
    })
We create a subset in case you want to test the code with a smaller dataset. In my case I will use 100% of the dataset.

    percentage = 1
    subset_dataset_train = dataset['train'].select(range(int(len(dataset['train']) * percentage)))
    subset_dataset_validation = dataset['validation'].select(range(int(len(dataset['validation']) * percentage)))
    subset_dataset_test = dataset['test'].select(range(int(len(dataset['test']) * percentage)))
    subset_dataset_train, subset_dataset_validation, subset_dataset_test

    (Dataset({
         features: ['id', 'text', 'label', 'label_text'],
         num_rows: 200000
     }),
     Dataset({
         features: ['id', 'text', 'label', 'label_text'],
         num_rows: 5000
     }),
     Dataset({
         features: ['id', 'text', 'label', 'label_text'],
         num_rows: 5000
     }))
We look at a sample.

    from random import randint

    # randint includes both endpoints, so the upper bound is len - 1
    idx = randint(0, len(subset_dataset_train) - 1)
    subset_dataset_train[idx]

    {'id': 'en_0388304',
     'text': 'The N was missing from on The N was missing from on',
     'label': 0,
     'label_text': '0'}
We obtain the number of classes. To do so we use `dataset['train']` and not `subset_dataset_train`, because if the subset is too small it may not contain examples of every class in the original dataset.

    num_classes = len(dataset['train'].unique('label'))
    num_classes
We create a function that adds the `labels` field to the dataset. The downloaded dataset has a `label` field, but the transformers library needs the field to be called `labels` and not `label`.

    def set_labels(example):
        example['labels'] = example['label']
        return example
We apply the function to the dataset.

    subset_dataset_train = subset_dataset_train.map(set_labels)
    subset_dataset_validation = subset_dataset_validation.map(set_labels)
    subset_dataset_test = subset_dataset_test.map(set_labels)
    subset_dataset_train, subset_dataset_validation, subset_dataset_test

    (Dataset({
         features: ['id', 'text', 'label', 'label_text', 'labels'],
         num_rows: 200000
     }),
     Dataset({
         features: ['id', 'text', 'label', 'label_text', 'labels'],
         num_rows: 5000
     }),
     Dataset({
         features: ['id', 'text', 'label', 'label_text', 'labels'],
         num_rows: 5000
     }))

Here is a sample again.

    {'id': 'en_0388304',
     'text': 'The N was missing from on The N was missing from on',
     'label': 0,
     'label_text': '0',
     'labels': 0}
Tokenizer
We instantiate the tokenizer. To avoid errors, we use the end-of-sequence (EOS) token as the padding token.

    from transformers import AutoTokenizer

    checkpoint = "openai-community/gpt2"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenizer.pad_token = tokenizer.eos_token
We create a function for tokenizing the dataset.

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
We apply the function to the dataset and remove the columns that we do not need.

    subset_dataset_train = subset_dataset_train.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
    subset_dataset_validation = subset_dataset_validation.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
    subset_dataset_test = subset_dataset_test.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
    subset_dataset_train, subset_dataset_validation, subset_dataset_test

    (Dataset({
         features: ['labels', 'input_ids', 'attention_mask'],
         num_rows: 200000
     }),
     Dataset({
         features: ['labels', 'input_ids', 'attention_mask'],
         num_rows: 5000
     }),
     Dataset({
         features: ['labels', 'input_ids', 'attention_mask'],
         num_rows: 5000
     }))

We look at a sample again, but this time only at its keys.

    dict_keys(['labels', 'input_ids', 'attention_mask'])
Model
We instantiate the model. Again, to avoid errors, we use the end-of-sequence (EOS) token as the padding token.

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes)
    model.config.pad_token_id = model.config.eos_token_id

    Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

As we have already seen in the post Fine tuning SLMs, we get a warning that some layers have not been initialized. This is because it is a classification problem: when we instantiated the model we told it that we want a classification model with 5 classes, so the library removed the last layer and replaced it with one that has 5 output neurons. If this is not clear, the post mentioned above explains it in more detail.
LoRA
Before implementing LoRA, we look at the number of trainable parameters that the model has.

    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total trainable parameters before: {total_params:,}")

    Total trainable parameters before: 124,443,648

We see that it has 124M trainable parameters. Now let's freeze them.

    for param in model.parameters():
        param.requires_grad = False

    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total trainable parameters after: {total_params:,}")

    Total trainable parameters after: 0
After freezing there are no more trainable parameters.
Let's see what the model looks like before applying LoRA.

    GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Conv1D()
              (c_proj): Conv1D()
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): GPT2MLP(
              (c_fc): Conv1D()
              (c_proj): Conv1D()
              (act): NewGELUActivation()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (score): Linear(in_features=768, out_features=5, bias=False)
    )
First we create the LoRA layer.
It has to inherit from `torch.nn.Module` so that it can act as a layer of a neural network.
In the `__init__` method we create the `A` and `B` matrices. As explained before, `A` gets a random initialization (this implementation uses Kaiming uniform rather than a Gaussian) and `B` is filled with zeros. The `rank` parameter sets the inner dimension of both matrices, and `alpha` is stored as the scaling factor.
In the `forward` method we calculate LoRA as explained above.

    import torch

    class LoRALayer(torch.nn.Module):
        def __init__(self, in_dim, out_dim, rank, alpha):
            super().__init__()
            self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
            torch.nn.init.kaiming_uniform_(self.A, a=torch.sqrt(torch.tensor(5.)).item())  # similar to standard weight initialization
            self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
            self.alpha = alpha

        def forward(self, x):
            x = self.alpha * (x @ self.A @ self.B)
            return x
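One detail of the class above is easy to miss: it stores `A` with shape `(in_dim, rank)` and `B` with shape `(rank, out_dim)` and multiplies the input from the left (`x @ A @ B`), whereas the formulas earlier in the post use the column-vector form $y = Wx$. The NumPy sketch below, with assumed shapes matching the `score` layer (768 inputs, 5 outputs) and a Gaussian stand-in for the random initialization, suggests that both forms describe the same update and that the output is zero at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_dim, out_dim, rank, alpha = 4, 768, 5, 16, 16   # assumed example sizes

x = rng.normal(size=(batch, in_dim))    # one row per example
A = rng.normal(size=(in_dim, rank))     # random init (Gaussian stand-in for Kaiming uniform)
B = np.zeros((rank, out_dim))           # zero init, as in the class above

# What forward computes right after initialization
out = alpha * (x @ A @ B)
print(out.shape)                        # one row of out_dim values per example
print(float(np.abs(out).max()))         # the adapter contributes nothing yet

# After training B is no longer zero; a random B stands in for a trained one here
B_trained = rng.normal(size=(rank, out_dim))
row_form = x @ A @ B_trained                 # row-vector form used in forward
delta_W = (A @ B_trained).T                  # (out_dim, in_dim), as in y = W x
col_form = (delta_W @ x.T).T                 # column-vector form from the formulas
print(np.allclose(row_form, col_form))
```

In other words, the `A` and `B` of the code are the transposes of the factors you would write for $\Delta W$ in the $y = Wx$ convention; the number of parameters is the same either way.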
Now we create a linear class with LoRA.
As before, it inherits from `torch.nn.Module` so that it can act as a layer of a neural network.
In the `__init__` method we keep the original linear layer of the network in one attribute and create the new LoRA layer, implemented above, in another.
In the `forward` method we add the outputs of the original linear layer and the LoRA layer.

    class LoRALinear(torch.nn.Module):
        def __init__(self, linear, rank, alpha):
            super().__init__()
            self.linear = linear
            self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

        def forward(self, x):
            return self.linear(x) + self.lora(x)
Finally we create a function that replaces the linear layers with the new linear layer with LoRA that we have created. If it finds a linear layer in the model, it replaces it with the linear layer with LoRA; otherwise, it applies itself recursively to the sublayers of that layer.

    def replace_linear_with_lora(model, rank, alpha):
        for name, module in model.named_children():
            if isinstance(module, torch.nn.Linear):
                # Replace the Linear layer with LoRALinear
                setattr(model, name, LoRALinear(module, rank, alpha))
            else:
                # Recursively apply the same function to child modules
                replace_linear_with_lora(module, rank, alpha)
We apply the function to the model to replace its linear layers with the new linear layer with LoRA.

    rank = 16
    alpha = 16
    replace_linear_with_lora(model, rank=rank, alpha=alpha)
We now look at the number of trainable parameters.

    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total trainable LoRA parameters: {total_params:,}")

    Total trainable LoRA parameters: 12,368

We have gone from 124M trainable parameters to 12k trainable parameters, i.e. we have reduced the number of trainable parameters by a factor of about 10,000!
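The 12,368 figure can be checked by hand. In the printed model, `score` is the only `torch.nn.Linear` layer (the attention and MLP projections appear as `Conv1D`), so it is the only layer the replacement function wraps, and its `A` and `B` matrices are the only trainable tensors. A plain-Python check using the numbers printed above:

```python
in_features, out_features, rank = 768, 5, 16   # the score layer and the rank used above

a_params = in_features * rank      # A has shape (in_features, rank)
b_params = rank * out_features     # B has shape (rank, out_features)
lora_trainable = a_params + b_params

total_before = 124_443_648         # trainable parameters printed before freezing

print(f"A: {a_params:,}  B: {b_params:,}  total: {lora_trainable:,}")
print(f"reduction factor: {total_before / lora_trainable:,.0f}")
```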
We look at the model again.

    GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Conv1D()
              (c_proj): Conv1D()
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): GPT2MLP(
              (c_fc): Conv1D()
              (c_proj): Conv1D()
              (act): NewGELUActivation()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (score): LoRALinear(
        (linear): Linear(in_features=768, out_features=5, bias=False)
        (lora): LoRALayer()
      )
    )
Let's compare them layer by layer
Original Model | Model with LoRA |
GPT2ForSequenceClassification( | GPT2ForSequenceClassification( |
(transformer): GPT2Model( | (transformer): GPT2Model( |
(wte): Embedding(50257, 768) | (wte): Embedding(50257, 768) |
(wpe): Embedding(1024, 768) | (wpe): Embedding(1024, 768) |
(drop): Dropout(p=0.1, inplace=False) | (drop): Dropout(p=0.1, inplace=False) |
(h): ModuleList( | (h): ModuleList( |
(0-11): 12 x GPT2Block( | (0-11): 12 x GPT2Block( |
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) | (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) |
(attn): GPT2Attention( | (attn): GPT2Attention( |
(c_attn): Conv1D() | (c_attn): Conv1D() |
(c_proj): Conv1D() | (c_proj): Conv1D() |
(attn_dropout): Dropout(p=0.1, inplace=False) | (attn_dropout): Dropout(p=0.1, inplace=False) |
(resid_dropout): Dropout(p=0.1, inplace=False) | (resid_dropout): Dropout(p=0.1, inplace=False) |
) | ) |
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) | (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) |
(mlp): GPT2MLP( | (mlp): GPT2MLP( |
(c_fc): Conv1D() | (c_fc): Conv1D() |
(c_proj): Conv1D() | (c_proj): Conv1D() |
(act): NewGELUActivation() | (act): NewGELUActivation() |
(dropout): Dropout(p=0.1, inplace=False) | (dropout): Dropout(p=0.1, inplace=False) |
) | ) |
) | ) |
) | ) |
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True) | (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True) |
) | ) |
 | (score): LoRALinear( |
(score): Linear(in_features=768, out_features=5, bias=False) | (linear): Linear(in_features=768, out_features=5, bias=False) |
 | (lora): LoRALayer() |
 | ) |
) | ) |
We see that they are the same except at the end: the original model has a normal linear layer, while the model with LoRA has a `LoRALinear` layer that contains the linear layer of the original model and a `LoRALayer`. The attention and MLP projections are `Conv1D` modules, not `torch.nn.Linear`, so the replacement function leaves them untouched.
Training
Once the model has been instantiated with LoRA, let's train it as usual.
As we have said, in the post Fine tuning SLMs we used a batch size of 28 for the training loop and 40 for the evaluation loop, while now that there are fewer trainable parameters we can use a larger batch size.
Why does this happen? When training a model, both the model and its gradients must be kept in GPU memory. The model has to be stored with or without LoRA, but with LoRA only the gradients of 12k parameters are stored, while without LoRA the gradients of 124M parameters are stored. LoRA therefore needs less GPU memory, and the memory that is freed can be used for a larger batch size.
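A rough back-of-the-envelope estimate makes the difference tangible. The sketch below assumes one 32-bit float per gradient value and deliberately ignores optimizer state, activations and the mixed-precision setting used in training, so the numbers are only indicative.

```python
BYTES_PER_FP32 = 4  # assumption: gradients stored as 32-bit floats

def gradient_mib(num_trainable: int, bytes_per_value: int = BYTES_PER_FP32) -> float:
    """Approximate memory taken by the gradients, in MiB."""
    return num_trainable * bytes_per_value / 1024**2

full_finetuning = gradient_mib(124_443_648)   # all GPT-2 parameters trainable
lora_finetuning = gradient_mib(12_368)        # only the LoRA matrices trainable

print(f"gradients without LoRA: {full_finetuning:.1f} MiB")
print(f"gradients with LoRA:    {lora_finetuning:.3f} MiB")
```

Optimizers such as Adam keep additional per-parameter state for every trainable parameter, so in practice the gap is larger than the gradient figures alone suggest.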

    from transformers import TrainingArguments

    metric_name = "accuracy"
    model_name = "GPT2-small-LoRA-finetuned-amazon-reviews-en-classification"
    LR = 2e-5
    BS_TRAIN = 400
    BS_EVAL = 400
    EPOCHS = 3
    WEIGHT_DECAY = 0.01

    training_args = TrainingArguments(
        model_name,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=LR,
        per_device_train_batch_size=BS_TRAIN,
        per_device_eval_batch_size=BS_EVAL,
        num_train_epochs=EPOCHS,
        weight_decay=WEIGHT_DECAY,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        load_best_model_at_end=True,
        metric_for_best_model=metric_name,
        push_to_hub=True,
        logging_dir="./runs",
    )

    import numpy as np
    from evaluate import load

    metric = load("accuracy")

    def compute_metrics(eval_pred):
        print(eval_pred)
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return metric.compute(predictions=predictions, references=labels)

    from transformers import Trainer

    trainer = Trainer(
        model,
        training_args,
        train_dataset=subset_dataset_train,
        eval_dataset=subset_dataset_validation,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.train()
Evaluation
Once trained, we evaluate on the test dataset.

    trainer.evaluate(eval_dataset=subset_dataset_test)
Publish the model
Now that we have our model trained, we can share it with the world, so first we create a model card.

    trainer.create_model_card()
And now we can publish it. Since the first thing we did was log in to the Hugging Face Hub, we can upload it to our hub without any problem.

    trainer.push_to_hub()
Model test
We free as much memory as possible.

    import torch
    import gc

    def clear_hardwares():
        torch.clear_autocast_cache()
        torch.cuda.ipc_collect()
        torch.cuda.empty_cache()
        gc.collect()

    clear_hardwares()
    clear_hardwares()
As we have uploaded the model to our hub, we can download it and use it.

    from transformers import pipeline

    user = "maximofn"
    checkpoints = f"{user}/{model_name}"
    task = "text-classification"
    classifier = pipeline(task, model=checkpoints, tokenizer=checkpoints)
Now if we want to return the probability of all classes, we simply use the classifier we just instantiated, with the parameter top_k=None
labels = classifier("I love this product", top_k=None)
labels
[{'label': 'LABEL_0', 'score': 0.8419149518013},
 {'label': 'LABEL_1', 'score': 0.09386005252599716},
 {'label': 'LABEL_3', 'score': 0.03624210134148598},
 {'label': 'LABEL_2', 'score': 0.02049318142235279},
 {'label': 'LABEL_4', 'score': 0.0074898069724440575}]
If we only want the class with the highest probability we do the same but with the parameter top_k=1
label = classifier("I love this product", top_k=1)
label
[{'label': 'LABEL_0', 'score': 0.8419149518013}]
And if we want n classes we do the same but with the parameter top_k=n
two_labels = classifier("I love this product", top_k=2)
two_labels
[{'label': 'LABEL_0', 'score': 0.8419149518013},
 {'label': 'LABEL_1', 'score': 0.09386005252599716}]
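To make the effect of top_k concrete, here is a minimal sketch in plain Python that mimics the selection on a list of scores. The helper select_top_k is hypothetical, written only for illustration; it is not part of the transformers API, and the scores are rounded from the output above.

```python
def select_top_k(scores, top_k=None):
    """Return the top_k entries sorted by descending score (all of them if top_k is None)."""
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Scores rounded from the classifier output above
scores = [
    {"label": "LABEL_0", "score": 0.8419},
    {"label": "LABEL_1", "score": 0.0939},
    {"label": "LABEL_3", "score": 0.0362},
    {"label": "LABEL_2", "score": 0.0205},
    {"label": "LABEL_4", "score": 0.0075},
]

all_labels = select_top_k(scores)           # every class, most probable first
best_label = select_top_k(scores, top_k=1)  # only the most probable class
two_best = select_top_k(scores, top_k=2)    # the two most probable classes
```

In every case the result is a list of dictionaries, so even with top_k=1 the label is read as best_label[0]["label"].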
We can also test the model with AutoModelForSequenceClassification and AutoTokenizer.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
user = "maximofn"
checkpoint = f"{user}/{model_name}"

# num_classes was computed earlier from dataset['train']
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes).half().eval().to("cuda")

# Tokenize the text and move it to the same device as the model
tokens = tokenizer.encode("I love this product", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model(tokens)
logits = output.logits
# Convert the logits into one probability per class
labels = torch.softmax(logits, dim=1).cpu().numpy().tolist()
labels[0]
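The last step converts the logits into probabilities with a softmax over the class dimension. As a rough sketch of what that does, here is the same computation in plain Python; the logit values are made up for illustration and are not outputs of the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # subtract the max to avoid overflow
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for the 5 classes (illustrative values only)
example_logits = [2.0, 0.5, -1.0, -0.3, -2.0]
probabilities = softmax(example_logits)

# The predicted class is the index with the highest probability
predicted_class = probabilities.index(max(probabilities))
```

The probabilities are all positive and add up to 1, which is why they can be read as one score per class.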
If you want to test the model further, you can find it on the Hub at Maximofn/GPT2-small-LoRA-finetuned-amazon-reviews-en-classification
Implementation of LoRA in an LLM with PEFT from Hugging Face
We can do the same with the PEFT library from Hugging Face. Let's take a look at it
Login to the Hub
We log in to upload the model to the Hub
from huggingface_hub import notebook_login

notebook_login()
Dataset
We re-download the dataset
from datasets import load_dataset

dataset = load_dataset("mteb/amazon_reviews_multi", "en")
dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})
We create a subset in case you want to test the code with a smaller dataset. In my case I will use 100% of the dataset
percentage = 1
subset_dataset_train = dataset['train'].select(range(int(len(dataset['train']) * percentage)))
subset_dataset_validation = dataset['validation'].select(range(int(len(dataset['validation']) * percentage)))
subset_dataset_test = dataset['test'].select(range(int(len(dataset['test']) * percentage)))
subset_dataset_train, subset_dataset_validation, subset_dataset_test
(Dataset({
     features: ['id', 'text', 'label', 'label_text'],
     num_rows: 200000
 }),
 Dataset({
     features: ['id', 'text', 'label', 'label_text'],
     num_rows: 5000
 }),
 Dataset({
     features: ['id', 'text', 'label', 'label_text'],
     num_rows: 5000
 }))
We obtain the number of classes. To do so we use dataset['train'] and not subset_dataset_train, because if the subset is too small it may not contain examples of every class in the original dataset.
num_classes = len(dataset['train'].unique('label'))
num_classes
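A toy example shows why the full training split is used to count the classes. The labels below are invented for illustration: the first rows happen to contain only two of the five classes, so a small subset would undercount them.

```python
# Hypothetical label column with 5 classes (0-4); values invented for illustration
all_labels = [0, 0, 1, 0, 1, 2, 3, 4, 2, 3]

percentage = 0.5
# Same slicing rule as select(range(int(len(...) * percentage))): keep the first rows
subset_labels = all_labels[: int(len(all_labels) * percentage)]

num_classes_full = len(set(all_labels))       # classes seen in the full split
num_classes_subset = len(set(subset_labels))  # classes seen in the subset only

print(num_classes_full, num_classes_subset)
```

Here the subset only sees classes 0 and 1, so a classification head sized from it would be missing three outputs.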
We create a function that adds a labels field to the dataset. The downloaded dataset has a label field, but the transformers library needs the field to be called labels and not label
def set_labels(example):
    # Copy the original 'label' column into the 'labels' column expected by transformers
    example['labels'] = example['label']
    return example
We apply the function to the dataset
subset_dataset_train = subset_dataset_train.map(set_labels)
subset_dataset_validation = subset_dataset_validation.map(set_labels)
subset_dataset_test = subset_dataset_test.map(set_labels)
subset_dataset_train, subset_dataset_validation, subset_dataset_test
(Dataset({
     features: ['id', 'text', 'label', 'label_text', 'labels'],
     num_rows: 200000
 }),
 Dataset({
     features: ['id', 'text', 'label', 'label_text', 'labels'],
     num_rows: 5000
 }),
 Dataset({
     features: ['id', 'text', 'label', 'label_text', 'labels'],
     num_rows: 5000
 }))
Tokenizer
We instantiate the tokenizer. GPT-2 does not define a padding token, so to avoid errors we use the end-of-sequence token as the padding token
from transformers import AutoTokenizer

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
We create a function for tokenizing the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
We apply the function to the dataset and remove the columns that we do not need
subset_dataset_train = subset_dataset_train.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
subset_dataset_validation = subset_dataset_validation.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
subset_dataset_test = subset_dataset_test.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
subset_dataset_train, subset_dataset_validation, subset_dataset_test
(Dataset({
     features: ['labels', 'input_ids', 'attention_mask'],
     num_rows: 200000
 }),
 Dataset({
     features: ['labels', 'input_ids', 'attention_mask'],
     num_rows: 5000
 }),
 Dataset({
     features: ['labels', 'input_ids', 'attention_mask'],
     num_rows: 5000
 }))
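To illustrate what padding="max_length" and truncation=True do to each example, here is a simplified sketch in plain Python. The function pad_or_truncate is hypothetical and works directly on lists of token ids; the real tokenizer does much more (splitting text into tokens, batching, and so on). The pad id 50256 is GPT-2's end-of-text token id, which is what the padding token becomes after the assignment above. The other token ids are arbitrary.

```python
def pad_or_truncate(input_ids, max_length, pad_token_id):
    """Cut input_ids to max_length, or right-pad with pad_token_id up to max_length."""
    input_ids = input_ids[:max_length]
    num_padding = max_length - len(input_ids)
    # 1 marks real tokens, 0 marks padding positions the model should ignore
    attention_mask = [1] * len(input_ids) + [0] * num_padding
    return {
        "input_ids": input_ids + [pad_token_id] * num_padding,
        "attention_mask": attention_mask,
    }

PAD_TOKEN_ID = 50256  # GPT-2 end-of-text id, reused as padding

# A short sequence is padded, a long one is truncated (max_length=5 to keep it readable)
short_example = pad_or_truncate([10, 11, 12], max_length=5, pad_token_id=PAD_TOKEN_ID)
long_example = pad_or_truncate([10, 11, 12, 13, 14, 15, 16], max_length=5, pad_token_id=PAD_TOKEN_ID)
```

After this step every example has the same length, which is what produces the fixed-size input_ids and attention_mask columns shown above.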
Model
We instantiate the model. Again, to avoid errors, we set the model's padding token id to the end-of-sequence token id.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes)
model.config.pad_token_id = model.config.eos_token_id
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
LoRA with PEFT
Before creating the model with LoRA, let's take a look at its layers
GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=5, bias=False)
)
As we can see, there is only one Linear layer, score (the attention and MLP projections are Conv1D modules), and that is the one we are going to replace.
We can create a LoRA configuration with the PEFT library and then apply LoRA to the model
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    task_type=TaskType.SEQ_CLS,
    target_modules=["score"],
)
With this configuration we have set a rank of 16 and an alpha of 32. We have also added a dropout of 0.1 to the LoRA layers. We have to indicate the task in the LoRA configuration; in this case it is a sequence classification task. Finally, we indicate which layers we want to replace, in this case the score layer.
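As a rough sketch of what these hyperparameters mean, the LoRA paper adds a low-rank update to the frozen layer, computing y = Wx + (alpha / r) · B(Ax), where A has shape r × in_features and B has shape out_features × r. The plain-Python example below uses tiny made-up matrices (rank 2 rather than 16, with the same alpha / r ratio of 2) and omits dropout. It only illustrates the formula and is not PEFT's implementation; the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen output W x plus the scaled low-rank update (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    frozen = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [f + scaling * u for f, u in zip(frozen, update)]

# Toy sizes: in_features=3, out_features=2, rank r=2 (values made up for illustration)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen weights, shape (2, 3)
A = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # trainable, shape (r, 3)
B_zero = [[0.0, 0.0], [0.0, 0.0]]       # B starts at zero, so the update is zero at first
B_trained = [[0.5, 0.0], [0.0, 0.5]]    # hypothetical values after some training
x = [1.0, 2.0, 3.0]

y_initial = lora_forward(W, A, B_zero, x, r=2, lora_alpha=4)
y_adapted = lora_forward(W, A, B_trained, x, r=2, lora_alpha=4)
```

With B at zero the adapted layer gives exactly the output of the frozen layer, so training starts from the behaviour of the original model and only A and B change afterwards.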
We now apply LoRA to the model
from peft import get_peft_model

model = get_peft_model(model, peft_config)
Let's see how many trainable parameters the model has now
model.print_trainable_parameters()
trainable params: 12,368 || all params: 124,456,016 || trainable%: 0.0099
We obtain the same trainable parameters as before
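This figure can be checked by hand. For a linear layer with in_features inputs and out_features outputs, LoRA trains a matrix A of shape r × in_features and a matrix B of shape out_features × r. A quick sketch of the arithmetic, using the sizes of the score layer from the model printout and the total reported above:

```python
r = 16
in_features = 768   # input size of the score layer
out_features = 5    # number of classes

params_A = r * in_features    # 16 * 768 = 12,288
params_B = out_features * r   # 5 * 16 = 80
trainable_params = params_A + params_B

all_params = 124_456_016      # total reported by print_trainable_parameters()
trainable_percent = 100 * trainable_params / all_params

print(f"trainable params: {trainable_params:,} || trainable%: {trainable_percent:.4f}")
```

The sum is 12,368, which matches the number of trainable parameters printed by PEFT.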
Training
Once the model has been instantiated with LoRA, let's train it as usual
from transformers import TrainingArguments

metric_name = "accuracy"
model_name = "GPT2-small-PEFT-LoRA-finetuned-amazon-reviews-en-classification"
LR = 2e-5
BS_TRAIN = 400
BS_EVAL = 400
EPOCHS = 3
WEIGHT_DECAY = 0.01

training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
    logging_dir="./runs",
)
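The lr_scheduler_type="cosine" and warmup_ratio=0.1 settings shape how the learning rate evolves during training. The sketch below assumes the usual behaviour of a linear warmup followed by a half-cosine decay to zero, and assumes training on a single device with no gradient accumulation, so that one epoch is 200000 / 400 = 500 optimizer steps. The function lr_at_step is hypothetical, written for illustration only.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

LR = 2e-5
EPOCHS = 3
steps_per_epoch = 200_000 // 400             # assumes one device, no gradient accumulation
total_steps = steps_per_epoch * EPOCHS       # 1500 steps in total
warmup_steps = math.ceil(0.1 * total_steps)  # 10% of the steps are warmup

# Learning rate at the start, mid-warmup, the peak, mid-decay and the end
for step in (0, 75, 150, 825, 1500):
    print(step, lr_at_step(step, total_steps, warmup_steps, LR))
```

Under these assumptions the learning rate climbs from 0 to 2e-5 during the first 150 steps and then decays smoothly back to 0 by step 1500.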
import numpy as np
from evaluate import load

metric = load("accuracy")

def compute_metrics(eval_pred):
    print(eval_pred)
    predictions, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
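For reference, here is roughly what compute_metrics calculates, written in plain Python without NumPy or the evaluate library: the predicted class is the index of the largest logit, and accuracy is the fraction of predictions that match the references. The logits and labels are made up for illustration, and the helper names are hypothetical.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda index: values[index])

def accuracy(logits_batch, references):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits_batch]
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

# Hypothetical logits for 4 examples and 3 classes
logits_batch = [
    [2.0, 0.1, -1.0],  # predicted class 0
    [0.3, 1.5, 0.2],   # predicted class 1
    [0.0, 0.2, 3.0],   # predicted class 2
    [1.0, 0.9, 0.8],   # predicted class 0
]
references = [0, 1, 1, 0]

result = accuracy(logits_batch, references)
```

Three of the four predictions match their reference, so the accuracy in this toy case is 0.75.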
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=subset_dataset_train,
    eval_dataset=subset_dataset_validation,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
Evaluation
Once trained, we evaluate on the test dataset
trainer.evaluate(eval_dataset=subset_dataset_test)
Publish the model
We create a model card
trainer.create_model_card()
We publish it
trainer.push_to_hub()
CommitInfo(commit_url='https://huggingface.co/Maximofn/GPT2-small-PEFT-LoRA-finetuned-amazon-reviews-en-classification/commit/839066c2bde02689a6b3f5624ac25f89c4de217d', commit_message='End of training', commit_description='', oid='839066c2bde02689a6b3f5624ac25f89c4de217d', pr_url=None, pr_revision=None, pr_num=None)
PEFT-trained model test
We free as much memory as possible
import torch
import gc

def clear_hardwares():
    torch.clear_autocast_cache()
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
    gc.collect()

clear_hardwares()
clear_hardwares()
As we have uploaded the model to our hub we can download it and use it.
from transformers import pipeline

user = "maximofn"
checkpoints = f"{user}/{model_name}"
task = "text-classification"
classifier = pipeline(task, model=checkpoints, tokenizer=checkpoints)
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Now if we want to return the probability of all classes, we simply use the classifier we just instantiated, with the parameter top_k=None
labels = classifier("I love this product", top_k=None)
labels
[{'label': 'LABEL_1', 'score': 0.9979197382926941},
 {'label': 'LABEL_0', 'score': 0.002080311067402363}]
If we only want the class with the highest probability we do the same but with the parameter top_k=1
label = classifier("I love this product", top_k=1)
label
[{'label': 'LABEL_1', 'score': 0.9979197382926941}]
And if we want n classes we do the same but with the parameter top_k=n
two_labels = classifier("I love this product", top_k=2)
two_labels
[{'label': 'LABEL_1', 'score': 0.9979197382926941},
 {'label': 'LABEL_0', 'score': 0.002080311067402363}]
If you want to test the model further, you can find it on the Hub at Maximofn/GPT2-small-PEFT-LoRA-finetuned-amazon-reviews-en-classification