Fine tuning SMLs

14 of july of 2024

Fine tuning SMLs with Hugging Face

In this post we are going to see how to fine tune small language models, we are going to see how to fine tune text classification and text generation. First we are going to see how to do it with the Hugging Face libraries, since Hugging Face has become a very important player in the AI ecosystem right now.

But although Hugging Face libraries are very important and useful, it is very important to know how the training is actually done and what is going on underneath, so let's repeat the training for classification and text generation but with Pytorch

This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.

Fine tuning for text classification with Hugging Face

To be able to upload the training result to the hub we must first log in, for this we need a token

To create a token, go to the setings/tokens page of your account, you will see something like this

User-Access-Token-dark

Click on New token and a window will appear to create a new token.

new-token-dark

We name the token and create it with the write role, or with the Fine-grained role, which allows us to select exactly what permissions the token will have.

Once created, we copy and paste it below

	
		from huggingface_hub import notebook_login
      notebook_login()

Dataset

Now we download a dataset, in this case we are going to download one of reviews from Amazon

	
		from huggingface_hub import notebook_login
      notebook_login()
from datasets import load_dataset
      
      dataset = load_dataset("mteb/amazon_reviews_multi", "en")

Let's take a look at it

	
		from huggingface_hub import notebook_login
notebook_login()
from datasets import load_dataset
dataset = load_dataset("mteb/amazon_reviews_multi", "en")
dataset

	
		DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})

We see that you have a training set with 200,000 samples, a validation set with 5,000 samples and a test set with 5,000 samples.

Let's see an example of the training set

	
		from random import randint
idx = randint(0, len(dataset['train']) - 1)
dataset['train'][idx]

	
		{'id': 'en_0907914',
 'text': 'Mixed with fir it’s passable

Not the scent I had hoped for . Love the scent of cedar, but this one missed',
 'label': 3,
 'label_text': '3'}

We see that it has the review in the text field and the score given by the user in the label field.

As we are going to make a text classification model, we need to know how many classes we are going to have

	
		num_classes = len(dataset['train'].unique('label'))
num_classes

We are going to have 5 classes, now we are going to see the minimum value of these classes to know if the score starts at 0 or 1. For this we use the unique method

	
		dataset.unique('label')

	
		{'train': [0, 1, 2, 3, 4],
 'validation': [0, 1, 2, 3, 4],
 'test': [0, 1, 2, 3, 4]}

The minimum value will be 0

To train, the labels have to be in a field called labels, while in our dataset it is in a field called label, so we create the new field lables with the same value as label.

We create a function that does what we want

	
		def set_labels(example):
          example['labels'] = example['label']
          return example

We apply the function to the dataset

	
		def set_labels(example):
          example['labels'] = example['label']
          return example
dataset = dataset.map(set_labels)

Let's see how the dataset looks like

	
		def set_labels(example):
    example['labels'] = example['label']
    return example
dataset = dataset.map(set_labels)
dataset['train'][idx]

	
		{'id': 'en_0907914',
 'text': 'Mixed with fir it’s passable

Not the scent I had hoped for . Love the scent of cedar, but this one missed',
 'label': 3,
 'label_text': '3',
 'labels': 3}

Tokenizer

As in the dataset we have the reviews in text, we need to tokenize them in order to put the tokens into the model.

	
		from transformers import AutoTokenizer
      
      checkpoint = "openai-community/gpt2"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Now we create a function to tokenize the text. We are going to make it so that all sentences have the same length, so that the tokenizer will truncate when necessary and add padding tokens when necessary. We also tell it to return pytorch tensors.

We make the length of each statement 768 tokens because we are using the small GPT2 model, which as we saw in the GPT2 post has an embedding dimension of 768 tokens.

	
		from transformers import AutoTokenizer
      
      checkpoint = "openai-community/gpt2"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(examples):
          return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")

Let's try tokenizing a text

	
		from transformers import AutoTokenizer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
tokens = tokenize_function(dataset['train'][idx])

	
		---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 tokens = tokenize_function(dataset['train'][idx])
Cell In[10], line 2, in tokenize_function(examples)
      1 def tokenize_function(examples):
----> 2     return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2883, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2881     if not self._in_target_context_manager:
   2882         self._switch_to_input_mode()
-> 2883     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2884 if text_target is not None:
   2885     self._switch_to_target_mode()
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2989, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2969     return self.batch_encode_plus(
   2970         batch_text_or_text_pairs=batch_text_or_text_pairs,
   2971         add_special_tokens=add_special_tokens,
   (...)
   2986         **kwargs,
   2987     )
   2988 else:
-> 2989     return self.encode_plus(
   2990         text=text,
   2991         text_pair=text_pair,
   2992         add_special_tokens=add_special_tokens,
   2993         padding=padding,
   2994         truncation=truncation,
   2995         max_length=max_length,
   2996         stride=stride,
   2997         is_split_into_words=is_split_into_words,
   2998         pad_to_multiple_of=pad_to_multiple_of,
   2999         return_tensors=return_tensors,
   3000         return_token_type_ids=return_token_type_ids,
   3001         return_attention_mask=return_attention_mask,
   3002         return_overflowing_tokens=return_overflowing_tokens,
   3003         return_special_tokens_mask=return_special_tokens_mask,
   3004         return_offsets_mapping=return_offsets_mapping,
   3005         return_length=return_length,
   3006         verbose=verbose,
   3007         **kwargs,
   3008     )
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:3053, in PreTrainedTokenizerBase.encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   3032 """
   3033 Tokenize and prepare for the model a sequence or a pair of sequences.
   3034 
   (...)
   3049         method).
   3050 """
   3052 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 3053 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
   3054     padding=padding,
   3055     truncation=truncation,
   3056     max_length=max_length,
   3057     pad_to_multiple_of=pad_to_multiple_of,
   3058     verbose=verbose,
   3059     **kwargs,
   3060 )
   3062 return self._encode_plus(
   3063     text=text,
   3064     text_pair=text_pair,
   (...)
   3080     **kwargs,
   3081 )
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2788, in PreTrainedTokenizerBase._get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
   2786 # Test if we have a padding token
   2787 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
-> 2788     raise ValueError(
   2789         "Asking to pad but the tokenizer does not have a padding token. "
   2790         "Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` "
   2791         "or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`."
   2792     )
   2794 # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
   2795 if (
   2796     truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
   2797     and padding_strategy != PaddingStrategy.DO_NOT_PAD
   (...)
   2800     and (max_length % pad_to_multiple_of != 0)
   2801 ):
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

We get an error because the GPT2 tokenizer does not have a token for padding and asks us to assign one, it also suggests us to do tokenizer.pad_token = tokenizer.eos_token, so we do it

	
		tokenizer.pad_token = tokenizer.eos_token

We test again the tokenization function

	
		tokenizer.pad_token = tokenizer.eos_token
tokens = tokenize_function(dataset['train'][idx])
tokens['input_ids'].shape, tokens['attention_mask'].shape

	
		(torch.Size([1, 768]), torch.Size([1, 768]))

Now that we have checked that the function tokenizes well, we apply this function to the dataset, but we also apply it in batches so that it runs faster

We also take advantage of this and eliminate the columns we do not need

	
		dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])

Now let's see how the dataset looks like

	
		dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
dataset

	
		DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'labels'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'labels'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'labels'],
        num_rows: 5000
    })
})

We can see that we have the fields 'labels', 'input_ids' and 'attention_mask', which is what we are interested in training

Model

We instantiate a model for sequence classification and we indicate the number of classes we have

	
		from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes)

	
		Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It tells us that the weights of the score layer have been initialized randomly and that we have to retrain them, let's see why this happens

The GPT2 model would look like this

	
		from transformers import AutoModelForCausalLM
      
      casual_model = AutoModelForCausalLM.from_pretrained(checkpoint)

While the GPT2 model for generating text is as follows

Let's take a look at its architecture

	
		from transformers import AutoModelForCausalLM
casual_model = AutoModelForCausalLM.from_pretrained(checkpoint)
casual_model

	
		GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

And now the architecture of the model we are going to use to classify the reviews

	
		model

	
		GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=5, bias=False)
)

There are two things to mention here

The first is that in both, the first layer has dimensions of 50257x768, which corresponds to 50257 possible tokens of the GPT2 vocabulary and 768 dimensions of the embedding, so we have done well to tokenize the reviews with a size of 768 tokens.
The second is that the casual model (the text generation model) has at the end a Linear layer that generates 50257 values, that is, it is in charge of predicting the next token and gives a value to each possible token. While the classification model has a Linear layer that only generates 5 values, one for each class, which will give us the probability that the review belongs to each class.

That is why we got the message that the weights of the score layer had been initialized randomly, because the transformers library has removed the Linear layer of 768x50257 and added a Linear layer of 768x5, it has initialized it with random values and we have to train it for our particular problem.

We delete the casual model, because we are not going to use it.

	
		del casual_model

Trainer

Let's now configure the training arguments

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )

Define a metric for the validation dataloader

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)

We now define the trainer

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )

We train

trainer.train()

  0%|          | 0/600000 [00:00<?, ?it/s]

      ---------------------------------------------------------------------------
      ValueError                                Traceback (most recent call last)
      Cell In[21], line 1
      ----> 1 trainer.train()
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:1876, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
         1873 try:
         1874     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
         1875     hf_hub_utils.disable_progress_bars()
      -> 1876     return inner_training_loop(
         1877         args=args,
         1878         resume_from_checkpoint=resume_from_checkpoint,
         1879         trial=trial,
         1880         ignore_keys_for_eval=ignore_keys_for_eval,
         1881     )
         1882 finally:
         1883     hf_hub_utils.enable_progress_bars()
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:2178, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
         2175     rng_to_sync = True
         2177 step = -1
      -> 2178 for step, inputs in enumerate(epoch_iterator):
         2179     total_batched_samples += 1
         2181     if self.args.include_num_input_tokens_seen:
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
          452 # We iterate one batch ahead to check when we are at the end
          453 try:
      --> 454     current_batch = next(dataloader_iter)
          455 except StopIteration:
          456     yield
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
          628 if self._sampler_iter is None:
          629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
          630     self._reset()  # type: ignore[call-arg]
      --> 631 data = self._next_data()
          632 self._num_yielded += 1
          633 if self._dataset_kind == _DatasetKind.Iterable and \
          634         self._IterableDataset_len_called is not None and \
          635         self._num_yielded > self._IterableDataset_len_called:
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/torch/utils/data/dataloader.py:675, in _SingleProcessDataLoaderIter._next_data(self)
          673 def _next_data(self):
          674     index = self._next_index()  # may raise StopIteration
      --> 675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
          676     if self._pin_memory:
          677         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
           52 else:
           53     data = self.dataset[possibly_batched_index]
      ---> 54 return self.collate_fn(data)
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/data/data_collator.py:271, in DataCollatorWithPadding.__call__(self, features)
          270 def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
      --> 271     batch = pad_without_fast_tokenizer_warning(
          272         self.tokenizer,
          273         features,
          274         padding=self.padding,
          275         max_length=self.max_length,
          276         pad_to_multiple_of=self.pad_to_multiple_of,
          277         return_tensors=self.return_tensors,
          278     )
          279     if "label" in batch:
          280         batch["labels"] = batch["label"]
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/data/data_collator.py:66, in pad_without_fast_tokenizer_warning(tokenizer, *pad_args, **pad_kwargs)
           63 tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
           65 try:
      ---> 66     padded = tokenizer.pad(*pad_args, **pad_kwargs)
           67 finally:
           68     # Restore the state of the warning.
           69     tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = warning_state
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:3299, in PreTrainedTokenizerBase.pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
         3297 # The model's main input name, usually `input_ids`, has be passed for padding
         3298 if self.model_input_names[0] not in encoded_inputs:
      -> 3299     raise ValueError(
         3300         "You should supply an encoding or a list of encodings to this method "
         3301         f"that includes {self.model_input_names[0]}, but you provided {list(encoded_inputs.keys())}"
         3302     )
         3304 required_input = encoded_inputs[self.model_input_names[0]]
         3306 if required_input is None or (isinstance(required_input, Sized) and len(required_input) == 0):
      
      ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label', 'labels']

We get an error again because the model does not have a padding token assigned, so as with the tokenizer we assign it to it

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
model.config.pad_token_id = model.config.eos_token_id

We recreate the trainer arguments with the new model, which now has a padding token, the trainer and train again.

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
model.config.pad_token_id = model.config.eos_token_id
training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
          logging_dir="./runs",
      )
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )

Now that we have seen that all is well, we can train.

trainer.train()

[ 2250/21429 45:38 < 6:29:27, 0.82 it/s, Epoch 0.31/3]

Epoch	Training Loss	Validation Loss

[21429/21429 7:19:25, Epoch 3/3]

Epoch	Training Loss	Validation Loss	Accuracy
1	0.807400	0.820341	0.652000
2	0.751900	0.802189	0.654600
3	0.718100	0.810221	0.657800

<transformers.trainer_utils.EvalPrediction object at 0x782767ea1450>
      <transformers.trainer_utils.EvalPrediction object at 0x782767eeefe0>
      <transformers.trainer_utils.EvalPrediction object at 0x782767eecfd0>

Out[28]:

TrainOutput(global_step=21429, training_loss=0.7846888848762739, metrics={'train_runtime': 26367.7801, 'train_samples_per_second': 22.755, 'train_steps_per_second': 0.813, 'total_flos': 2.35173445632e+17, 'train_loss': 0.7846888848762739, 'epoch': 3.0})

Evaluation

Once trained we evaluate on the test dataset

trainer.evaluate(eval_dataset=dataset['test'])

[125/125 01:15]

<transformers.trainer_utils.EvalPrediction object at 0x7826ddfded40>

Out[29]:

{'eval_loss': 0.7973636984825134,
       'eval_accuracy': 0.6626,
       'eval_runtime': 76.3016,
       'eval_samples_per_second': 65.529,
       'eval_steps_per_second': 1.638,
       'epoch': 3.0}

Publish the model

Now that we have our model trained, we can share it with the world, so first we create a model card.

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
model.config.pad_token_id = model.config.eos_token_id
training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
          logging_dir="./runs",
      )
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
trainer.evaluate(eval_dataset=dataset['test'])
trainer.create_model_card()

And now we can publish it. As the first thing we have done is to log in with the huggingface hub, we can upload it to our hub without any problem.

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
model.config.pad_token_id = model.config.eos_token_id
training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
          logging_dir="./runs",
      )
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
trainer.evaluate(eval_dataset=dataset['test'])
trainer.create_model_card()
trainer.push_to_hub()

Model use

We clean as much as possible

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
model.config.pad_token_id = model.config.eos_token_id
training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
          logging_dir="./runs",
      )
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
trainer.evaluate(eval_dataset=dataset['test'])
trainer.create_model_card()
trainer.push_to_hub()
import torch
      import gc
      
      
      def clear_hardwares():
          torch.clear_autocast_cache()
          torch.cuda.ipc_collect()
          torch.cuda.empty_cache()
          gc.collect()
      
      
      clear_hardwares()
      clear_hardwares()

As we have uploaded the model to our hub we can download it and use it.

	
		del casual_model
from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 40
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
      )
import numpy as np
      from evaluate import load
      
      metric = load("accuracy")
      
      def compute_metrics(eval_pred):
          print(eval_pred)
          predictions, labels = eval_pred
          predictions = np.argmax(predictions, axis=1)
          return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
model.config.pad_token_id = model.config.eos_token_id
training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          push_to_hub=True,
          logging_dir="./runs",
      )
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=dataset['train'],
          eval_dataset=dataset['validation'],
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
trainer.train()
trainer.evaluate(eval_dataset=dataset['test'])
trainer.create_model_card()
trainer.push_to_hub()
import torch
      import gc
      
      
      def clear_hardwares():
          torch.clear_autocast_cache()
          torch.cuda.ipc_collect()
          torch.cuda.empty_cache()
          gc.collect()
      
      
      clear_hardwares()
      clear_hardwares()
from transformers import pipeline
      
      user = "maximofn"
      checkpoints = f"{user}/{model_name}"
      task = "text-classification"
      classifier = pipeline(task, model=checkpoints, tokenizer=checkpoints)

Now if we want to return the probability of all classes, we simply use the classifier we just instantiated, with the parameter top_k=None.

	
		del casual_model
from transformers import TrainingArguments
metric_name = "accuracy"
model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
LR = 2e-5
BS_TRAIN = 28
BS_EVAL = 40
EPOCHS = 3
WEIGHT_DECAY = 0.01
training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio = 0.1,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)
import numpy as np
from evaluate import load
metric = load("accuracy")
def compute_metrics(eval_pred):
    print(eval_pred)
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
model.config.pad_token_id = model.config.eos_token_id
training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio = 0.1,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
    logging_dir="./runs",
)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate(eval_dataset=dataset['test'])
trainer.create_model_card()
trainer.push_to_hub()
import torch
import gc
def clear_hardwares():
    torch.clear_autocast_cache()
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
    gc.collect()
clear_hardwares()
clear_hardwares()
from transformers import pipeline
user = "maximofn"
checkpoints = f"{user}/{model_name}"
task = "text-classification"
classifier = pipeline(task, model=checkpoints, tokenizer=checkpoints)
labels = classifier("I love this product", top_k=None)
labels

	
		  0%|          | 0/600000 [00:00<?, ?it/s]
[{'label': 'LABEL_4', 'score': 0.8253807425498962},
 {'label': 'LABEL_3', 'score': 0.15411493182182312},
 {'label': 'LABEL_2', 'score': 0.013907806016504765},
 {'label': 'LABEL_0', 'score': 0.003939222544431686},
 {'label': 'LABEL_1', 'score': 0.0026572425849735737}]

If we only want the class with the highest probability we do the same but with the parameter top_k=1.

	
		label = classifier("I love this product", top_k=1)
label

	
		[{'label': 'LABEL_4', 'score': 0.8253807425498962}]

And if we want n classes we do the same but with the parameter top_k=n.

	
		two_labels = classifier("I love this product", top_k=2)
two_labels

	
		[{'label': 'LABEL_4', 'score': 0.8253807425498962},
 {'label': 'LABEL_3', 'score': 0.15411493182182312}]

We can also test the model with Automodel and AutoTokenizer.

	
		from transformers import AutoTokenizer, AutoModelForSequenceClassification
      import torch
      
      model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
      user = "maximofn"
      checkpoint = f"{user}/{model_name}"
      num_classes = 5
      
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
      model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes).half().eval().to("cuda")

	
		from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
user = "maximofn"
checkpoint = f"{user}/{model_name}"
num_classes = 5
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes).half().eval().to("cuda")
tokens = tokenizer.encode("I love this product", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model(tokens)
logits = output.logits
lables = torch.softmax(logits, dim=1).cpu().numpy().tolist()
lables[0]

	
		[0.003963470458984375,
 0.0026721954345703125,
 0.01397705078125,
 0.154541015625,
 0.82470703125]

If you want to test the model further you can see it in Maximofn/GPT2-small-finetuned-amazon-reviews-en-classification

Fine tuning for text generation with Hugging Face

To make sure I don't have VRAM memory problems, I reboot the notebook.

In order to upload the training result to the hub we must first log in, for this we need a token

To create a token, go to the setings/tokens page of your account, you will see something like this

User-Access-Token-dark

Click on New token and a window will appear to create a new token.

new-token-dark

We name the token and create it with the write role, or with the Fine-grained role, which allows us to select exactly what permissions the token will have.

Once created, we copy and paste it below

	
		from huggingface_hub import notebook_login
      notebook_login()

Dataset

Let's use a dataset of English jokes

	
		from huggingface_hub import notebook_login
notebook_login()
from datasets import load_dataset
jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes

	
		DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})

Let's take a look at it

	
		jokes

	
		DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})

We see that it is a single training set of more than 200 thousand jokes. So later we will have to divide it into training and evaluation.

Let's see a sample

	
		from random import randint
idx = randint(0, len(jokes['train']) - 1)
jokes['train'][idx]

	
		{'ID': 198387,
 'Joke': 'My hot dislexic co-worker said she had an important massage to give me in her office... When I got there, she told me it can wait until I put on some clothes.'}

We see that it has an ID of the joke that does not interest us at all and the joke itself.

In case you have low GPU memory I will make a subset of the dataset, choose the percentage of jokes you want to use.

	
		percent_of_train_dataset = 1 # If you want 50% of the dataset, set this to 0.5
subset_dataset = jokes["train"].select(range(int(len(jokes["train"]) * percent_of_train_dataset)))
subset_dataset

	
		Dataset({
    features: ['ID', 'Joke'],
    num_rows: 231657
})

We now divide the subset into a training set and a validation set.

	
		percent_of_train_dataset = 0.90
split_dataset = subset_dataset.train_test_split(train_size=int(subset_dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)
train_dataset = split_dataset["train"]
validation_test_dataset = split_dataset["test"]
split_dataset = validation_test_dataset.train_test_split(train_size=int(validation_test_dataset.num_rows * 0.5), seed=19, shuffle=False)
validation_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]
print(f"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(validation_dataset)}. Size of the test set: {len(test_dataset)}")

	
		Size of the train set: 208491. Size of the validation set: 11583. Size of the test set: 11583

Tokenizer

Instantiate the tokenizer. We instantiate the padding token of the tokenizer so that it does not give us error as before.

	
		from transformers import AutoTokenizer
      
      checkpoints = "openai-community/gpt2"
      
      tokenizer = AutoTokenizer.from_pretrained(checkpoints)
      tokenizer.pad_token = tokenizer.eos_token
      tokenizer.padding_side = "right"

We are going to add two new tokens for joke start and joke end to have more control

	
		from transformers import AutoTokenizer
checkpoints = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
new_tokens = ['<SJ>', '<EJ>'] # Start and end of joke tokens
num_added_tokens = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added_tokens} tokens")

	
		Added 2 tokens

We create a function to add the new tokens to the sentences

	
		joke_column = "Joke"
      
      def format_joke(example):
          example[joke_column] = '<SJ> ' + example['Joke'] + ' <EJ>'
          return example

Select the columns we do not need

	
		joke_column = "Joke"
def format_joke(example):
    example[joke_column] = '<SJ> ' + example['Joke'] + ' <EJ>'
    return example
remove_columns = [column for column in train_dataset.column_names if column != joke_column]
remove_columns

	
		['ID']

Format the dataset and delete the columns that are not needed

	
		train_dataset = train_dataset.map(format_joke, remove_columns=remove_columns)
validation_dataset = validation_dataset.map(format_joke, remove_columns=remove_columns)
test_dataset = test_dataset.map(format_joke, remove_columns=remove_columns)
train_dataset, validation_dataset, test_dataset

	
		(Dataset({
     features: ['Joke'],
     num_rows: 208491
 }),
 Dataset({
     features: ['Joke'],
     num_rows: 11583
 }),
 Dataset({
     features: ['Joke'],
     num_rows: 11583
 }))

Now we create a function to tokenize the jokes

	
		def tokenize_function(examples):
          return tokenizer(examples[joke_column], padding="max_length", truncation=True, max_length=768, return_tensors="pt")

We tokenize the dataset and delete the column with the text

	
		def tokenize_function(examples):
    return tokenizer(examples[joke_column], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
validation_dataset = validation_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
train_dataset, validation_dataset, test_dataset

	
		(Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 208491
 }),
 Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 11583
 }),
 Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 11583
 }))

Model

Now we instantiate the model for text generation and assign the end of string token to the pading token

	
		from transformers import AutoModelForCausalLM
      
      model = AutoModelForCausalLM.from_pretrained(checkpoints)
      model.config.pad_token_id = model.config.eos_token_id

We see the size of the model vocabulary

	
		from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(checkpoints)
model.config.pad_token_id = model.config.eos_token_id
vocab_size = model.config.vocab_size
vocab_size

It has 50257 tokens, which is the size of the GPT2 vocabulary. But since we said that we were going to create two new tokens with the start of joke and the end of joke, we added them to the model

	
		model.resize_token_embeddings(len(tokenizer))
new_vocab_size = model.config.vocab_size
print(f"Old vocab size: {vocab_size}. New vocab size: {new_vocab_size}. Added {new_vocab_size - vocab_size} tokens")

	
		Old vocab size: 50257. New vocab size: 50259. Added 2 tokens

Two new tokens have been added

Training

We configure the training parameters

	
		from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM"
      output_dir = f"./training_results"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 32
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      WARMUP_STEPS = 100
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          warmup_steps=WARMUP_STEPS,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          # metric_for_best_model=metric_name,
          push_to_hub=True,
      )

Now we do not use metric_for_best_model, after defining the trainer we explain why

We define the trainer

	
		from transformers import TrainingArguments
      
      metric_name = "accuracy"
      model_name = "GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM"
      output_dir = f"./training_results"
      LR = 2e-5
      BS_TRAIN = 28
      BS_EVAL = 32
      EPOCHS = 3
      WEIGHT_DECAY = 0.01
      WARMUP_STEPS = 100
      
      training_args = TrainingArguments(
          model_name,
          eval_strategy="epoch",
          save_strategy="epoch",
          learning_rate=LR,
          per_device_train_batch_size=BS_TRAIN,
          per_device_eval_batch_size=BS_EVAL,
          warmup_steps=WARMUP_STEPS,
          num_train_epochs=EPOCHS,
          weight_decay=WEIGHT_DECAY,
          lr_scheduler_type="cosine",
          warmup_ratio = 0.1,
          fp16=True,
          load_best_model_at_end=True,
          # metric_for_best_model=metric_name,
          push_to_hub=True,
      )
from transformers import Trainer
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=train_dataset,
          eval_dataset=validation_dataset,
          tokenizer=tokenizer,
          # compute_metrics=compute_metrics,
      )

In this case we do not pass a compute_metrics function, if it is not passed, during the evaluation the loss will be used to evaluate the model. That is why when defining the arguments we do not define metric_for_best_model, because we are not going to use a metric to evaluate the model, but the loss.

We train

trainer.train()

  0%|          | 0/625473 [00:00<?, ?it/s]

      ---------------------------------------------------------------------------
      ValueError                                Traceback (most recent call last)
      Cell In[19], line 1
      ----> 1 trainer.train()
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
         1883         hf_hub_utils.enable_progress_bars()
         1884 else:
      -> 1885     return inner_training_loop(
         1886         args=args,
         1887         resume_from_checkpoint=resume_from_checkpoint,
         1888         trial=trial,
         1889         ignore_keys_for_eval=ignore_keys_for_eval,
         1890     )
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
         2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
         2215 with self.accelerator.accumulate(model):
      -> 2216     tr_loss_step = self.training_step(model, inputs)
         2218 if (
         2219     args.logging_nan_inf_filter
         2220     and not is_torch_xla_available()
         2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
         2222 ):
         2223     # if loss is nan or inf simply add the average of previous logged losses
         2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:3238, in Trainer.training_step(self, model, inputs)
         3235     return loss_mb.reduce_mean().detach().to(self.args.device)
         3237 with self.compute_loss_context_manager():
      -> 3238     loss = self.compute_loss(model, inputs)
         3240 del inputs
         3241 torch.cuda.empty_cache()
      
      File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:3282, in Trainer.compute_loss(self, model, inputs, return_outputs)
         3280 else:
         3281     if isinstance(outputs, dict) and "loss" not in outputs:
      -> 3282         raise ValueError(
         3283             "The model did not return a loss from the inputs, only the following keys: "
         3284             f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
         3285         )
         3286     # We don't use .loss here since the model may return tuples instead of ModelOutput.
         3287     loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
      
      ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask.

As we can see, it gives us an error, it tells us that the model does not return the value of the loss, which is key to be able to train, let's see why.

First, let's see what an example of the dataset looks like.

	
		from transformers import TrainingArguments
metric_name = "accuracy"
model_name = "GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM"
output_dir = f"./training_results"
LR = 2e-5
BS_TRAIN = 28
BS_EVAL = 32
EPOCHS = 3
WEIGHT_DECAY = 0.01
WARMUP_STEPS = 100
training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    warmup_steps=WARMUP_STEPS,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio = 0.1,
    fp16=True,
    load_best_model_at_end=True,
    # metric_for_best_model=metric_name,
    push_to_hub=True,
)
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    # compute_metrics=compute_metrics,
)
trainer.train()
idx = randint(0, len(train_dataset) - 1)
sample = train_dataset[idx]
sample

	
		  0%|          | 0/625473 [00:00<?, ?it/s]
{'input_ids': [50257,
  4162,
  750,
  262,
  18757,
  6451,
  2245,
  2491,
  30,
  4362,
  340,
  373,
  734,
  10032,
  13,
  220,
  50258,
  50256,
  50256,
  ...,
  50256,
  50256,
  50256],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  ...,
  0,
  0,
  0]}

As we can see, we have a dictionary with the input_ids and the attention_mask, if we pass it to the model we obtain this

	
		import torch
output = model(
    input_ids=torch.Tensor(sample["input_ids"]).long().unsqueeze(0).to(model.device),
    attention_mask=torch.Tensor(sample["attention_mask"]).long().unsqueeze(0).to(model.device),
)
print(output.loss)

	
		None

As we can see it does not return the value of the loss because it is waiting for a value for labels, which we have not passed it. In the previous example, in which we did fine tuning for text classification, we said that the labels had to be passed to a field of the dataset called labels, but in this case we do not have that field in the dataset.

If we now assign the lables to the input_ids and look again at the input_ids loss

	
		import torch
output = model(
    input_ids=torch.Tensor(sample["input_ids"]).long().unsqueeze(0).to(model.device),
    attention_mask=torch.Tensor(sample["attention_mask"]).long().unsqueeze(0).to(model.device),
    labels=torch.Tensor(sample["input_ids"]).long().unsqueeze(0).to(model.device)
)
print(output.loss)

	
		tensor(102.1873, device='cuda:0', grad_fn=<NllLossBackward0>)

Now we get a loss.

So we have two options, add a labels field to the dataset, with the values of input_ids or use a function of the transformers library called data_collator, in this case we will use DataCollatorForLanguageModeling. Let's take a look at it

	
		from transformers import DataCollatorForLanguageModeling
      
      my_data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

We pass the sample sample through this data_collator.

	
		from transformers import DataCollatorForLanguageModeling
      
      my_data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
collated_sample = my_data_collator([sample]).to(model.device)

We see how the output is

	
		from transformers import DataCollatorForLanguageModeling
my_data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
collated_sample = my_data_collator([sample]).to(model.device)
for key, value in collated_sample.items():
    print(f"{key} ({value.shape}): {value}")

	
		input_ids (torch.Size([1, 768])): tensor([[50257,  4162,   750,   262, 18757,  6451,  2245,  2491,    30,  4362,
           340,   373,   734, 10032,    13,   220, 50258, 50256, ..., 50256, 50256]],
       device='cuda:0')
attention_mask (torch.Size([1, 768])): tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..., 0, 0]],
       device='cuda:0')
labels (torch.Size([1, 768])): tensor([[50257,  4162,   750,   262, 18757,  6451,  2245,  2491,    30,  4362,
           340,   373,   734, 10032,    13,   220, 50258,  -100,  ...,  -100,  -100]],
       device='cuda:0')

As you can see, the data_collator has created a labels field and assigned it the values of input_ids. The tokens that are masked have been assigned the value -100. This is because when we define the data_collator we pass the parameter mlm=False, which means that we are not doing Masked Language Modeling, but Language Modeling, that is why it does not mask any original token.

Let's see if now we get a loss with this data_collator.

	
		output = model(**collated_sample)
output.loss

	
		tensor(102.7181, device='cuda:0', grad_fn=<NllLossBackward0>)

So we redefine the trainer with the data_collator and we train again

	
		from transformers import DataCollatorForLanguageModeling
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=train_dataset,
          eval_dataset=validation_dataset,
          tokenizer=tokenizer,
          data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
      )

trainer.train()

[22341/22341 2:33:28, Epoch 3/3]

Epoch	Training Loss	Validation Loss
1	3.386600	3.258979
2	3.259900	3.199673
3	3.212600	3.192009

There were missing keys in the checkpoint model loaded: ['lm_head.weight'].

TrainOutput(global_step=22341, training_loss=3.505178199598342, metrics={'train_runtime': 9209.5353, 'train_samples_per_second': 67.916, 'train_steps_per_second': 2.426, 'total_flos': 2.45146666696704e+17, 'train_loss': 3.505178199598342, 'epoch': 3.0})

Evaluation

Once the model is trained we evaluate the model on the test dataset

trainer.evaluate(eval_dataset=test_dataset)

[362/362 01:04]

{'eval_loss': 3.201305866241455,
       'eval_runtime': 65.0033,
       'eval_samples_per_second': 178.191,
       'eval_steps_per_second': 5.569,
       'epoch': 3.0}

Publish the model

We create the model card

	
		from transformers import DataCollatorForLanguageModeling
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=train_dataset,
          eval_dataset=validation_dataset,
          tokenizer=tokenizer,
          data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
      )
trainer.train()
trainer.evaluate(eval_dataset=test_dataset)
trainer.create_model_card()

We publish it

trainer.push_to_hub()

events.out.tfevents.1720875425.8de3af1b431d.6946.1:   0%|          | 0.00/364 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Maximofn/GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM/commit/d107b3bb0e02076483238f9975697761015ec390', commit_message='End of training', commit_description='', oid='d107b3bb0e02076483238f9975697761015ec390', pr_url=None, pr_revision=None, pr_num=None)

Model use

We clean as much as possible

	
		from transformers import DataCollatorForLanguageModeling
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=train_dataset,
          eval_dataset=validation_dataset,
          tokenizer=tokenizer,
          data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
      )
trainer.train()
trainer.evaluate(eval_dataset=test_dataset)
trainer.create_model_card()
trainer.push_to_hub()
import torch
      import gc
      
      
      def clear_hardwares():
          torch.clear_autocast_cache()
          torch.cuda.ipc_collect()
          torch.cuda.empty_cache()
          gc.collect()
      
      
      clear_hardwares()
      clear_hardwares()

Download the model and tokenizer

	
		from transformers import DataCollatorForLanguageModeling
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
trainer.evaluate(eval_dataset=test_dataset)
trainer.create_model_card()
trainer.push_to_hub()
import torch
import gc
def clear_hardwares():
    torch.clear_autocast_cache()
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
    gc.collect()
clear_hardwares()
clear_hardwares()
from transformers import AutoTokenizer, AutoModelForCausalLM
user = "maximofn"
checkpoints = f"{user}/{model_name}"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(checkpoints)
model.config.pad_token_id = model.config.eos_token_id

	
		events.out.tfevents.1720875425.8de3af1b431d.6946.1:   0%|          | 0.00/364 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We check that the tokenizer and the model have the 2 extra tokens we have added.

	
		tokenizer_vocab = tokenizer.get_vocab()
model_vocab = model.config.vocab_size
print(f"tokenizer_vocab: {len(tokenizer_vocab)}. model_vocab: {model_vocab}")

	
		tokenizer_vocab: 50259. model_vocab: 50259

We see that they have 50259 tokens, i.e., the 50257 tokens of GPT2 plus the 2 we added

We create a function to generate jokes

	
		def generate_joke(prompt_text):
          text = f"<SJ> {prompt_text}"
          tokens = tokenizer(text, return_tensors="pt").to(model.device)
          with torch.no_grad():
              output = model.generate(**tokens, max_new_tokens=256, eos_token_id=tokenizer.encode("<EJ>")[-1])
          return tokenizer.decode(output[0], skip_special_tokens=False)

We generate a joke

generate_joke("Why didn't the frog cross the road?")

Setting `pad_token_id` to `eos_token_id`:50258 for open-end generation.

Out[15]:

"<SJ> Why didn't the frog cross the road? Because he was frog-in-the-face. <EJ>"

If you want to test the model further you can see it in Maximofn/GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM

Fine tuning for text classification with Pytorch

We repeat the Pytorch training

Restart the notebook to make sure

Dataset

We downloaded the same dataset as when we did the training with the Hugging Face libraries.

	
		def generate_joke(prompt_text):
          text = f"<SJ> {prompt_text}"
          tokens = tokenizer(text, return_tensors="pt").to(model.device)
          with torch.no_grad():
              output = model.generate(**tokens, max_new_tokens=256, eos_token_id=tokenizer.encode("<EJ>")[-1])
          return tokenizer.decode(output[0], skip_special_tokens=False)
generate_joke("Why didn't the frog cross the road?")
from datasets import load_dataset
      
      dataset = load_dataset("mteb/amazon_reviews_multi", "en")

We create a variable with the number of classes

	
		def generate_joke(prompt_text):
    text = f"<SJ> {prompt_text}"
    tokens = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**tokens, max_new_tokens=256, eos_token_id=tokenizer.encode("<EJ>")[-1])
    return tokenizer.decode(output[0], skip_special_tokens=False)
generate_joke("Why didn't the frog cross the road?")
from datasets import load_dataset
dataset = load_dataset("mteb/amazon_reviews_multi", "en")
num_classes = len(dataset['train'].unique('label'))
num_classes

	
		Setting `pad_token_id` to `eos_token_id`:50258 for open-end generation.
5

Before we processed the whole dataset to create a field called labels, but now it is not necessary because as we are going to program everything ourselves, we adapt to what the dataset looks like.

Tokenizer

We create the tokenizer. We assign the padding token so that it does not give us an error as before.

	
		from transformers import AutoTokenizer
      
      checkpoint = "openai-community/gpt2"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
      tokenizer.pad_token = tokenizer.eos_token

We create a function for tokenizing the dataset

	
		from transformers import AutoTokenizer
      
      checkpoint = "openai-community/gpt2"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
      tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
          return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")

We tokenize it. We eliminate columns that we don't need, but now we leave the text column

	
		from transformers import AutoTokenizer
      
      checkpoint = "openai-community/gpt2"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
      tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
          return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
dataset = dataset.map(tokenize_function, batched=True, remove_columns=['id', 'label_text'])

	
		from transformers import AutoTokenizer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
dataset = dataset.map(tokenize_function, batched=True, remove_columns=['id', 'label_text'])
dataset

	
		DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})

	
		percentage = 1
subset_train = dataset['train'].select(range(int(len(dataset['train']) * percentage)))
percentage = 1
subset_validation = dataset['validation'].select(range(int(len(dataset['validation']) * percentage)))
subset_test = dataset['test'].select(range(int(len(dataset['test']) * percentage)))
print(f"len subset_train: {len(subset_train)}, len subset_validation: {len(subset_validation)}, len subset_test: {len(subset_test)}")

	
		len subset_train: 200000, len subset_validation: 5000, len subset_test: 5000

Model

We import the weights and assign the padding token

	
		from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes)
model.config.pad_token_id = model.config.eos_token_id

	
		Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Device

Create the device where everything is going to be executed

	
		import torch
      
      device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

We pass the model to the device and pass it to FP16 so that it occupies less memory.

	
		import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.half().to(device)
print()

Pytorch Dataset

Create a pytorch dataset

	
		from torch.utils.data import Dataset
      
      class ReviewsDataset(Dataset):
          def __init__(self, huggingface_dataset):
              self.dataset = huggingface_dataset
      
          def __getitem__(self, idx):
              label = self.dataset[idx]['label']
              input_ids = torch.tensor(self.dataset[idx]['input_ids'])
              attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
              return input_ids, attention_mask, label
      
          def __len__(self):
              return len(self.dataset)

Instantiate the datasets

	
		from torch.utils.data import Dataset
      
      class ReviewsDataset(Dataset):
          def __init__(self, huggingface_dataset):
              self.dataset = huggingface_dataset
      
          def __getitem__(self, idx):
              label = self.dataset[idx]['label']
              input_ids = torch.tensor(self.dataset[idx]['input_ids'])
              attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
              return input_ids, attention_mask, label
      
          def __len__(self):
              return len(self.dataset)
train_dataset = ReviewsDataset(subset_train)
      validatation_dataset = ReviewsDataset(subset_validation)
      test_dataset = ReviewsDataset(subset_test)

Let's see a sample

	
		from torch.utils.data import Dataset
class ReviewsDataset(Dataset):
    def __init__(self, huggingface_dataset):
        self.dataset = huggingface_dataset
    def __getitem__(self, idx):
        label = self.dataset[idx]['label']
        input_ids = torch.tensor(self.dataset[idx]['input_ids'])
        attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
        return input_ids, attention_mask, label
    def __len__(self):
        return len(self.dataset)
train_dataset = ReviewsDataset(subset_train)
validatation_dataset = ReviewsDataset(subset_validation)
test_dataset = ReviewsDataset(subset_test)
input_ids, at_mask, label = train_dataset[0]
input_ids.shape, at_mask.shape, label

	
		(torch.Size([768]), torch.Size([768]), 0)

Pytorch Dataloader

We now create a pytorch dataloader

	
		from torch.utils.data import DataLoader
      
      BS = 12
      
      train_loader = DataLoader(train_dataset, batch_size=BS, shuffle=True)
      validation_loader = DataLoader(validatation_dataset, batch_size=BS)
      test_loader = DataLoader(test_dataset, batch_size=BS)

Let's see a sample

	
		from torch.utils.data import DataLoader
BS = 12
train_loader = DataLoader(train_dataset, batch_size=BS, shuffle=True)
validation_loader = DataLoader(validatation_dataset, batch_size=BS)
test_loader = DataLoader(test_dataset, batch_size=BS)
input_ids, at_mask, labels = next(iter(train_loader))
input_ids.shape, at_mask.shape, labels

	
		(torch.Size([12, 768]),
 torch.Size([12, 768]),
 tensor([2, 1, 2, 0, 3, 3, 0, 4, 3, 3, 4, 2]))

To check that everything is OK we pass the sample to the model to see if everything is OK. First we pass the tokens to the device

	
		input_ids = input_ids.to(device)
      at_mask = at_mask.to(device)
      labels = labels.to(device)

Now we pass them to the model

	
		input_ids = input_ids.to(device)
at_mask = at_mask.to(device)
labels = labels.to(device)
output = model(input_ids=input_ids, attention_mask=at_mask, labels=labels)
output.keys()

	
		odict_keys(['loss', 'logits', 'past_key_values'])

As we can see it gives us the loss and the logits

	
		output['loss']

	
		tensor(5.9414, device='cuda:0', dtype=torch.float16,
       grad_fn=<NllLossBackward0>)

	
		output['logits']

	
		tensor([[ 6.1953e+00, -1.2275e+00, -2.4824e+00,  5.8867e+00, -1.4734e+01],
        [ 5.4062e+00, -8.4570e-01, -2.3203e+00,  5.1055e+00, -1.1555e+01],
        [ 6.1641e+00, -9.3066e-01, -2.5664e+00,  6.0039e+00, -1.4570e+01],
        [ 5.2266e+00, -4.2358e-01, -2.0801e+00,  4.7461e+00, -1.1570e+01],
        [ 3.8184e+00, -2.3460e-03, -1.7666e+00,  3.4160e+00, -7.7969e+00],
        [ 4.1641e+00, -4.8169e-01, -1.6914e+00,  3.9941e+00, -8.7734e+00],
        [ 4.6758e+00, -3.0298e-01, -2.1641e+00,  4.1055e+00, -9.3359e+00],
        [ 4.1953e+00, -3.2471e-01, -2.1875e+00,  3.9375e+00, -8.3438e+00],
        [-1.1650e+00,  1.3564e+00, -6.2158e-01, -6.8115e-01,  4.8672e+00],
        [ 4.4961e+00, -8.7891e-02, -2.2793e+00,  4.2812e+00, -9.3359e+00],
        [ 4.9336e+00, -2.6627e-03, -2.1543e+00,  4.3711e+00, -1.0742e+01],
        [ 5.9727e+00, -4.3152e-02, -1.4551e+00,  4.3438e+00, -1.2117e+01]],
       device='cuda:0', dtype=torch.float16, grad_fn=<IndexBackward0>)

Metric

We are going to create a function to obtain the metric, which in this case is going to be the accuracy

	
		def predicted_labels(logits):
          percent = torch.softmax(logits, dim=1)
          predictions = torch.argmax(percent, dim=1)
          return predictions

	
		def predicted_labels(logits):
          percent = torch.softmax(logits, dim=1)
          predictions = torch.argmax(percent, dim=1)
          return predictions
def compute_accuracy(logits, labels):
          predictions = predicted_labels(logits)
          correct = (predictions == labels).float()
          return correct.mean()

Let's see how well it calculates

	
		def predicted_labels(logits):
    percent = torch.softmax(logits, dim=1)
    predictions = torch.argmax(percent, dim=1)
    return predictions
def compute_accuracy(logits, labels):
    predictions = predicted_labels(logits)
    correct = (predictions == labels).float()
    return correct.mean()
compute_accuracy(output['logits'], labels).item()

	
		0.1666666716337204

Optimizer

As we are going to need an optimizer, we create one

	
		from transformers import AdamW
LR = 2e-5
optimizer = AdamW(model.parameters(), lr=LR)

	
		/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:588: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Training

We create the training loop

	
		from tqdm import tqdm
EPOCHS = 3
accuracy = 0
for epoch in range(EPOCHS):
    model.train()
    train_loss = 0
    progresbar = tqdm(train_loader, total=len(train_loader), desc=f'Epoch {epoch + 1}')
    for input_ids, at_mask, labels in progresbar:
        input_ids = input_ids.to(device)
        at_mask = at_mask.to(device)
        label = labels.to(device)
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=label)
        loss = output['loss']
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        progresbar.set_postfix({'train_loss': loss.item()})
    train_loss /= len(train_loader)
    progresbar.set_postfix({'train_loss': train_loss})
    model.eval()
    valid_loss = 0
    progresbar = tqdm(validation_loader, total=len(validation_loader), desc=f'Epoch {epoch + 1}')
    for input_ids, at_mask, labels in progresbar:
        input_ids = input_ids.to(device)
        at_mask = at_mask.to(device)
        labels = labels.to(device)
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=labels)
        loss = output['loss']
        valid_loss += loss.item()
        step_accuracy = compute_accuracy(output['logits'], labels)
        accuracy += step_accuracy
        progresbar.set_postfix({'valid_loss': loss.item(), 'accuracy': step_accuracy.item()})
    valid_loss /= len(validation_loader)
    accuracy /= len(validation_loader)
    progresbar.set_postfix({'valid_loss': valid_loss, 'accuracy': accuracy})

	
		Epoch 1: 100%|██████████| 16667/16667 [44:13<00:00,  6.28it/s, train_loss=nan]
Epoch 1: 100%|██████████| 417/417 [00:32<00:00, 12.72it/s, valid_loss=nan, accuracy=0]
Epoch 2: 100%|██████████| 16667/16667 [44:06<00:00,  6.30it/s, train_loss=nan]
Epoch 2: 100%|██████████| 417/417 [00:32<00:00, 12.77it/s, valid_loss=nan, accuracy=0]
Epoch 3: 100%|██████████| 16667/16667 [44:03<00:00,  6.30it/s, train_loss=nan]
Epoch 3: 100%|██████████| 417/417 [00:32<00:00, 12.86it/s, valid_loss=nan, accuracy=0]

Model use

Let's test the model we have trained

First we tokenize a text

	
		input_tokens = tokenize_function({"text": "I love this product. It is amazing."})
input_tokens['input_ids'].shape, input_tokens['attention_mask'].shape

	
		(torch.Size([1, 768]), torch.Size([1, 768]))

Now we pass it to the model

	
		output = model(input_ids=input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
output['logits']

	
		tensor([[nan, nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16,
       grad_fn=<IndexBackward0>)

We see the predictions of these logits

	
		predicted = predicted_labels(output['logits'])
predicted

	
		tensor([0], device='cuda:0')

Fine tuning for text generation with Pytorch

We repeat the Pytorch training

Restart the notebook to make sure

Dataset

We download again the dataset of jokes

	
		from datasets import load_dataset
jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes

	
		DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})

Create a subset in case you are short on memory

	
		percent_of_train_dataset = 1 # If you want 50% of the dataset, set this to 0.5
subset_dataset = jokes["train"].select(range(int(len(jokes["train"]) * percent_of_train_dataset)))
subset_dataset

	
		Dataset({
    features: ['ID', 'Joke'],
    num_rows: 231657
})

We divide the dataset into subsets for training, validation and test.

	
		percent_of_train_dataset = 0.90
split_dataset = subset_dataset.train_test_split(train_size=int(subset_dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)
train_dataset = split_dataset["train"]
validation_test_dataset = split_dataset["test"]
split_dataset = validation_test_dataset.train_test_split(train_size=int(validation_test_dataset.num_rows * 0.5), seed=19, shuffle=False)
validation_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]
print(f"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(validation_dataset)}. Size of the test set: {len(test_dataset)}")

	
		Size of the train set: 208491. Size of the validation set: 11583. Size of the test set: 11583

Tokenizer

We start the tokenizer and assign the end of string token to the padding token

	
		from transformers import AutoTokenizer
      
      checkpoints = "openai-community/gpt2"
      
      tokenizer = AutoTokenizer.from_pretrained(checkpoints)
      tokenizer.pad_token = tokenizer.eos_token
      tokenizer.padding_side = "right"

We add the special tokens for the beginning and end of the joke.

	
		from transformers import AutoTokenizer
checkpoints = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
new_tokens = ['<SJ>', '<EJ>'] # Start and end of joke tokens
num_added_tokens = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added_tokens} tokens")

	
		Added 2 tokens

We add them to the dataset

	
		joke_column = "Joke"
def format_joke(example):
    example[joke_column] = '<SJ> ' + example['Joke'] + ' <EJ>'
    return example
remove_columns = [column for column in train_dataset.column_names if column != joke_column]
train_dataset = train_dataset.map(format_joke, remove_columns=remove_columns)
validation_dataset = validation_dataset.map(format_joke, remove_columns=remove_columns)
test_dataset = test_dataset.map(format_joke, remove_columns=remove_columns)
train_dataset, validation_dataset, test_dataset

	
		(Dataset({
     features: ['Joke'],
     num_rows: 208491
 }),
 Dataset({
     features: ['Joke'],
     num_rows: 11583
 }),
 Dataset({
     features: ['Joke'],
     num_rows: 11583
 }))

We tokenize the dataset

	
		def tokenize_function(examples):
    return tokenizer(examples[joke_column], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
validation_dataset = validation_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
train_dataset, validation_dataset, test_dataset

	
		(Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 208491
 }),
 Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 11583
 }),
 Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 11583
 }))

Model

Instantiate the model, assign the padding token and add the new joke start and joke end tokens.

	
		from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(checkpoints)
model.config.pad_token_id = model.config.eos_token_id
model.resize_token_embeddings(len(tokenizer))

	
		Embedding(50259, 768)

Device

We create the device and pass the model to the device

	
		import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.half().to(device)
print()

Pytorch Dataset

Create a pytorch dataset

	
		from torch.utils.data import Dataset
      
      class JokesDataset(Dataset):
          def __init__(self, huggingface_dataset):
              self.dataset = huggingface_dataset
      
          def __getitem__(self, idx):
              input_ids = torch.tensor(self.dataset[idx]['input_ids'])
              attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
              return input_ids, attention_mask
      
          def __len__(self):
              return len(self.dataset)

Instantiate training, validation and test datasets

	
		from torch.utils.data import Dataset
      
      class JokesDataset(Dataset):
          def __init__(self, huggingface_dataset):
              self.dataset = huggingface_dataset
      
          def __getitem__(self, idx):
              input_ids = torch.tensor(self.dataset[idx]['input_ids'])
              attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
              return input_ids, attention_mask
      
          def __len__(self):
              return len(self.dataset)
train_pytorch_dataset = JokesDataset(train_dataset)
      validation_pytorch_dataset = JokesDataset(validation_dataset)
      test_pytorch_dataset = JokesDataset(test_dataset)

Let's see a sample

	
		from torch.utils.data import Dataset
class JokesDataset(Dataset):
    def __init__(self, huggingface_dataset):
        self.dataset = huggingface_dataset
    def __getitem__(self, idx):
        input_ids = torch.tensor(self.dataset[idx]['input_ids'])
        attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
        return input_ids, attention_mask
    def __len__(self):
        return len(self.dataset)
train_pytorch_dataset = JokesDataset(train_dataset)
validation_pytorch_dataset = JokesDataset(validation_dataset)
test_pytorch_dataset = JokesDataset(test_dataset)
input_ids, attention_mask = train_pytorch_dataset[0]
input_ids.shape, attention_mask.shape

	
		(torch.Size([768]), torch.Size([768]))

Pytorch Dataloader

We create the dataloaders

	
		from torch.utils.data import DataLoader
      
      BS = 28
      
      train_loader = DataLoader(train_pytorch_dataset, batch_size=BS, shuffle=True)
      validation_loader = DataLoader(validation_pytorch_dataset, batch_size=BS)
      test_loader = DataLoader(test_pytorch_dataset, batch_size=BS)

We see a sample

	
		from torch.utils.data import DataLoader
BS = 28
train_loader = DataLoader(train_pytorch_dataset, batch_size=BS, shuffle=True)
validation_loader = DataLoader(validation_pytorch_dataset, batch_size=BS)
test_loader = DataLoader(test_pytorch_dataset, batch_size=BS)
input_ids, attention_mask = next(iter(train_loader))
input_ids.shape, attention_mask.shape

	
		(torch.Size([28, 768]), torch.Size([28, 768]))

We pass it to the model

	
		output = model(input_ids.to(device), attention_mask=attention_mask.to(device))
output.keys()

	
		odict_keys(['logits', 'past_key_values'])

As we can see we have no loss value, as we have seen we have to pass the input_ids and the labels.

	
		output = model(input_ids.to(device), attention_mask=attention_mask.to(device), labels=input_ids.to(device))
output.keys()

	
		odict_keys(['loss', 'logits', 'past_key_values'])

Now we have loss.

	
		output['loss'].item()

	
		80.5625

Optimizer

We create an optimizer

	
		from transformers import AdamW
LR = 2e-5
optimizer = AdamW(model.parameters(), lr=5e-5)

	
		/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:588: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Training

We create the training loop

	
		from tqdm import tqdm
EPOCHS = 3
for epoch in range(EPOCHS):
    model.train()
    train_loss = 0
    progresbar = tqdm(train_loader, total=len(train_loader), desc=f'Epoch {epoch + 1}')
    for input_ids, at_mask in progresbar:
        input_ids = input_ids.to(device)
        at_mask = at_mask.to(device)
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=input_ids)
        loss = output['loss']
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        progresbar.set_postfix({'train_loss': loss.item()})
    train_loss /= len(train_loader)
    progresbar.set_postfix({'train_loss': train_loss})

	
		Epoch 1: 100%|██████████| 7447/7447 [51:07<00:00,  2.43it/s, train_loss=nan]
Epoch 2: 100%|██████████| 7447/7447 [51:06<00:00,  2.43it/s, train_loss=nan]
Epoch 3: 100%|██████████| 7447/7447 [51:07<00:00,  2.43it/s, train_loss=nan]

Model use

We tested the model

	
		def generate_text(decoded_joke, max_new_tokens=100, stop_token='<EJ>', top_k=0, temperature=1.0):
          input_tokens = tokenize_function({'Joke': decoded_joke})
          output = model(input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
          nex_token = torch.argmax(output['logits'][:, -1, :], dim=-1).item()
          nex_token_decoded = tokenizer.decode(nex_token)
          decoded_joke = decoded_joke + nex_token_decoded
          for _ in range(max_new_tokens):
              nex_token = torch.argmax(output['logits'][:, -1, :], dim=-1).item()
              nex_token_decoded = tokenizer.decode(nex_token)
              if nex_token_decoded == stop_token:
                  break
              decoded_joke = decoded_joke + nex_token_decoded
              input_tokens = tokenize_function({'Joke': decoded_joke})
              output = model(input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
          return decoded_joke

	
		def generate_text(decoded_joke, max_new_tokens=100, stop_token='<EJ>', top_k=0, temperature=1.0):
    input_tokens = tokenize_function({'Joke': decoded_joke})
    output = model(input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
    nex_token = torch.argmax(output['logits'][:, -1, :], dim=-1).item()
    nex_token_decoded = tokenizer.decode(nex_token)
    decoded_joke = decoded_joke + nex_token_decoded
    for _ in range(max_new_tokens):
        nex_token = torch.argmax(output['logits'][:, -1, :], dim=-1).item()
        nex_token_decoded = tokenizer.decode(nex_token)
        if nex_token_decoded == stop_token:
            break
        decoded_joke = decoded_joke + nex_token_decoded
        input_tokens = tokenize_function({'Joke': decoded_joke})
        output = model(input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
    return decoded_joke
generated_text = generate_text("<SJ> Why didn't the frog cross the road")
generated_text

	
		"<SJ> Why didn't the frog cross the road!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

Continue reading

Agents patterns

Are your agents falling short? Elevate your AI projects with advanced patterns: ReAct, planning, multi-agents, and more. Practical guide with code!

LangGraph: Revolutionize your AI agents

🚀 Revolutionize your AI agents! 🧠 LangGraph is not just another library, it's the orchestration framework that gives you total control to build complex agents, with long-term memory and even human intervention! Say goodbye to basic chatbots, it's time to create true intelligence. Dive into this post and discover it!

Create virtual environments with uv

Learn how to create virtual environments with uv, a package manager and environment for Python written in Rust, which makes it very fast. If you have had problems with the waiting times using conda, or want a faster and easier alternative to venv, enter and see how to use uv.

Last posts -->

Have you seen these projects?

Horeca chatbot

Naviground

Subtify

View all projects -->

Do you want to apply AI in your project? Contact me!

Do you want to improve with these tips?

Memory profiler

See the memory usage of a script

DataLoader with pin_memory and num_workers

Increase DataLoader performance with pin_memory and num_workers

py-smi

Python library to get GPU data like `nvidia-smi`

Last tips -->

Use this locally

Hugging Face spaces allow us to run models with very simple demos, but what if the demo breaks? Or if the user deletes it? That's why I've created docker containers with some interesting spaces, to be able to use them locally, whatever happens. In fact, if you click on any project view button, it may take you to a space that doesn't work.

Flow edit

FLUX.1-RealismLora

token_hmr

View all containers -->

Do you want to apply AI in your project? Contact me!

Do you want to train your model with these datasets?

short-jokes-dataset

Dataset with jokes in English

opus100

Dataset with translations from English to Spanish

netflix_titles

Dataset with Netflix movies and series

View more datasets -->

Fine tuning SMLs with Hugging Face

Fine tuning for text classification with Hugging Face

Login

Dataset

Tokenizer

Model

Trainer

Evaluation

Publish the model

Model use

Fine tuning for text generation with Hugging Face

Login

Dataset

Tokenizer

Model

Training

Evaluation

Publish the model

Model use

Fine tuning for text classification with Pytorch

Dataset

Tokenizer

Model

Device

Pytorch Dataset

Pytorch Dataloader

Metric

Optimizer

Training

Model use

Fine tuning for text generation with Pytorch

Dataset

Tokenizer

Model

Device

Pytorch Dataset

Pytorch Dataloader

Optimizer

Training

Model use

Continue reading

Agents patterns

LangGraph: Revolutionize your AI agents

Create virtual environments with uv

Have you seen these projects?

Do you want to apply AI in your project? Contact me!

Do you want to improve with these tips?

Memory profiler

DataLoader with pin_memory and num_workers

py-smi

Use this locally

Do you want to apply AI in your project? Contact me!

Do you want to train your model with these datasets?

short-jokes-dataset

opus100

netflix_titles