Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.
Paper
Improving Language Understanding by Generative Pre-Training is the GPT1 paper. Before reading the post, it is worth setting the context: before GPT, language models were based on recurrent neural networks (RNNs), which worked reasonably well for specific tasks, but their pre-training could not be reused to fine-tune them for other tasks. In addition, they had limited memory, so with very long sentences they did not remember the beginning of the sentence very well.
Architecture
Before discussing the architecture of GPT-1, let's recall how the architecture of transformers worked.

GPT1 is a model based on the transformer decoder, so, since there is no encoder, the architecture with a single decoder looks like this:

The cross-attention mechanism between the encoder and the decoder is removed.
In the GPT-1 paper, they propose the following architecture

This corresponds to the decoder of a transformer, as we saw before, stacked 12 times.
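To make this more concrete, below is a minimal sketch (my own illustration, not the original implementation) of a single decoder block of this kind in PyTorch: masked self-attention followed by a feed-forward network, with no cross-attention because there is no encoder. The dimensions assumed (768 hidden units, 12 heads, 3072 feed-forward units) follow the configuration reported for GPT-1, and the full model stacks 12 of these blocks.

import torch
import torch.nn as nn

class GPT1DecoderBlock(nn.Module):
    # Sketch of a decoder-only transformer block (assumed sizes: 768 hidden, 12 heads, 3072 FFN)
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position can only attend to itself and previous positions
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + attn_out)     # residual connection + layer norm (post-norm, as in the original transformer)
        x = self.ln2(x + self.mlp(x))  # feed-forward sub-layer with its own residual + layer norm
        return x

# The full model applies token + position embeddings and then stacks 12 of these blocks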
Summary of the paper
The most interesting ideas from the paper are:
- The model is first trained on a large corpus of text without supervision, which produces a high-capacity language model.
- Fine-tuning is then performed on supervised NLP tasks with labeled datasets. In addition, during this supervised fine-tuning the model is trained not only on the task objective, but also on how well it predicts the next token; this auxiliary language-modeling objective improves the generalization of the supervised model and makes it converge faster (a sketch of this combined objective is shown after this list).
- Although we have already mentioned it, the paper emphasizes that the transformer architecture is used, since until then language models were built with RNNs. This brings an improvement: what is learned during the first, unsupervised training on the text corpus is easier to transfer to supervised tasks. In other words, thanks to the use of transformers it became possible to train on an entire text corpus and then fine-tune on supervised tasks.
- They evaluated the model on four types of language understanding tasks:
- Natural language inference
- Question answering
- Semantic similarity
- Text classification
- The general model (the one trained on the entire text corpus without supervision) outperforms discriminatively trained RNN models that use architectures specifically designed for each task, significantly improving the state of the art in 9 of the 12 tasks studied. They also analyze the zero-shot behavior of the pre-trained model in four different settings and show that it acquires linguistic knowledge that is useful for downstream tasks.
- In recent years, researchers had demonstrated the benefits of using embeddings, which are trained on unlabeled corpora, to improve performance across a variety of tasks. However, these approaches primarily transfer information at the word level, while the use of transformers trained on large unsupervised text corpora captures higher-level semantics at the phrase level.
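To illustrate the auxiliary objective mentioned above, below is a minimal sketch (with hypothetical function and argument names, not code from the paper) of how the supervised task loss can be combined with the language-modeling loss during fine-tuning. The paper weights the auxiliary term with a coefficient of 0.5.

import torch.nn.functional as F

def fine_tuning_loss(task_logits, task_labels, lm_logits, lm_labels, lm_coef=0.5):
    # Hypothetical sketch: supervised task loss plus the language-modeling loss as an auxiliary objective
    task_loss = F.cross_entropy(task_logits, task_labels)
    # Shift so that tokens < n predict token n, as in causal language modeling
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = lm_labels[..., 1:].contiguous()
    lm_loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    # Total loss = supervised loss + lambda * language-modeling loss
    return task_loss + lm_coef * lm_loss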
Text Generation
Let's see how to generate text with a pretrained GPT1.
First, you need to install ftfy and spacy:
pip install ftfy spacy
Once installed, you should download the language model of spacy that you want to use. For example, to download the English model, you can run:
python -m spacy download en_core_web_sm
To generate text, we are going to use the model from the GPT1 repository of Hugging Face.
We import the libraries
import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, AutoTokenizer
If you notice, we have imported both OpenAIGPTTokenizer and AutoTokenizer. This is because the GPT1 model card indicates that OpenAIGPTTokenizer should be used, but in the post about the transformers library we explain that AutoTokenizer should be used to load the tokenizer. So let's try both.
ckeckpoints = "openai-community/openai-gpt"
tokenizer = OpenAIGPTTokenizer.from_pretrained(ckeckpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)

input_tokens = tokenizer("Hello, my dog is cute and", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, my dog is cute and", return_tensors="pt")

print(f"input tokens: {input_tokens}")
print(f"input auto tokens: {input_auto_tokens}")
input tokens: {'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
input auto tokens: {'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
As can be seen, both tokenizers produce the same tokens. So, to make the code more general, so that it does not need to change if the checkpoint changes, we will use AutoTokenizer.
We then create the device, the tokenizer, and the model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(ckeckpoints).to(device)
Now that we have instantiated the model, let's see how many parameters it has.
params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")
Number of parameters: 117M
In the era of trillions of parameters, we can see that GPT1 only had 117 million parameters.
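As a rough sanity check of that figure, here is a back-of-the-envelope sketch (my own approximation, assuming the GPT-1 configuration of 12 layers, hidden size 768, feed-forward size 3072, a vocabulary of 40478 tokens, 512 positions, and an output head tied to the token embeddings) of where those 117M parameters come from.

vocab_size, n_positions = 40478, 512
d_model, d_ff, n_layers = 768, 3072, 12

embeddings = vocab_size * d_model + n_positions * d_model           # token + position embeddings
attention_per_layer = 4 * (d_model * d_model + d_model)             # Q, K, V and output projections (with biases)
mlp_per_layer = d_model * d_ff + d_ff + d_ff * d_model + d_model    # the two linear layers of the feed-forward block
layernorms_per_layer = 2 * 2 * d_model                              # two layer norms (weight and bias each)
per_layer = attention_per_layer + mlp_per_layer + layernorms_per_layer

total = embeddings + n_layers * per_layer
print(f"~{round(total / 1e6)}M parameters")  # roughly 117M; the output head adds nothing because it shares weights with the embeddings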
We create the input tokens for the model
input_sentence = "Hello, my dog is cute and"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)

input_tokens
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
We pass them to the model to generate the output tokens
output_tokens = model.generate(**input_tokens)

print(f"output tokens: {output_tokens}")
output tokens:tensor([[ 3570, 240, 547, 2585, 544, 4957, 488, 249, 719, 797,485, 921, 575, 562, 246, 1671, 239, 244, 40477, 244]],device='cuda:0')
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/generation/utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.warnings.warn(
We decode the tokens to obtain the output sentence
decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(f"decoded output: {decoded_output}")
decoded output:hello, my dog is cute and i'm going to take him for a walk. ""
We have already managed to generate text with GPT1
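By the way, the warning above recommends setting max_new_tokens instead of relying on the default max_length. A minimal sketch of the same call with an explicit generation length (the value 30 is arbitrary):

longer_output_tokens = model.generate(**input_tokens, max_new_tokens=30)
print(tokenizer.decode(longer_output_tokens[0], skip_special_tokens=True))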
Generate text token by token
Greedy search
We have used model.generate to generate the output tokens all at once, but now let's see how to generate them one by one. To do this, instead of using model.generate, we will call model directly, which actually calls the model.forward method.
outputs = model(**input_tokens)

outputs
CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],
[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],
[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],
...,
[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],
[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],
[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)
We see that it outputs a lot of data; first, let's look at the keys of the output.
outputs.keys()
odict_keys(['logits'])
In this case we only have the logits of the model; let's check their size.
logits = outputs.logits

logits.shape
torch.Size([1, 7, 40478])
Let's see how many tokens we had at the input
input_tokens.input_ids.shape
torch.Size([1, 7])
Well, at the output we have as many sets of logits as tokens at the input. This is normal: the model produces a prediction over the vocabulary for every input position.
We obtain the logits from the last position of the output
nex_token_logits = logits[0,-1]

nex_token_logits.shape
torch.Size([40478])
There are a total of 40478 logits, meaning the vocabulary has 40478 tokens, and we need to determine which token has the highest probability. To do this, we first compute the softmax.
softmax_logits = torch.softmax(nex_token_logits, dim=0)

softmax_logits.shape
torch.Size([40478])
next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)

next_token_prob, next_token_id
(tensor(0.1898, device='cuda:0', grad_fn=<MaxBackward0>),tensor(249, device='cuda:0'))
We have obtained the next token; now we decode it.
tokenizer.decode(next_token_id.item())
'i'
We obtained the next token using the greedy method, that is, the token with the highest probability. But we already saw, in the post about the transformers library, other ways to generate text such as sampling, top-k, top-p, etc.
Let's put everything into a function and see what comes out if we generate a few tokens
def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    nex_token_logits = logits[0,-1]
    softmax_logits = torch.softmax(nex_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id
def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text
Now we generate text
generate_greedy_text("Hello, my dog is cute and", tokenizer, model, device)
'Hello, my dog is cute andi." '
The output is quite repetitive, as we already saw in the post on ways of generating text.
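One way to reduce that repetition, as discussed in that post, is to sample instead of always picking the most probable token. A minimal sketch using the sampling arguments of model.generate (the values chosen for top_k, top_p and temperature are arbitrary):

sampled_tokens = model.generate(
    **input_tokens,
    do_sample=True,      # sample from the distribution instead of taking the greedy argmax
    top_k=50,            # keep only the 50 most probable tokens
    top_p=0.95,          # nucleus sampling threshold
    temperature=0.8,     # sharpen or soften the distribution
    max_new_tokens=20,
)
print(tokenizer.decode(sampled_tokens[0], skip_special_tokens=True))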
Fine tuning GPT
Calculation of the Loss
Before starting to fine-tune GPT1, let's look at something. Previously, when we obtained the model's output, we did this
outputs = model(**input_tokens)

outputs
CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],
[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],
[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],
...,
[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],
[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],
[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)
It can be seen that we get loss=None
print(outputs.loss)
None
Since we will need the loss to perform fine-tuning, let's see how to obtain it.
If we go to the documentation of the forward method of OpenAIGPTLMHeadModel, we can see that it says the output is an object of type transformers.modeling_outputs.CausalLMOutput. And if we go to the documentation of transformers.modeling_outputs.CausalLMOutput, we can see that loss is returned only when labels are passed to the forward method.
If we go to the source code of the forward method, we see this block of code
loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
That is, the loss is calculated as follows:
- Shift logits and labels: first, the logits (lm_logits) and the labels (labels) are shifted so that tokens < n predict token n, meaning that at position n the next token is predicted from the previous ones.
- CrossEntropyLoss: an instance of the CrossEntropyLoss() loss function is created.
- Flatten tokens: next, the logits and labels are flattened using view(-1, shift_logits.size(-1)) and view(-1), respectively. This is done so that the logits and labels have compatible shapes for the loss function.
- Loss calculation: finally, the loss is computed by passing the flattened logits and flattened labels to CrossEntropyLoss().
In summary, the loss is calculated as the cross-entropy between the shifted, flattened logits and the shifted, flattened labels.
Therefore, if we pass the labels to the forward method, it will return the loss.
outputs = model(**input_tokens, labels=input_tokens.input_ids)

outputs.loss
tensor(4.2607, device='cuda:0', grad_fn=<NllLossBackward0>)
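To convince ourselves that this is exactly what happens, here is a small sketch that reproduces the loss by hand from the logits, following the same shift-and-flatten steps as the source code above; it should give the same value as outputs.loss.

import torch.nn.functional as F

logits = outputs.logits
labels = input_tokens.input_ids

# Shift so that tokens < n predict token n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

# Flatten and apply cross-entropy, as in the forward method
manual_loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
print(manual_loss)  # should match outputs.loss (4.2607 in this case)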
Dataset
For training, we are going to use short-jokes-dataset, a dataset with 231 thousand jokes in English.
We download the dataset
from datasets import load_dataset

jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes
DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})
Let's take a look at it.
jokes["train"][0]
{'ID': 1,'Joke': '[me narrating a documentary about narrators] "I can't hear what they're saying cuz I'm talking"'}
Training with Pytorch
First let's see how the training would be done with pure Pytorch
We restart the notebook to avoid issues with GPU memory.
import torch
from transformers import OpenAIGPTLMHeadModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ckeckpoints = "openai-community/openai-gpt"
tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(ckeckpoints)
model = model.to(device)
Pytorch dataset
We create a Dataset class in Pytorch
from torch.utils.data import Dataset

class JokesDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset["train"])

    def __getitem__(self, item):
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens
We instantiate it
dataset = JokesDataset(jokes, tokenizer=tokenizer)
We see an example
sentence, tokens = dataset[5]
print(sentence)

tokens.input_ids.shape, tokens.attention_mask.shape
JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>
(torch.Size([1, 30]), torch.Size([1, 30]))
Dataloader
We now create a Pytorch dataloader
from torch.utils.data import DataLoader

BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)
We see a batch
sentences, tokens = next(iter(joke_dataloader))

len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape
(1, torch.Size([1, 1, 29]), torch.Size([1, 1, 29]))
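Each sample keeps the extra batch dimension added by the tokenizer and has a different length, which is why batch_size=1 is used here and the training loop below packs several jokes into a single sequence. As an alternative (a sketch of my own, not what is done in this post), a collate function could pad each batch to its longest joke; 0 is assumed as the padding value since the GPT-1 tokenizer does not define a padding token.

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_jokes(batch):
    # batch is a list of (sentence, tokens) pairs returned by JokesDataset
    input_ids = [tokens.input_ids[0] for _, tokens in batch]            # drop the extra batch dimension
    attention_mask = [tokens.attention_mask[0] for _, tokens in batch]
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=0)
    attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)
    # Note: if the padded input_ids were also used as labels, the padded positions
    # should be set to -100 so that the loss ignores them
    return input_ids, attention_mask

# padded_dataloader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate_jokes)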
Training
from transformers import AdamW, get_linear_schedule_with_warmup
import tqdm

BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)
proc_seq_count = 0
batch_count = 0
tmp_jokes_tens = None

for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")
    for sample in progress_bar:
        sentence, tokens = sample

        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)

        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue

        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke
            # as the start for next sequence
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:, 1:]], dim=1)
                continue
        ################## Sequence ready, process it through the model ##################

        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()

        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            model.zero_grad()

        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})

        if batch_count == 10:
            batch_count = 0
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warningwarnings.warn(
EPOCH 0 started==============================
Training: 100%|██████████| 231657/231657 [11:31<00:00, 334.88it/s, loss=2.88, lr=2.93e-6]
EPOCH 1 started==============================
Training: 100%|██████████| 231657/231657 [11:30<00:00, 335.27it/s, loss=2.49, lr=5.87e-6]
EPOCH 2 started==============================
Training: 100%|██████████| 231657/231657 [11:17<00:00, 341.75it/s, loss=2.57, lr=8.81e-6]
EPOCH 3 started==============================
Training: 100%|██████████| 231657/231657 [11:18<00:00, 341.27it/s, loss=2.41, lr=1.18e-5]
EPOCH 4 started==============================
Training: 100%|██████████| 231657/231657 [11:19<00:00, 341.04it/s, loss=2.49, lr=1.47e-5]
Inference
Let's see how well the model tells jokes
sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")
decoded joke:joke : what do you call a group of people who are not afraid of the dark? a group
You can see that if you pass it a sequence starting with the word joke, it returns a joke; but if you pass it a different sequence, it does not.
sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")
decoded joke:my dog is cute and i'm not sure if i should be offended or not. "