GPT1 – Improving Language Understanding by Generative Pre-Training

Paper

This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

Improving Language Understanding by Generative Pre-Training is the GPT1 paper. Before reading this post, some context: before GPT, language models were based on recurrent neural networks (RNNs), which worked reasonably well for specific tasks, but whose pre-training could not be reused to fine-tune them for other tasks. In addition, they did not have much memory, so if you fed them very long sentences they did not remember the beginning of the sentence very well.

Architecture

Before we talk about the architecture of GPT1, let's remember what the architecture of the Transformers was like.

transformer architecture

GPT1 is a model based on transformer decoders, so, since there is no encoder, the architecture of a single decoder is as follows

decoder architecture

The cross-attention mechanism between the encoder and the decoder is removed.

In the GPT1 paper they propose the following architecture

gpt1 architecture

This corresponds to the decoder of a transformer, as we have just seen, repeated 12 times.
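
If you want to check these numbers yourself, the configuration of the pre-trained checkpoint in Hugging Face exposes them. A quick sketch, assuming you have the `transformers` library installed and internet access to download the configuration:

```
from transformers import OpenAIGPTConfig

# Configuration of the pre-trained GPT1 checkpoint
config = OpenAIGPTConfig.from_pretrained("openai-community/openai-gpt")

# 12 decoder blocks, 12 attention heads and 768-dimensional embeddings
print(config.n_layer, config.n_head, config.n_embd)   # 12 12 768
```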

Paper abstract

The most interesting ideas in the paper are:

  • The model is first trained on a large corpus of unlabeled text. This is used to build a high-capacity language model.
  • Fine-tuning is then performed on supervised NLP tasks with labeled datasets. In addition, during fine-tuning the model is not only trained on the supervised task, but also keeps predicting the next token as an auxiliary objective (see the sketch after this list); this improves the generalization of the supervised model and makes it converge faster.
  • Although we have already mentioned it, the paper highlights the use of the transformer architecture, since until then RNNs were used for language models. This makes what is learned in the first training (on the unlabeled text corpus) easier to transfer to supervised tasks. In other words, thanks to transformers it was possible to train on a whole corpus of text and then fine-tune on supervised tasks.
  • They evaluated the model on four types of language understanding tasks:
    • Natural language inference
    • Question answering
    • Semantic similarity
    • Text classification
  • The general model (the one trained on the entire unlabeled text corpus) outperforms discriminatively trained RNN models that use architectures designed specifically for each task, significantly improving the state of the art in 9 of the 12 tasks studied. They also analyze the "zero-shot" behavior of the pre-trained model in four different settings and show that it acquires linguistic knowledge that is useful for downstream tasks.
  • In previous years, researchers had demonstrated the benefits of using embeddings trained on unlabeled corpora to improve performance on a variety of tasks. However, those approaches mainly transfer information at the word level, whereas transformers trained on large unlabeled text corpora capture higher-level, sentence-level semantics.
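
As a reference for the auxiliary objective mentioned in the list above, here is a minimal sketch (not the paper's code) of how the supervised task loss and the language modeling loss are combined during fine-tuning; the paper weights the language modeling term with a coefficient λ (0.5 in their experiments):

```
import torch

def gpt1_finetuning_loss(task_loss: torch.Tensor,
                         lm_loss: torch.Tensor,
                         lm_coef: float = 0.5) -> torch.Tensor:
    # Supervised task loss plus the auxiliary language-modeling loss,
    # i.e. L3 = L2 + lambda * L1 in the notation of the paper
    return task_loss + lm_coef * lm_loss
```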

Text generation

Let's see how to generate text with a pre-trained GPT1

First you have to install ftfy and spacy via

```
pip install ftfy spacy
```

Once installed, you must download the spacy language model you want to use. For example, to download the English model, you can run:

```
python -m spacy download en_core_web_sm
```
      

To generate text we will use the model from the GPT1 repository of Hugging Face.

We import the libraries

import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, AutoTokenizer

As you may have noticed, we imported both OpenAIGPTTokenizer and AutoTokenizer. This is because the GPT1 model card says to use OpenAIGPTTokenizer, but in the post on the transformers library we explained that you should use AutoTokenizer to load the tokenizer. So let's try both.

checkpoints = "openai-community/openai-gpt"
tokenizer = OpenAIGPTTokenizer.from_pretrained(checkpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(checkpoints)

input_tokens = tokenizer("Hello, my dog is cute and", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, my dog is cute and", return_tensors="pt")

print(f"input tokens: {input_tokens}")
print(f"input auto tokens: {input_auto_tokens}")

input tokens:
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
input auto tokens:
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

As you can see, both tokenizers produce the same tokens. So, to make the code more general, so that if you change the checkpoints you don't have to change the code, let's use AutoTokenizer.

We then create the device, the tokenizer and the model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(checkpoints).to(device)

Now that we have instantiated the model, let's see how many parameters it has

params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")

Number of parameters: 117M

At a time of models with billions of parameters, we can see that GPT1 had only 117 million parameters.

We create the input tokens for the model

input_sentence = "Hello, my dog is cute and"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
input_tokens

{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

We pass them to the model to generate the output tokens.

output_tokens = model.generate(**input_tokens)
print(f"output tokens: \n{output_tokens}")

output tokens:
tensor([[ 3570,   240,   547,  2585,   544,  4957,   488,   249,   719,   797,
           485,   921,   575,   562,   246,  1671,   239,   244, 40477,   244]],
       device='cuda:0')

/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/generation/utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
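
The warning recommends controlling the length of the generation explicitly. A minimal way to do so (a sketch; `max_new_tokens` is a standard argument of `generate`) would be:

```
# Generate up to 40 new tokens instead of relying on the default max_length
output_tokens = model.generate(**input_tokens, max_new_tokens=40)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```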

We decode the tokens to obtain the output sentence

output_tokens = model.generate(**input_tokens)
print(f"output tokens: {output_tokens}")

decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(f"decoded output: {decoded_output}")

output tokens:
tensor([[ 3570, 240, 547, 2585, 544, 4957, 488, 249, 719, 797,
485, 921, 575, 562, 246, 1671, 239, 244, 40477, 244]],
device='cuda:0')
decoded output:
hello, my dog is cute and i'm going to take him for a walk. "
"

We have already succeeded in generating text with GPT1

Generate text token by token

We have used model.generate to generate the output tokens all at once, but let's see how to generate them one by one. To do this, instead of using model.generate we are going to use model, which actually calls the model.forward method.

outputs = model(**input_tokens)
outputs

CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],
[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],
[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],
...,
[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],
[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],
[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)

We see that it returns a lot of data, so first let's look at the output keys

outputs.keys()

odict_keys(['logits'])

In this case we only have the logits of the model; let's see their shape

logits = outputs.logits
logits.shape

torch.Size([1, 7, 40478])

Let's see how many tokens we had at the input.

input_tokens.input_ids.shape

torch.Size([1, 7])

At the output we have the same number of logit vectors as tokens at the input. This is normal, since the model produces one vector of logits for each input position.

We take the logits of the last position of the output

next_token_logits = logits[0, -1]
next_token_logits.shape

torch.Size([40478])

There are a total of 40478 logits, i.e. the vocabulary has 40478 tokens, and we have to see which token has the highest probability. To do this, we first compute the softmax

softmax_logits = torch.softmax(next_token_logits, dim=0)
softmax_logits.shape

torch.Size([40478])

next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
next_token_prob, next_token_id

(tensor(0.1898, device='cuda:0', grad_fn=<MaxBackward0>),
tensor(249, device='cuda:0'))

We have obtained the next token; now we decode it

tokenizer.decode(next_token_id.item())

'i'

We have obtained the next token using the greedy method, i.e. the token with the highest probability. But, as we already saw in the post on the transformers library, text can also be generated with sampling, top-k, top-p, etc.
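
For example, instead of always taking the token with the highest probability, we could sample the next token among the top-k most likely ones. This is a minimal sketch of top-k sampling using the softmax we just computed (the value k=50 is an arbitrary choice):

```
k = 50  # arbitrary number of candidates to keep
top_k_probs, top_k_ids = torch.topk(softmax_logits, k=k, dim=0)

# Re-normalize the top-k probabilities and sample one token from them
top_k_probs = top_k_probs.detach()
top_k_probs = top_k_probs / top_k_probs.sum()
sampled_index = torch.multinomial(top_k_probs, num_samples=1)
sampled_token_id = top_k_ids[sampled_index]

print(tokenizer.decode(sampled_token_id.item()))
```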

Let's put everything into a function and see what comes out if we generate a few tokens

def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    next_token_logits = logits[0, -1]
    softmax_logits = torch.softmax(next_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id

def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text

Now we generate text

generate_greedy_text("Hello, my dog is cute and", tokenizer, model, device)

'Hello, my dog is cute andi." '

The output is quite repetitive, as we already saw in the post on ways to generate text.
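
One way to reduce this repetitiveness, as discussed in that post, is to let `generate` sample instead of doing a greedy search. A possible configuration (the specific values are arbitrary choices) would be:

```
output_tokens = model.generate(
    **input_tokens,
    max_new_tokens=20,
    do_sample=True,    # sample instead of greedy decoding
    top_k=50,          # keep only the 50 most likely tokens
    top_p=0.95,        # nucleus sampling
    temperature=0.9,   # soften the distribution a bit
)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```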

Fine tuning GPT

Loss calculation

Before we start fine tuning GPT1, let's look at one thing. Earlier, when we got the output of the model, we did this

outputs = model(**input_tokens)
outputs

CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],
[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],
[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],
...,
[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],
[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],
[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)

You can see that we get loss=None.

print(outputs.loss)

None

As we are going to need the loss to do the fine tuning, let's see how to obtain it.

If we go to the documentation of the forward method of OpenAIGPTLMHeadModel, we can see that it returns an object of type transformers.modeling_outputs.CausalLMOutput. And if we go to the documentation of transformers.modeling_outputs.CausalLMOutput, we can see that loss is only returned if labels is passed to the forward method.

If we go to the source code of the forward method, we see this code block

```
loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```

In other words, the `loss` is calculated as follows:

 * Shift of logits and labels: the logits (`lm_logits`) and the labels (`labels`) are shifted so that `tokens < n` predict `n`, i.e. from a position `n` the next token is predicted from the previous ones.
 * CrossEntropyLoss: an instance of the `CrossEntropyLoss()` function is created.
 * Flatten tokens: the logits and labels are then flattened using `view(-1, shift_logits.size(-1))` and `view(-1)`, respectively, so that they have the right shape for the loss function.
 * Loss calculation: finally, the loss is computed with `CrossEntropyLoss()` using the flattened logits and flattened labels as inputs.

In summary, the `loss` is calculated as the cross-entropy loss between the shifted and flattened logits and the shifted and flattened labels.

Therefore, if we pass the labels to the `forward` method, it will return the `loss`.

outputs = model(**input_tokens, labels=input_tokens.input_ids)
outputs.loss

tensor(4.2607, device='cuda:0', grad_fn=<NllLossBackward0>)
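
To convince ourselves, we can reproduce that calculation by hand with the logits we already have and check that it matches the loss returned by the model (a minimal sketch following the shift-and-flatten logic shown above):

```
from torch.nn import CrossEntropyLoss

logits = outputs.logits
labels = input_tokens.input_ids

# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

# Flatten the tokens and compute the cross entropy
loss_fct = CrossEntropyLoss()
manual_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
print(manual_loss)  # should match outputs.loss
```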

Dataset

For the training we are going to use the short-jokes-dataset, a dataset with 231 thousand jokes in English.

Download the dataset

from datasets import load_dataset

jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes

DatasetDict({
train: Dataset({
features: ['ID', 'Joke'],
num_rows: 231657
})
})

Let's take a look at it

jokes["train"][0]

{'ID': 1,
'Joke': '[me narrating a documentary about narrators] "I can't hear what they're saying cuz I'm talking"'}

Pytorch training

First let's see how the pure Pytorch training would be done.

Restart the notebook to avoid problems with the GPU memory.

import torch
from transformers import OpenAIGPTLMHeadModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoints = "openai-community/openai-gpt"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(checkpoints)
model = model.to(device)

Pytorch dataset

We create a Pytorch dataset class

from torch.utils.data import Dataset

class JokesDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset["train"])

    def __getitem__(self, item):
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens

We instantiate it

dataset = JokesDataset(jokes, tokenizer=tokenizer)

Here is an example

sentence, tokens = dataset[5]
print(sentence)
tokens.input_ids.shape, tokens.attention_mask.shape

JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>

(torch.Size([1, 30]), torch.Size([1, 30]))

Dataloader

We now create a Pytorch dataloader

from torch.utils.data import DataLoader

BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)

We see a batch

sentences, tokens = next(iter(joke_dataloader))
len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape

(1, torch.Size([1, 1, 29]), torch.Size([1, 1, 29]))

Training

We train the model. In this loop, jokes are packed one after another into sequences of up to MAX_SEQ_LEN tokens, and the optimizer only takes a step after BATCH_SIZE packed sequences have been processed (gradient accumulation).

from transformers import AdamW, get_linear_schedule_with_warmup
import tqdm

BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)
proc_seq_count = 0
batch_count = 0

tmp_jokes_tens = None

for epoch in range(EPOCHS):

    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")

    for sample in progress_bar:

        sentence, tokens = sample

        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)

        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue

        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke
            # as the start for next sequence
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:,1:]], dim=1)
                continue
        ################## Sequence ready, process it through the model ##################

        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()

        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            model.zero_grad()

        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})
        if batch_count == 10:
            batch_count = 0
      
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
        warnings.warn(
      
EPOCH 0 started==============================
      
Training: 100%|██████████| 231657/231657 [11:31<00:00, 334.88it/s, loss=2.88, lr=2.93e-6]
      
EPOCH 1 started==============================
      
Training: 100%|██████████| 231657/231657 [11:30<00:00, 335.27it/s, loss=2.49, lr=5.87e-6]
      
EPOCH 2 started==============================
      
Training: 100%|██████████| 231657/231657 [11:17<00:00, 341.75it/s, loss=2.57, lr=8.81e-6]
      
EPOCH 3 started==============================
      
Training: 100%|██████████| 231657/231657 [11:18<00:00, 341.27it/s, loss=2.41, lr=1.18e-5]
      
EPOCH 4 started==============================
      
Training: 100%|██████████| 231657/231657 [11:19<00:00, 341.04it/s, loss=2.49, lr=1.47e-5]
      

Inference

Let's see how well the model makes jokes.

sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")

decoded joke:
joke : what do you call a group of people who are not afraid of the dark? a group

You can see that if you pass it a sequence starting with the word joke, it returns a joke. But if you pass it a different sequence, it does not:

sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")

decoded joke:
my dog is cute and i'm not sure if i should be offended or not. "
