This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.

Improving Language Understanding by Generative Pre-Training is the GPT1 paper. Before reading the post, it helps to recall the context. Before GPT, language models were based on recurrent neural networks (RNNs), which worked relatively well on specific tasks but whose pre-training could not easily be reused and fine-tuned for other tasks. In addition, they had limited memory, so with very long sentences they did not remember the beginning of the sentence very well.

Architecture

Before we talk about the architecture of GPT1, let's remember what the Transformer architecture looks like.

transformer architecture

GPT1 is based on the decoder of the transformer. Since there is no encoder, the architecture of a single decoder block is as follows

decoder architecture

The attention mechanism between the encoder output and the decoder (cross-attention) is removed.

In the GPT1 paper they propose the following architecture

gpt1 architecture

This corresponds to the transformer decoder block we have just seen, stacked 12 times
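In a decoder-only model, each position may attend only to itself and to earlier positions. A minimal, dependency-free sketch of that causal masking follows. The function names and the toy scores are illustrative assumptions, not code from the paper or from any library.

```python
import math

def causal_mask(seq_len):
    """Row i is True at column j when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Softmax over each row of `scores`, ignoring masked-out (False) columns."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy attention scores for a 4-token sequence (all equal, to make the mask visible)
scores = [[0.0] * 4 for _ in range(4)]
attention = masked_softmax(scores, causal_mask(4))
for row in attention:
    print([round(w, 2) for w in row])
```

With equal scores, the first token attends only to itself, the second splits its attention between the first two tokens, and so on. No position ever receives weight from a later one.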

Paper abstract

The most interesting ideas in the paper are:

  • The model is first trained on a large corpus of unlabeled text, which produces a high-capacity language model.
  • Fine-tuning is then performed on a supervised target NLP task with a labeled dataset. In addition, during fine-tuning the model is optimized not only on the supervised task but also on how well it predicts the next token. This auxiliary objective helps to improve the generalization of the supervised model and makes it converge faster.
  • Although we have already mentioned it, the paper uses the transformer architecture, whereas until then RNNs were used for language models. The benefit is that what is learned in the first training (on the unlabeled text corpus) is easier to transfer to supervised tasks. That is, thanks to transformers it was possible to train on a whole corpus of text and then fine-tune on supervised tasks.
  • They evaluated the model on four types of language understanding tasks:
    • Natural language inference
    • Question answering
    • Semantic similarity
    • Text classification
  • The general model (the one trained on the entire unlabeled text corpus) outperforms discriminatively trained RNN models that employ architectures designed specifically for each task, significantly improving the state of the art in 9 of the 12 tasks studied. They also analyze the "zero-shot" behaviors of the pre-trained model in four different settings and show that it acquires useful linguistic knowledge for downstream tasks.
  • In recent years, researchers had demonstrated the benefits of using embeddings trained on unlabeled corpora to improve performance on a variety of tasks. However, these approaches mainly transfer word-level information, whereas transformers trained on large unlabeled text corpora capture higher-level, sentence-level semantics.
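The two training stages summarized in the list can be written as objectives. The notation below follows the paper as I recall it: U is the unlabeled token corpus, k the context window, Θ the model parameters, C the labeled dataset of input token sequences x^1, ..., x^m with label y, and λ a weight for the auxiliary language-modeling term (the paper reports using 0.5).

```latex
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)

L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```

L_1 is maximized during pre-training, and L_3 during fine-tuning, which is where the next-token objective is kept as an auxiliary term.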

Text generation

Let's see how to generate text with a pre-trained GPT1

First you have to install ftfy and spacy via

pip install ftfy spacy

Once installed, you must download the spacy language model you wish to use. For example, to download the English model, you can run:

python -m spacy download en_core_web_sm

To generate text we will use the model from the GPT1 repository of Hugging Face.

We import the libraries

import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, AutoTokenizer

You may notice that we have imported both OpenAIGPTTokenizer and AutoTokenizer. The GPT1 model card says to use OpenAIGPTTokenizer, but in the transformers library post we explained that AutoTokenizer should be used to load the tokenizer. So let's try both

ckeckpoints = "openai-community/openai-gpt"
tokenizer = OpenAIGPTTokenizer.from_pretrained(ckeckpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
input_tokens = tokenizer("Hello, my dog is cute and", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, my dog is cute and", return_tensors="pt")
print(f"input tokens: {input_tokens}")
print(f"input auto tokens: {input_auto_tokens}")
input tokens:
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
input auto tokens:
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

As you can see, both tokenizers produce the same tokens. So, to keep the code general and avoid having to change it if ckeckpoints changes, let's use AutoTokenizer.

We then create the device, the tokenizer and the model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(ckeckpoints).to(device)

Now that we have instantiated the model, let's see how many parameters it has

params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")
Number of parameters: 117M

In an era of models with billions of parameters, GPT1 had only 117 million.
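As a rough cross-check of that figure, the parameters can be counted by hand. The sketch below assumes the hyperparameters reported in the paper (12 blocks, hidden size 768, feed-forward size 3072, 512 positions), the 40478-token vocabulary that appears further down in this post, and an output layer that shares its weights with the token embeddings. The helper names are illustrative.

```python
# Assumed GPT1 hyperparameters (from the paper and the tokenizer vocabulary size)
vocab_size = 40478   # number of tokens in the vocabulary
n_positions = 512    # maximum sequence length
d_model = 768        # hidden size
n_layers = 12        # decoder blocks
d_ff = 4 * d_model   # feed-forward inner size (3072)

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Parameters of a layer norm: scale plus shift."""
    return 2 * n

# Token embeddings plus learned position embeddings
embeddings = vocab_size * d_model + n_positions * d_model
block = (
    linear(d_model, 3 * d_model)  # query, key and value projections
    + linear(d_model, d_model)    # attention output projection
    + layer_norm(d_model)
    + linear(d_model, d_ff)       # feed-forward expansion
    + linear(d_ff, d_model)       # feed-forward contraction
    + layer_norm(d_model)
)
total = embeddings + n_layers * block
print(f"{total:,} parameters (~{round(total / 1e6)}M)")
```

Under these assumptions the count lands at roughly 117M, in line with the number printed above. About a quarter of it sits in the token embedding table.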

We create the input tokens for the model

input_sentence = "Hello, my dog is cute and"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

We pass them to the model to generate the output tokens.

output_tokens = model.generate(**input_tokens)
print(f"output tokens: \n{output_tokens}")
output tokens:
tensor([[ 3570,   240,   547,  2585,   544,  4957,   488,   249,   719,   797,
          485,   921,   575,   562,   246,  1671,   239,   244, 40477,   244]],
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/generation/ UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.

We decode the tokens to obtain the output sentence

decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(f"decoded output: {decoded_output}")
decoded output:
hello, my dog is cute and i'm going to take him for a walk. "

We have now generated text with GPT1

Generating text token by token

We have used model.generate to generate the output tokens all at once, but let's see how to generate them one by one. To do this, instead of using model.generate we are going to call model directly, which invokes the model.forward method.

outputs = model(**input_tokens)
CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],
[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],
[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],
[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],
[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],
[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)

The output contains a lot of data. First, let's look at its keys

outputs.keys()

In this case we only have the logits of the model, let's see their size

logits = outputs.logits
logits.shape
torch.Size([1, 7, 40478])

Let's see how many tokens we had at the input.

input_tokens.input_ids.shape
torch.Size([1, 7])

The output has one vector of logits per input token. This is expected: the logits at each position are the prediction for the token that follows that position, so the logits at the last position predict the next token of the sentence.

We take the logits of the last position of the output

nex_token_logits = logits[0,-1]

There are 40478 logits in total, i.e. the vocabulary has 40478 tokens. We have to find the token with the highest probability, so we first compute the softmax

softmax_logits = torch.softmax(nex_token_logits, dim=0)
next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
next_token_prob, next_token_id
(tensor(0.1898, device='cuda:0', grad_fn=<MaxBackward0>),
tensor(249, device='cuda:0'))
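The same two steps, softmax followed by picking the maximum, can be sketched without PyTorch. The logits below are made-up values for a five-token vocabulary, not GPT1 outputs.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 5-token vocabulary (made-up values)
toy_logits = [-1.0, 2.0, 0.5, 2.5, -3.0]
probs = softmax(toy_logits)
# Index of the most probable token
next_token_id = max(range(len(probs)), key=probs.__getitem__)
print(next_token_id, round(probs[next_token_id], 4))
```

Since softmax preserves the ordering of the logits, the selected token is the same as the one with the highest raw logit. The softmax is only needed when the probability itself is of interest.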

We have obtained the next token, now we decode it

tokenizer.decode(next_token_id.item())

We obtained the next token using the greedy method, i.e. by picking the token with the highest probability. But as we saw in the transformers library post, text can also be generated with sampling, top-k, top-p, etc.
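As a contrast to the greedy choice, here is a minimal sketch of top-k sampling: keep the k highest-scoring tokens and sample among them in proportion to their softmax weights. The function name, the seed and the toy logits are illustrative assumptions.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample a token id among the k highest-scoring tokens."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top_ids)
    # Unnormalised softmax weights restricted to the top-k tokens
    weights = [math.exp(logits[i] - peak) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_logits = [-1.0, 2.0, 0.5, 2.5, -3.0]  # made-up values
samples = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(10)]
print(samples)
```

With k=2 only the two best tokens (ids 1 and 3 here) can ever be drawn, and k=1 reduces to the greedy method.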

Let's put everything into a function and see what comes out if we generate a few tokens

def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    nex_token_logits = logits[0,-1]
    softmax_logits = torch.softmax(nex_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id
def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text

Now we generate text

generate_greedy_text("Hello, my dog is cute and", tokenizer, model, device)
'Hello, my dog is cute andi." '

Greedy decoding tends to produce repetitive text, as we already saw in the post on the ways to generate text
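The function above appends decoded text and re-tokenizes the whole string at every step. The "andi" in the output suggests that the space between tokens is lost on that round trip, which changes the tokenization of the next step. A sketch of an alternative, assuming tokenization happens once and token ids are appended instead. The toy_next_token function is a hypothetical stand-in for the model.

```python
def generate_greedy_ids(input_ids, next_token_fn, max_new_tokens=20):
    """Greedy decoding that appends token ids instead of decoded text."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        ids.append(next_token_fn(ids))
    return ids

def toy_next_token(ids):
    """Hypothetical stand-in for the model: predicts (last id + 1) modulo 10."""
    return (ids[-1] + 1) % 10

print(generate_greedy_ids([7, 8], toy_next_token, max_new_tokens=5))
```

With the real model, next_token_fn would wrap a forward pass and take the most probable token at the last position, and the full list of ids would be decoded once at the end.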

Fine tuning GPT

Loss calculation

Before we start fine-tuning GPT1, let's look at one thing. Earlier, we obtained the output of the model like this

outputs = model(**input_tokens)
CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],
[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],
[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],
[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],
[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],
[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)

You can see that we get loss=None.


As we are going to need the loss to do the fine tuning, let's see how to obtain it.

If we go to the documentation of the forward method of OpenAIGPTLMHeadModel, we can see that it returns an object of type transformers.modeling_outputs.CausalLMOutput. The documentation of transformers.modeling_outputs.CausalLMOutput in turn says that loss is returned when labels is passed to the forward method.

If we go to the source code of the forward method, we see this code block

loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

In other words, the loss is calculated as follows

  • Shift of logits and labels: the logits (lm_logits) drop their last position and the labels (labels) drop their first one, so that the logits at position n are compared with the token at position n+1, i.e. each token is predicted from the ones before it.
  • CrossEntropyLoss: an instance of CrossEntropyLoss() is created.
  • Flatten tokens: logits and labels are flattened using view(-1, shift_logits.size(-1)) and view(-1), respectively, so that the logits form a two-dimensional tensor (tokens by vocabulary) and the labels a one-dimensional one, which is what the loss function expects.
  • Loss calculation: finally, the loss is calculated using CrossEntropyLoss() with the flattened logits and flattened labels as inputs.

In summary, loss is calculated as the cross-entropy loss between the shifted and flattened logits and the shifted and flattened labels.

Therefore, if we pass the labels to the forward method, it will return the loss.

outputs = model(**input_tokens, labels=input_tokens.input_ids)
outputs.loss
tensor(4.2607, device='cuda:0', grad_fn=<NllLossBackward0>)
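The shift-and-compare logic can be reproduced on plain Python lists. This is a minimal sketch for a single sequence, with made-up uniform logits and the mean reduction that CrossEntropyLoss applies by default. The function name is illustrative.

```python
import math

def causal_lm_loss(logits, labels):
    """Mean cross-entropy where the logits at position t are scored against labels[t + 1]."""
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    losses = []
    for row, target in zip(shift_logits, shift_labels):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[target])  # negative log-probability of the target
    return sum(losses) / len(losses)

# Toy example: 3 positions, vocabulary of 4 tokens, uniform logits (made-up values)
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_labels = [2, 0, 3]
loss = causal_lm_loss(toy_logits, toy_labels)
print(round(loss, 4))
```

With uniform logits every token has probability 1/4, so the loss equals log(4), about 1.386. Note that three positions yield only two prediction targets because of the shift.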

Dataset

For training we are going to use short-jokes-dataset, a dataset with 231 thousand jokes in English.

Download the dataset

from datasets import load_dataset
jokes = load_dataset("Maximofn/short-jokes-dataset")
train: Dataset({
features: ['ID', 'Joke'],
num_rows: 231657

Let's take a look at it

{'ID': 1,
'Joke': '[me narrating a documentary about narrators] "I can't hear what they're saying cuz I'm talking"'}

PyTorch training

First let's see how training would be done in pure PyTorch.

Restart the notebook to avoid problems with the GPU memory.

import torch
from transformers import OpenAIGPTLMHeadModel, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckeckpoints = "openai-community/openai-gpt"
tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(ckeckpoints)
model = model.to(device)

PyTorch dataset

We create a Pytorch dataset class

from torch.utils.data import Dataset
class JokesDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer
    def __len__(self):
        return len(self.dataset["train"])
    def __getitem__(self, item):
        # Wrap each joke with the prefix and the end-of-text marker
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens

We instantiate it

dataset = JokesDataset(jokes, tokenizer=tokenizer)

Here is an example

sentence, tokens = dataset[5]
tokens.input_ids.shape, tokens.attention_mask.shape
JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>
(torch.Size([1, 30]), torch.Size([1, 30]))

Dataloader

We now create a Pytorch dataloader

from torch.utils.data import DataLoader
BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)

Let's look at a batch

sentences, tokens = next(iter(joke_dataloader))
len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape
JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>
(1, torch.Size([1, 1, 29]), torch.Size([1, 1, 29]))
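The extra dimension in these shapes appears because the tokenizer already returns tensors of shape [1, length] and the dataloader adds its own batch dimension on top. With BS = 1 this works, but a larger batch would need sequences of equal length, since tensors of different lengths cannot be stacked. A minimal sketch of the padding such a batch would need, using plain lists. The pad_id value is a placeholder assumption, not a token defined by the GPT1 tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Two tokenized sentences of different lengths (made-up ids)
batch = [[3570, 240, 547], [2585, 544]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

In PyTorch this kind of logic would typically live in a function passed to the DataLoader through its collate_fn argument.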

Training
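The training loop in this section concatenates several tokenized jokes into one sequence until the next joke would push it past MAX_SEQ_LEN, and only then sends the sequence through the model. A dependency-free sketch of that packing idea, with plain lists standing in for tensors. It is a simplification: unlike the loop, it keeps every token of each joke and also emits the last, possibly short, pack.

```python
def pack_sequences(token_lists, max_len):
    """Greedily concatenate token lists into packs of at most max_len tokens."""
    packs, current = [], []
    for tokens in token_lists:
        if len(tokens) > max_len:
            continue                      # skip samples that cannot fit at all
        if len(current) + len(tokens) > max_len:
            packs.append(current)         # pack is full: emit it
            current = list(tokens)        # the sample that did not fit starts the next pack
        else:
            current = current + list(tokens)
    if current:
        packs.append(current)             # keep the last, possibly short, pack
    return packs

# Made-up token ids; the fourth "joke" is too long and gets skipped
toy_jokes = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10] * 12, [11]]
print(pack_sequences(toy_jokes, max_len=6))
```

Packing short jokes this way means each forward pass processes close to MAX_SEQ_LEN tokens instead of a few dozen.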

from transformers import AdamW, get_linear_schedule_with_warmup
import tqdm
BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)
proc_seq_count = 0
batch_count = 0
tmp_jokes_tens = None
for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")
    for sample in progress_bar:
        sentence, tokens = sample
        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)
        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue
        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke
            # as the start for next sequence
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:,1:]], dim=1)
                continue
        ################## Sequence ready, process it through the model ##################
        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()
        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            # Gradients of BATCH_SIZE sequences have been accumulated: update the weights
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})
        if batch_count == 10:
            batch_count = 0
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/ FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
EPOCH 0 started==============================
Training: 100%|██████████| 231657/231657 [11:31<00:00, 334.88it/s, loss=2.88, lr=2.93e-6]
EPOCH 1 started==============================
Training: 100%|██████████| 231657/231657 [11:30<00:00, 335.27it/s, loss=2.49, lr=5.87e-6]
EPOCH 2 started==============================
Training: 100%|██████████| 231657/231657 [11:17<00:00, 341.75it/s, loss=2.57, lr=8.81e-6]
EPOCH 3 started==============================
Training: 100%|██████████| 231657/231657 [11:18<00:00, 341.27it/s, loss=2.41, lr=1.18e-5]
EPOCH 4 started==============================
Training: 100%|██████████| 231657/231657 [11:19<00:00, 341.04it/s, loss=2.49, lr=1.47e-5]
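The lr values in the progress bars grow from epoch to epoch because the schedule is still in its warmup phase. The sketch below reproduces the multiplier as I understand get_linear_schedule_with_warmup to compute it; the function name is illustrative and the step count of 489 is an assumption chosen to match the first logged value.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000

# During warmup the rate grows linearly with the number of optimizer steps
print(LEARNING_RATE * linear_warmup_factor(489, WARMUP_STEPS, -1))
# With num_training_steps=-1 the factor collapses to 0 once warmup ends
print(linear_warmup_factor(5000, WARMUP_STEPS, -1))
```

Under these assumptions, the 2.93e-6 logged after the first epoch corresponds to roughly 490 optimizer steps, so five epochs never leave the 5000-step warmup. If training ran longer, the learning rate would drop to zero after warmup, so passing the real number of training steps would be the safer choice.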

Inference

Let's see how well the model makes jokes.

sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")
decoded joke:
joke : what do you call a group of people who are not afraid of the dark? a group

You can see that when you pass it a sequence starting with the word joke, it returns a joke. But if you pass it a different sequence, it does not

sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")
decoded joke:
my dog is cute and i'm not sure if i should be offended or not. "

