GPT-2 – Language Models are Unsupervised Multitask Learners


Paper

This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

Language Models are Unsupervised Multitask Learners is the GPT-2 paper. GPT-2 is the second version of the model, the successor to GPT-1, which we have already seen.

Architecture

Before we talk about the GPT-2 architecture, let's remember what the GPT-1 architecture was like.

gpt1 architecture

GPT-2 uses a transformer-based architecture, the same as GPT-1, with the following sizes

Parameters | Layers | d_model
-----------|--------|--------
117M       | 12     | 768
345M       | 24     | 1024
762M       | 36     | 1280
1542M      | 48     | 1600

The smallest model is equivalent to the original GPT, and the second smallest is equivalent to the largest BERT model. The largest model has more than an order of magnitude more parameters than GPT-1.

In addition, the following architectural modifications were made

  • A normalization layer is added before the attention block. This can help stabilize the training of the model and improve its ability to learn deeper representations. By normalizing the inputs to each block, variability in the outputs is reduced and model training is facilitated.
  • An additional normalization has been added after the final self-attention block. This can help reduce variability in model outputs and improve model stability.
  • In most models, the layer weights are initialized randomly, following a normal or uniform distribution. In GPT-2, however, the authors use a modified initialization that takes the depth of the model into account: because the contribution of the residual layers accumulates as the model gets deeper, the residual layer weights are scaled at initialization by a factor of 1/√N, where N is the number of residual layers. This keeps the outputs of each layer under control, helps stabilize training, and improves the model's ability to learn deeper representations (see the sketch after this list).
  • The vocabulary size has been expanded to 50,257. This means that the model can learn to represent a larger set of words and tokens.
  • The context size has been increased from 512 to 1024 tokens. This allows the model to take into account a larger context when generating text.
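
The following is a minimal sketch of that initialization trick, using a hypothetical toy residual block (ToyBlock, N_LAYERS and D_MODEL are names made up for illustration, not the real GPT-2 code):

import math
import torch
import torch.nn as nn

N_LAYERS = 12   # number of residual layers in this toy example
D_MODEL = 768   # embedding dimension

class ToyBlock(nn.Module):
    """A simplified residual block: x + proj(x)."""
    def __init__(self, d_model=D_MODEL):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.proj(x)

blocks = nn.ModuleList([ToyBlock() for _ in range(N_LAYERS)])

# GPT-2-style trick: scale the weights of the residual projections
# at initialization by 1/sqrt(N), where N is the number of residual layers
with torch.no_grad():
    for block in blocks:
        block.proj.weight *= 1.0 / math.sqrt(N_LAYERS)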

![GPT1 vs GPT-2 architecture](https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/GPT1_vs_GPT2_architecture.webp)

Paper abstract

The most interesting ideas in the paper are:

  • For pre-training the model they considered using a diverse and almost unlimited source of text, web scraping such as Common Crawl. However, they found that much of it was of very poor quality, so they used the WebText dataset, which also comes from web scraping but with a quality filter, such as only keeping pages linked from Reddit posts with a minimum amount of karma, etc. They also removed the text coming from Wikipedia, as it could be repeated in other pages.
  • They used a BPE tokenizer, which we explained in a previous post (a quick example is shown below).
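
Below is a minimal sketch using the Hugging Face implementation of the GPT-2 BPE tokenizer (not the original OpenAI code):

from transformers import AutoTokenizer

# Load the GPT-2 BPE tokenizer, which has a vocabulary of 50,257 tokens
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Language models are unsupervised multitask learners"
tokens = tokenizer.tokenize(text)

print(tokens)                                   # subword pieces produced by BPE
print(tokenizer.convert_tokens_to_ids(tokens))  # their ids in the vocabulary
print(tokenizer.vocab_size)                     # 50257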

Text generation

Let's see how to generate text with a pre-trained GPT-2

To generate text we will use the model from the GPT-2 repository of Hugging Face.

Text generation with pipeline

With this model we can use the transformers pipeline.

from transformers import pipeline

checkpoints = "openai-community/gpt2-xl"
generator = pipeline('text-generation', model=checkpoints)
output = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
for i, o in enumerate(output):
    print(f"Output {i+1}: {o['generated_text']}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
      Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      
Output 1: Hello, I'm a language model, and I want to change the way you read
      
      A little in today's post I want to talk about
      Output 2: Hello, I'm a language model, with two roles: the language model and the lexicographer-semantics expert. The language models are going
      Output 3: Hello, I'm a language model, and this is your brain. Here is your brain, and all this data that's stored in there, that
      Output 4: Hello, I'm a language model, and I like to talk... I want to help you talk to your customers
      
      Are you using language model
      Output 5: Hello, I'm a language model, I'm gonna tell you about what type of language you're using. We all know a language like this,
      

Text generation with automodel

But if we want to use AutoModel, we can do the following. First we create the tokenizers:

import torch
from transformers import GPT2Tokenizer, AutoTokenizer

checkpoints = "openai-community/gpt2-xl"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(checkpoints)

As with GPT-1 we can import GPT2Tokenizer and AutoTokenizer. This is because the model card of GPT-2 says to use GPT2Tokenizer, but in the post on the transformers library we explain that you should use AutoTokenizer to load the tokenizer. So let's try both.

input_tokens = tokenizer("Hello, I'm a language model,", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, I'm a language model,", return_tensors="pt")

print(f"input tokens: {input_tokens}")
print(f"input auto tokens: {input_auto_tokens}")

input tokens:
{'input_ids': tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
input auto tokens:
{'input_ids': tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

As you can see, with the two tokenizers you get the same tokens. So, to make the code more general, so that if you change the checkpoint you don't have to change the code, let's use AutoTokenizer.

We then create the device, the tokenizer and the model

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoints = "openai-community/gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = GPT2LMHeadModel.from_pretrained(checkpoints).to(device)

Since we have instantiated the model, let's see how many parameters it has

params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")

Number of parameters: 1558M

As we can see we have loaded the 1.5B parameter model, but if we wanted to load the other models we would have to do the following

checkpoints_small = "openai-community/gpt2"
model_small = GPT2LMHeadModel.from_pretrained(checkpoints_small)
print(f"Number of parameters of small model: {round(sum(p.numel() for p in model_small.parameters())/1e6)}M")

checkpoints_medium = "openai-community/gpt2-medium"
model_medium = GPT2LMHeadModel.from_pretrained(checkpoints_medium)
print(f"Number of parameters of medium model: {round(sum(p.numel() for p in model_medium.parameters())/1e6)}M")

checkpoints_large = "openai-community/gpt2-large"
model_large = GPT2LMHeadModel.from_pretrained(checkpoints_large)
print(f"Number of parameters of large model: {round(sum(p.numel() for p in model_large.parameters())/1e6)}M")

checkpoints_xl = "openai-community/gpt2-xl"
model_xl = GPT2LMHeadModel.from_pretrained(checkpoints_xl)
print(f"Number of parameters of xl model: {round(sum(p.numel() for p in model_xl.parameters())/1e6)}M")

Number of parameters of small model: 124M
Number of parameters of medium model: 355M
Number of parameters of large model: 774M
Number of parameters of xl model: 1558M

We create the input tokens for the model

input_sentence = "Hello, I'm a language model,"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
input_tokens

{'input_ids': tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

We pass them to the model to generate the output tokens.

output_tokens = model.generate(**input_tokens)

print(f"output tokens: \n{output_tokens}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      /home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/generation/utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
        warnings.warn(
      
output tokens: 
      tensor([[15496,    11,   314,  1101,   257,  3303,  2746,    11,   290,   314,
                1101,  1016,   284,  1037,   345,   351,   534,  1917,    13,   198]],
             device='cuda:0')
      

We decode the tokens to obtain the output sentence

decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(f"decoded output: {decoded_output}")

decoded output:
Hello, I'm a language model, and I'm going to help you with your problem.

We have now managed to generate text with GPT-2.

Generate text token by token

We have used model.generate to generate the output tokens all at once, but let's see how to generate them one by one. To do this, instead of using model.generate we are going to use model, which actually calls the model.forward method.

outputs = model(**input_tokens)
outputs

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[ 6.6288, 5.1421, -0.8002, ..., -6.3998, -4.4113, 1.8240],
[ 2.7250, 1.9371, -1.2293, ..., -5.0979, -5.1617, 2.2694],
[ 2.6891, 4.3089, -1.6074, ..., -7.6321, -2.0448, 0.4042],
...,
[ 6.0513, 3.8020, -2.8080, ..., -6.7754, -8.3176, 1.1541],
[ 6.8402, 5.6952, 0.2002, ..., -9.1281, -6.7818, 2.7576],
[ 1.0255, -0.2201, -2.5484, ..., -6.2137, -7.2322, 0.1665]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[ 0.4779, 0.7671, -0.7532, ..., -0.3551, 0.4590, 0.3073],
[ 0.2034, -0.6033, 0.2484, ..., 0.7760, -0.3546, 0.0198],
[-0.1968, -0.9029, 0.5570, ..., 0.9985, -0.5028, -0.3508],
...,
[-0.5007, -0.4009, 0.1604, ..., -0.3693, -0.1158, 0.1320],
[-0.4854, -0.1369, 0.7377, ..., -0.8043, -0.1054, 0.0871],
[ 0.1610, -0.8358, -0.5534, ..., 0.9951, -0.3085, 0.4574]],
[[ 0.6288, -0.1374, -0.3467, ..., -1.0003, -1.1518, 0.3114],
[-1.7269, 1.2920, -0.0734, ..., 1.0572, 1.4698, -2.0412],
[ 0.2714, -0.0670, -0.4769, ..., 0.6305, 0.6890, -0.8158],
...,
[-0.0499, -0.0721, 0.4580, ..., 0.6797, 0.2331, 0.0210],
[-0.1894, 0.2077, 0.6722, ..., 0.6938, 0.2104, -0.0574],
[ 0.3661, -0.0218, 0.2618, ..., 0.8750, 1.2205, -0.6103]],
[[ 0.5964, 1.1178, 0.3604, ..., 0.8426, 0.4881, -0.4094],
[ 0.3186, -0.3953, 0.2687, ..., -0.1110, -0.5640, 0.5900],
...,
[ 0.2092, 0.3898, -0.6061, ..., -0.2859, -0.3136, -0.1002],
[ 0.0539, 0.8941, 0.3423, ..., -0.6326, -0.1053, -0.6679],
[ 0.5628, 0.6687, -0.2720, ..., -0.1073, -0.9792, -0.0302]]]],
device='cuda:0', grad_fn=<PermuteBackward0>))), hidden_states=None, attentions=None, cross_attentions=None)

We see that it returns a lot of data, so first let's look at the output keys:

outputs.keys()

odict_keys(['logits', 'past_key_values'])

In this case we get the logits and the past_key_values of the model. Let's look at the size of the logits:

logits = outputs.logits
logits.shape

torch.Size([1, 8, 50257])

Let's see how many tokens we had at the input:

input_tokens.input_ids.shape

torch.Size([1, 8])

At the output we have the same number of logits as at the input. This is normal, since the model produces one vector of logits for each input position.

We obtain the logits of the last position of the output:

next_token_logits = logits[0,-1]
next_token_logits.shape

torch.Size([50257])

There are a total of 50257 logits, one for each token in the vocabulary. We have to see which token has the highest probability, so we first apply the softmax:

softmax_logits = torch.softmax(next_token_logits, dim=0)
softmax_logits.shape

torch.Size([50257])

Once we have calculated the softmax, we obtain the most probable token: the one with the highest value after the softmax.

next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
next_token_prob, next_token_id

(tensor(0.1732, device='cuda:0', grad_fn=<MaxBackward0>),
tensor(290, device='cuda:0'))

We have obtained the next token; now we decode it.

tokenizer.decode(next_token_id.item())

' and'

We have obtained the next token using the greedy method, that is, taking the token with the highest probability. But as we already saw in the post on the transformers library about ways to generate text, you can also do sampling, top-k, top-p, etc. A sketch of top-k sampling is shown below.
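
Here is a minimal sketch of top-k sampling applied to the same logits (it assumes the next_token_logits tensor and the tokenizer from the previous cells; k and temperature are arbitrary values chosen for the example):

import torch

# Sample the next token from the k most probable tokens
# instead of always taking the single most probable one
def sample_top_k(next_token_logits, k=50, temperature=1.0):
    top_k_logits, top_k_ids = torch.topk(next_token_logits / temperature, k)
    probs = torch.softmax(top_k_logits, dim=-1).detach()  # detach: we only sample, no gradients needed
    choice = torch.multinomial(probs, num_samples=1)
    return top_k_ids[choice]

sampled_token_id = sample_top_k(next_token_logits, k=50)
tokenizer.decode(sampled_token_id.item())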

Let's put everything into a function and see what comes out if we generate a few tokens

def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    next_token_logits = logits[0,-1]
    softmax_logits = torch.softmax(next_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id

def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text

Now we generate text

generate_greedy_text("Hello, I'm a language model,", tokenizer, model, device)

"Hello, I'm a language model, and I'm going to help you with your problem. I'm going to help you"

The output is rather repetitive, as we already saw in the post on ways to generate text. Still, it is better than the output we obtained with GPT-1.

Architecture of the models available at Hugging Face

If we go to the Hugging Face documentation of GPT2 we can see that we have the options GPT2Model, GPT2LMHeadModel, GPT2ForSequenceClassification, GPT2ForQuestionAnswering, GPT2ForTokenClassification. Let's take a look at them.

import torch

checkpoints = "openai-community/gpt2"

GPT2Model

This is the base model, i.e. the transformer decoder.

from transformers import GPT2Model

model = GPT2Model.from_pretrained(checkpoints)
model

GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

As you can see, the output of this model is a tensor of dimension 768, which is the embedding dimension of the small model. If we had used the openai-community/gpt2-xl model, we would have obtained an output of dimension 1600.

Depending on the task you want to do, you would now have to add more layers on top.

We can add them by hand, but the weights of those layers would be initialized randomly, whereas if we use the Hugging Face models that already include these layers, the base model weights are pre-trained and only the new head needs to be trained.
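
For example, a minimal sketch of what adding a head by hand could look like (GPT2WithCustomHead is a hypothetical class written for illustration, not part of transformers; its head weights start randomly initialized):

import torch.nn as nn
from transformers import GPT2Model

class GPT2WithCustomHead(nn.Module):
    def __init__(self, checkpoint="openai-community/gpt2", num_labels=5):
        super().__init__()
        self.transformer = GPT2Model.from_pretrained(checkpoint)                # pre-trained weights
        self.head = nn.Linear(self.transformer.config.hidden_size, num_labels)  # random weights

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.transformer(input_ids, attention_mask=attention_mask).last_hidden_state
        # Use the hidden state of the last token as the representation of the sequence
        return self.head(hidden_states[:, -1, :])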

GPT2LMHeadModel

It is the one we have used before to generate text

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(checkpoints)
model

GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

As you can see, it is the same model as before, except that at the end a linear layer (lm_head) has been added with an input of 768 (the embeddings) and an output of 50257, which corresponds to the size of the vocabulary.
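
A quick check we can do (a sketch, assuming the default weight-tying configuration of the Hugging Face implementation) is that this lm_head layer shares its weights with the token embedding matrix wte, so it does not add new parameters:

# If the weights are tied, lm_head and wte point to the same underlying tensor
print(model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr())  # expected: True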

GPT2ForSequenceClassification

This option is for classifying text sequences. In this case we have to specify with num_labels the number of classes we want to classify into.

from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained(checkpoints, num_labels=5)
model

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
      You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
      
Out[10]:
GPT2ForSequenceClassification(
        (transformer): GPT2Model(
          (wte): Embedding(50257, 768)
          (wpe): Embedding(1024, 768)
          (drop): Dropout(p=0.1, inplace=False)
          (h): ModuleList(
            (0-11): 12 x GPT2Block(
              (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (attn): GPT2Attention(
                (c_attn): Conv1D()
                (c_proj): Conv1D()
                (attn_dropout): Dropout(p=0.1, inplace=False)
                (resid_dropout): Dropout(p=0.1, inplace=False)
              )
              (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): GPT2MLP(
                (c_fc): Conv1D()
                (c_proj): Conv1D()
                (act): NewGELUActivation()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (score): Linear(in_features=768, out_features=5, bias=False)
      )

Now, instead of having an output of 50257, we have an output of 5, which is the number we entered in num_labels, i.e. the number of classes we want to classify into.

GPT2ForQuestionAnswering

In the transformers post we explained that in this mode you pass a context to the model and a question about the context and it returns the answer.

from transformers import GPT2ForQuestionAnswering

model = GPT2ForQuestionAnswering.from_pretrained(checkpoints)
model

Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
      You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
      
Out[13]:
GPT2ForQuestionAnswering(
        (transformer): GPT2Model(
          (wte): Embedding(50257, 768)
          (wpe): Embedding(1024, 768)
          (drop): Dropout(p=0.1, inplace=False)
          (h): ModuleList(
            (0-11): 12 x GPT2Block(
              (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (attn): GPT2Attention(
                (c_attn): Conv1D()
                (c_proj): Conv1D()
                (attn_dropout): Dropout(p=0.1, inplace=False)
                (resid_dropout): Dropout(p=0.1, inplace=False)
              )
              (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): GPT2MLP(
                (c_fc): Conv1D()
                (c_proj): Conv1D()
                (act): NewGELUActivation()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
      )

We see that the final layer outputs two values per token, which correspond to the logits for the start and the end of the answer span. A minimal usage sketch is shown below.
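
A minimal usage sketch (the question and context are made-up examples, and since the head is newly initialized the predicted span is meaningless until the model is fine tuned):

import torch
from transformers import AutoTokenizer, GPT2ForQuestionAnswering

checkpoints = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = GPT2ForQuestionAnswering.from_pretrained(checkpoints)

question = "What is my dog's name?"
context = "My dog's name is Bobby and he is 3 years old."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.start_logits.shape)  # one start logit per token
print(outputs.end_logits.shape)    # one end logit per token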

GPT2ForTokenClassification

In the transformers post we also explained what token classification is: it classifies which category each token belongs to. We have to pass the number of classes we want to classify with num_labels.

from transformers import GPT2ForTokenClassification

model = GPT2ForTokenClassification.from_pretrained(checkpoints, num_labels=5)
model

Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['classifier.bias', 'classifier.weight']
      You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
      
Out[2]:
GPT2ForTokenClassification(
        (transformer): GPT2Model(
          (wte): Embedding(50257, 768)
          (wpe): Embedding(1024, 768)
          (drop): Dropout(p=0.1, inplace=False)
          (h): ModuleList(
            (0-11): 12 x GPT2Block(
              (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (attn): GPT2Attention(
                (c_attn): Conv1D()
                (c_proj): Conv1D()
                (attn_dropout): Dropout(p=0.1, inplace=False)
                (resid_dropout): Dropout(p=0.1, inplace=False)
              )
              (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): GPT2MLP(
                (c_fc): Conv1D()
                (c_proj): Conv1D()
                (act): NewGELUActivation()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
        (classifier): Linear(in_features=768, out_features=5, bias=True)
      )

At the output we obtain, for each token, the 5 classes that we have specified with num_labels. A minimal usage sketch is shown below.
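
A minimal usage sketch showing the shape of the output (the head is newly initialized, so the logits are meaningless until the model is fine tuned):

import torch
from transformers import AutoTokenizer, GPT2ForTokenClassification

checkpoints = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = GPT2ForTokenClassification.from_pretrained(checkpoints, num_labels=5)

inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.shape)  # torch.Size([1, number_of_tokens, 5]): num_labels logits per token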

Fine tuning GPT-2

Fine tuning for text generation

First let's see how the pure Pytorch training would be done.

Loss calculation

Before we start fine tuning GPT-2, let's look at one thing. Previously, when we got the output of the model, we did this:

outputs = model(**input_tokens)
outputs

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[ 6.6288, 5.1421, -0.8002, ..., -6.3998, -4.4113, 1.8240],
[ 2.7250, 1.9371, -1.2293, ..., -5.0979, -5.1617, 2.2694],
[ 2.6891, 4.3089, -1.6074, ..., -7.6321, -2.0448, 0.4042],
...,
[ 6.0513, 3.8020, -2.8080, ..., -6.7754, -8.3176, 1.1541],
[ 6.8402, 5.6952, 0.2002, ..., -9.1281, -6.7818, 2.7576],
[ 1.0255, -0.2201, -2.5484, ..., -6.2137, -7.2322, 0.1665]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[ 0.4779, 0.7671, -0.7532, ..., -0.3551, 0.4590, 0.3073],
[ 0.2034, -0.6033, 0.2484, ..., 0.7760, -0.3546, 0.0198],
[-0.1968, -0.9029, 0.5570, ..., 0.9985, -0.5028, -0.3508],
...,
[-0.5007, -0.4009, 0.1604, ..., -0.3693, -0.1158, 0.1320],
[-0.4854, -0.1369, 0.7377, ..., -0.8043, -0.1054, 0.0871],
[ 0.1610, -0.8358, -0.5534, ..., 0.9951, -0.3085, 0.4574]],
[[ 0.6288, -0.1374, -0.3467, ..., -1.0003, -1.1518, 0.3114],
[-1.7269, 1.2920, -0.0734, ..., 1.0572, 1.4698, -2.0412],
[ 0.2714, -0.0670, -0.4769, ..., 0.6305, 0.6890, -0.8158],
...,
[-0.0499, -0.0721, 0.4580, ..., 0.6797, 0.2331, 0.0210],
[-0.1894, 0.2077, 0.6722, ..., 0.6938, 0.2104, -0.0574],
[ 0.3661, -0.0218, 0.2618, ..., 0.8750, 1.2205, -0.6103]],
[[ 0.5964, 1.1178, 0.3604, ..., 0.8426, 0.4881, -0.4094],
[ 0.3186, -0.3953, 0.2687, ..., -0.1110, -0.5640, 0.5900],
...,
[ 0.2092, 0.3898, -0.6061, ..., -0.2859, -0.3136, -0.1002],
[ 0.0539, 0.8941, 0.3423, ..., -0.6326, -0.1053, -0.6679],
[ 0.5628, 0.6687, -0.2720, ..., -0.1073, -0.9792, -0.0302]]]],
device='cuda:0', grad_fn=<PermuteBackward0>))), hidden_states=None, attentions=None, cross_attentions=None)

You can see that we get loss=None.

print(outputs.loss)

None

As we are going to need the loss to do the fine tuning, let's see how to obtain it.

If we go to the documentation of the method forward of GPT2LMHeadModel, we can see that it says that at the output it returns an object of type transformers.modeling_outputs.CausalLMOutputWithCrossAttentions, so if we go to the documentation of transformers.modeling_outputs.CausalLMOutputWithCrossAttentions, we can see that it says that it returns loss if labels is passed to the forward method.

If we go to the source code of the forward method, we see this code block

loss = None
if labels is not None:
    # move labels to correct device to enable model parallelism
    labels = labels.to(lm_logits.device)
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

In other words, the `loss` is calculated as follows:

  • Shift of logits and labels: the logits (`lm_logits`) and labels (`labels`) are shifted so that `tokens < n` predict `n`, i.e., the token at position `n` is predicted from the previous ones.
  • CrossEntropyLoss: an instance of the `CrossEntropyLoss()` function is created.
  • Flatten the tokens: logits and labels are flattened with `view(-1, shift_logits.size(-1))` and `view(-1)`, respectively, so that they have the shape the loss function expects.
  • Loss calculation: finally, the loss is calculated with `CrossEntropyLoss()` using the flattened logits and labels as inputs.

In summary, the `loss` is calculated as the cross-entropy loss between the shifted and flattened logits and the shifted and flattened labels.

Therefore, if we pass the labels to the `forward` method, it will return the `loss`.

outputs = model(**input_tokens, labels=input_tokens.input_ids)
outputs.loss

tensor(3.8028, device='cuda:0', grad_fn=<NllLossBackward0>)
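
As a sanity check, here is a minimal sketch that reproduces this value by hand (it assumes the outputs and input_tokens variables from the cells above):

from torch.nn import CrossEntropyLoss

lm_logits = outputs.logits
labels = input_tokens.input_ids

# Shift so that tokens < n predict n, then flatten and apply cross-entropy
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

loss_fct = CrossEntropyLoss()
manual_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
print(manual_loss)  # should match outputs.loss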

Dataset

For the training we are going to use the short-jokes-dataset, a dataset with 231 thousand English jokes.

Restart the notebook to avoid problems with the GPU memory.

Download the dataset

from datasets import load_dataset

jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes

DatasetDict({
train: Dataset({
features: ['ID', 'Joke'],
num_rows: 231657
})
})

Let's take a look at it

jokes["train"][0]

{'ID': 1,
'Joke': '[me narrating a documentary about narrators] "I can't hear what they're saying cuz I'm talking"'}

Model instance

In order to use the xl model, i.e. the one with 1.5B parameters, I change it to FP16 to avoid running out of memory.

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoints = "openai-community/gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = GPT2LMHeadModel.from_pretrained(checkpoints)
model = model.half().to(device)

Pytorch dataset

We create a Pytorch dataset class

from torch.utils.data import Dataset

class JokesDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset["train"])

    def __getitem__(self, item):
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens

We instantiate it

dataset = JokesDataset(jokes, tokenizer=tokenizer)

Here is an example

sentence, tokens = dataset[5]
print(sentence)
tokens.input_ids.shape, tokens.attention_mask.shape

JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>
      
Out[5]:
(torch.Size([1, 22]), torch.Size([1, 22]))

Dataloader

We now create a Pytorch dataloader

from torch.utils.data import DataLoader

BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)

We see a batch

sentences, tokens = next(iter(joke_dataloader))
len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape

(1, torch.Size([1, 1, 36]), torch.Size([1, 1, 36]))

Training

from transformers import AdamW, get_linear_schedule_with_warmup
import tqdm

BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 3e-6
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)
proc_seq_count = 0
batch_count = 0

tmp_jokes_tens = None

losses = []
lrs = []

for epoch in range(EPOCHS):

    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")

    for sample in progress_bar:

        sentence, tokens = sample

        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)

        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue

        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke
            # as the start for next sequence
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:,1:]], dim=1)
                continue
        ################## Sequence ready, process it through the model ##################

        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()

        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            model.zero_grad()

        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})
        losses.append(loss.item())
        lrs.append(scheduler.get_last_lr()[0])
        if batch_count == 10:
            batch_count = 0

/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
        warnings.warn(
      
EPOCH 0 started==============================
      
Training:   0%|          | 0/231657 [00:00<?, ?it/s]
Training: 100%|██████████| 231657/231657 [32:29<00:00, 118.83it/s, loss=3.1, lr=2.31e-7] 
      
EPOCH 1 started==============================
      
Training: 100%|██████████| 231657/231657 [32:34<00:00, 118.55it/s, loss=2.19, lr=4.62e-7]
      
EPOCH 2 started==============================
      
Training: 100%|██████████| 231657/231657 [32:36<00:00, 118.42it/s, loss=2.42, lr=6.93e-7]
      
EPOCH 3 started==============================
      
Training: 100%|██████████| 231657/231657 [32:23<00:00, 119.18it/s, loss=2.16, lr=9.25e-7]
      
EPOCH 4 started==============================
      
Training: 100%|██████████| 231657/231657 [32:22<00:00, 119.25it/s, loss=2.1, lr=1.16e-6] 
      
import numpy as np
import matplotlib.pyplot as plt

losses_np = np.array(losses)
lrs_np = np.array(lrs)

plt.figure(figsize=(12,6))
plt.plot(losses_np, label='loss')
plt.plot(lrs_np, label='learning rate')
plt.yscale('log')
plt.legend()
plt.show()

image gpt2 1

Inference

Let's see how well the model makes jokes.

sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)

print(f"decoded joke: \n{decoded_output_joke}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      /home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/generation/utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
        warnings.warn(
      
decoded joke: 
      JOKE:!!!!!!!!!!!!!!!!!
      

You can see that if we pass it a sequence with the word JOKE it returns a joke. But if we pass it another kind of sequence, it does not:

sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)

print(f"decoded joke: \n{decoded_output_joke}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
      
decoded joke: 
      My dog is cute and!!!!!!!!!!!!!!!
      

Fine tuning GPT-2 for sentence classification

Now we are going to do a training with the Hugging Face libraries.

Dataset

Let's use the imdb dataset to classify statements into positive and negative ones.

from datasets import load_dataset

dataset = load_dataset("imdb")
dataset

DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})

Let's take a look at it

dataset["train"].info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='imdb', config_name='plain_text', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=33435948, num_examples=25000, shard_lengths=None, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32653810, num_examples=25000, shard_lengths=None, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67113044, num_examples=50000, shard_lengths=None, dataset_name='imdb')}, download_checksums={'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/train-00000-of-00001.parquet': {'num_bytes': 20979968, 'checksum': None}, 'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/test-00000-of-00001.parquet': {'num_bytes': 20470363, 'checksum': None}, 'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/unsupervised-00000-of-00001.parquet': {'num_bytes': 41996509, 'checksum': None}}, download_size=83446840, post_processing_size=None, dataset_size=133202802, size_in_bytes=216649642)

Let's take a look at the features of this dataset

dataset["train"].info.features

{'text': Value(dtype='string', id=None),
'label': ClassLabel(names=['neg', 'pos'], id=None)}

The dataset contains texts and labels, and there are two classes: pos and neg. Let's create a variable with the number of classes.

num_clases = len(dataset["train"].unique("label"))
num_clases

2

Tokenizer

We create the tokenizer

from transformers import GPT2Tokenizer

checkpoints = "openai-community/gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoints, bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')
tokenizer.pad_token = tokenizer.eos_token

Now that we have a tokenizer we can tokenize the dataset, since the model only understands tokens.

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Model

We instantiate the model

from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained(checkpoints, num_labels=num_clases).half()
model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Evaluation

We create an evaluation metric

import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Trainer

We create the trainer

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

Training

We train

trainer.train()

[4689/4689 1:27:50, Epoch 3/3]

Step | Training Loss
-----|--------------
500  | 0.379400
1000 | 0.000000
1500 | 0.000000
2000 | 0.000000
2500 | 0.000000
3000 | 0.000000
3500 | 0.000000
4000 | 0.000000
4500 | 0.000000

Out[25]:
TrainOutput(global_step=4689, training_loss=0.04045845954294626, metrics={'train_runtime': 5271.3532, 'train_samples_per_second': 14.228, 'train_steps_per_second': 0.89, 'total_flos': 3.91945125888e+16, 'train_loss': 0.04045845954294626, 'epoch': 3.0})

Inference

We test the model after training

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def get_sentiment(sentence):
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    return "positive" if prediction == 1 else "negative"

sentence = "I hate this movie!"
print(get_sentiment(sentence))

negative
