GPT-2 - Language Models are Unsupervised Multitask Learners
Paper
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
Language Models are Unsupervised Multitask Learners is the GPT-2 paper. GPT-2 is the second version of the model, the successor of GPT-1, which we have already seen.
Architecture
Before we talk about the GPT-2 architecture, let's remember what the GPT-1 architecture was like.
GPT-2 uses a transformer-based architecture, the same as GPT-1, with the following sizes
Parameters | Layers | d_model |
---|---|---|
117M | 12 | 768 |
345M | 24 | 1024 |
762M | 36 | 1280 |
1542M | 48 | 1600 |
The smallest model is equivalent to the original GPT, and the second smallest is equivalent to the largest BERT model. The largest model has more than an order of magnitude more parameters than GPT-1.
In addition, the following architectural modifications were made
- Layer normalization is moved to the input of each block, before the attention. This helps stabilize training and improves the model's ability to learn deeper representations: by normalizing the inputs to each block, the variability of the activations is reduced and training becomes easier.
- An additional layer normalization has been added after the final self-attention block. This helps reduce variability in the model outputs and improves stability.
- In most models, the layer weights are initialized randomly, following a normal or uniform distribution. In GPT-2, however, the authors used a modified initialization that takes the depth of the model into account. The idea is that the residual path accumulates contributions as the model gets deeper: each block adds its output to the residual stream, so with many layers the activations can grow and destabilize training. To counteract this, the weights of the residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers, so the deeper the model, the smaller those initial weights. This initialization trick helps stabilize training and improves the model's ability to learn deep representations (see the sketch after this list).
- The vocabulary size has been expanded to 50,257. This means that the model can learn to represent a larger set of words and tokens.
- The context size has been increased from 512 to 1024 tokens. This allows the model to take into account a larger context when generating text.
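To make the scaled initialization of the third point concrete, here is a minimal sketch, not the original implementation: it assumes the usual GPT-2 values (standard deviation 0.02, `n_layers` set to the XL depth) and only shows how the weights of a residual projection could be scaled by 1/√N.

```
import math
import torch

n_layers = 48   # number of residual layers (GPT-2 XL); assumption for the example
d_model = 1600

# Standard initialization for a projection layer that feeds the residual stream
proj = torch.nn.Linear(d_model, d_model)
torch.nn.init.normal_(proj.weight, mean=0.0, std=0.02)

# GPT-2 style trick: scale the residual-layer weights by 1/sqrt(N) at initialization
with torch.no_grad():
    proj.weight *= 1.0 / math.sqrt(n_layers)
```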
![GPT1 vs GPT-2 architecture](https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/GPT1_vs_GPT2_architecture.webp)
Paper abstract
The most interesting ideas in the paper are:
- For pre-training the model they wanted a diverse and almost unlimited source of text, such as a web scrape like Common Crawl. However, they found that much of that text was of very poor quality, so they built the WebText dataset, which also comes from web scraping but with a quality filter, for example keeping only outbound links shared on Reddit with enough karma. They also removed text coming from Wikipedia, since it could be repeated in other pages.
- They used a BPE tokenizer that we explained in a previous post.
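As a quick hedged illustration of the BPE tokenizer, the snippet below loads the GPT-2 tokenizer and splits a couple of words; the exact subword splits depend on the learned merges, so no specific output is assumed.

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Frequent words tend to stay whole, rare words are split into subword units
print(tokenizer.tokenize("Hello"))
print(tokenizer.tokenize("unbelievably"))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello")))
```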
Text generation
Let's see how to generate text with a pre-trained GPT-2
To generate text we will use the model from the GPT-2 repository of Hugging Face.
Text generation with pipeline
With this model we can now use the transformer pipeline.
from transformers import pipeline
checkpoints = "openai-community/gpt2-xl"
generator = pipeline('text-generation', model=checkpoints)
output = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
for i, o in enumerate(output):
    print(f"Output {i+1}: {o['generated_text']}")
Text generation with automodel
But if we want to use `AutoModel`, we can do the following.
import torch
from transformers import GPT2Tokenizer, AutoTokenizer

checkpoints = "openai-community/gpt2-xl"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(checkpoints)
As with GPT-1, we can import `GPT2Tokenizer` and `AutoTokenizer`. This is because the GPT-2 model card says to use `GPT2Tokenizer`, but in the post on the transformers library we explained that you should use `AutoTokenizer` to load the tokenizer. So let's try both.
input_tokens = tokenizer("Hello, I'm a language model,", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, I'm a language model,", return_tensors="pt")

print(f"input tokens: {input_tokens}")
print(f"input auto tokens: {input_auto_tokens}")
input tokens:
{'input_ids': tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
input auto tokens:
{'input_ids': tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
As you can see, both tokenizers produce the same tokens. So, to make the code more general, so that you don't have to change it if you change the checkpoint, let's use `AutoTokenizer`.
We then create the device, the tokenizer and the model
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoints = "openai-community/gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = GPT2LMHeadModel.from_pretrained(checkpoints).to(device)
Since we have instantiated the model, let's see how many parameters it has
params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")
Number of parameters: 1558M
As we can see we have loaded the 1.5B parameter model, but if we wanted to load the other models we would have to do the following
checkpoints_small = "openai-community/gpt2"
model_small = GPT2LMHeadModel.from_pretrained(checkpoints_small)
print(f"Number of parameters of small model: {round(sum(p.numel() for p in model_small.parameters())/1e6)}M")

checkpoints_medium = "openai-community/gpt2-medium"
model_medium = GPT2LMHeadModel.from_pretrained(checkpoints_medium)
print(f"Number of parameters of medium model: {round(sum(p.numel() for p in model_medium.parameters())/1e6)}M")

checkpoints_large = "openai-community/gpt2-large"
model_large = GPT2LMHeadModel.from_pretrained(checkpoints_large)
print(f"Number of parameters of large model: {round(sum(p.numel() for p in model_large.parameters())/1e6)}M")

checkpoints_xl = "openai-community/gpt2-xl"
model_xl = GPT2LMHeadModel.from_pretrained(checkpoints_xl)
print(f"Number of parameters of xl model: {round(sum(p.numel() for p in model_xl.parameters())/1e6)}M")
Number of parameters of small model: 124M
Number of parameters of medium model: 355M
Number of parameters of large model: 774M
Number of parameters of xl model: 1558M
We create the input tokens for the model
input_sentence = "Hello, I'm a language model,"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
input_tokens
{'input_ids': tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
We pass them to the model to generate the output tokens.
output_tokens = model.generate(**input_tokens)
print(f"output tokens: \n{output_tokens}")
We decode the tokens to obtain the output sentence
output_tokens = model.generate(**input_tokens)

decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(f"decoded output: {decoded_output}")
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/transformers/generation/utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
decoded output: Hello, I'm a language model, and I'm going to help you with your problem.
We have already succeeded in generating text with GPT-2
Generate text token to token
Greedy search
We have used `model.generate` to generate the output tokens all at once, but let's see how to generate them one by one. To do this, instead of using `model.generate` we are going to call `model` directly, which actually invokes the `model.forward` method.
outputs = model(**input_tokens)
outputs
CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[ 6.6288,  5.1421, -0.8002,  ..., -6.3998, -4.4113,  1.8240],
         [ 2.7250,  1.9371, -1.2293,  ..., -5.0979, -5.1617,  2.2694],
         [ 2.6891,  4.3089, -1.6074,  ..., -7.6321, -2.0448,  0.4042],
         ...,
         [ 6.0513,  3.8020, -2.8080,  ..., -6.7754, -8.3176,  1.1541],
         [ 6.8402,  5.6952,  0.2002,  ..., -9.1281, -6.7818,  2.7576],
         [ 1.0255, -0.2201, -2.5484,  ..., -6.2137, -7.2322,  0.1665]]],
       device='cuda:0', grad_fn=<UnsafeViewBackward0>), past_key_values=(...), hidden_states=None, attentions=None, cross_attentions=None)
We see that it returns a lot of data. First, let's look at the output keys.
outputs.keys()
odict_keys(['logits', 'past_key_values'])
In this case we only have the logits of the model, let's see their size
logits = outputs.logits
logits.shape
torch.Size([1, 8, 50257])
Let's see how many tokens we had at the input.
input_tokens.input_ids.shape
torch.Size([1, 8])
As you can see, at the output we have the same number of logit vectors as input tokens. This is normal: for every input position, the model produces the logits of the token it predicts to come next.
We take the logits of the last position of the output.
nex_token_logits = logits[0,-1]
nex_token_logits.shape
torch.Size([50257])
There are a total of 50257 logits; that is, there is a vocabulary of 50257 tokens and we have to see which token has the highest probability. To do this, we first calculate the softmax.
softmax_logits = torch.softmax(nex_token_logits, dim=0)
softmax_logits.shape
torch.Size([50257])
Once we have calculated the softmax, we obtain the most probable token, i.e. the one with the highest value after the softmax.
next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
next_token_prob, next_token_id
(tensor(0.1732, device='cuda:0', grad_fn=<MaxBackward0>), tensor(290, device='cuda:0'))
We have obtained the next token, now we decode it.
tokenizer.decode(next_token_id.item())
' and'
We have obtained the next token using the greedy method, that is, the token with the highest probability. But we already saw, in the post on the transformers library, the other ways of generating text: `sampling`, `top-k`, `top-p`, etc.
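For example, here is a minimal sketch of choosing the next token with `top-k` sampling instead of taking the argmax. It reuses the `softmax_logits` and `tokenizer` defined above; the value of `k` is an arbitrary choice for the example.

```
# Hedged sketch: top-k sampling over the already computed softmax_logits
k = 50
top_k_probs, top_k_ids = torch.topk(softmax_logits, k)       # keep the k most probable tokens
top_k_probs = top_k_probs / top_k_probs.sum()                 # renormalize so they sum to 1
sampled_idx = torch.multinomial(top_k_probs, num_samples=1)   # sample one of them
next_token_id = top_k_ids[sampled_idx]
print(tokenizer.decode(next_token_id.item()))
```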
Let's put everything into a function and see what comes out if we generate a few tokens
def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    nex_token_logits = logits[0,-1]
    softmax_logits = torch.softmax(nex_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id
def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text
Now we generate text
generate_greedy_text("Hello, I'm a language model,", tokenizer, model, device)
"Hello, I'm a language model, and I'm going to help you with your problem. I'm going to help you"
The output is rather repetitive, as we already saw in the post on ways to generate text. Still, it is a better output than the one we obtained with GPT-1.
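As a reference, here is a hedged sketch of how the repetition could be reduced with the sampling options of `model.generate`; the specific values of `top_k`, `top_p` and `max_new_tokens` are arbitrary choices for the example.

```
# Sketch: sampling instead of greedy decoding to reduce repetition
output_tokens = model.generate(
    **input_tokens,
    do_sample=True,      # sample from the distribution instead of taking the argmax
    top_k=50,            # restrict sampling to the 50 most probable tokens
    top_p=0.95,          # nucleus sampling: keep tokens up to 95% cumulative probability
    max_new_tokens=30,
)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```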
Architecture of the models available at Hugging Face
If we go to the Hugging Face documentation of GPT2 we can see that we have the options `GPT2Model`, `GPT2LMHeadModel`, `GPT2ForSequenceClassification`, `GPT2ForQuestionAnswering` and `GPT2ForTokenClassification`. Let's take a look at them.
import torch

ckeckpoints = "openai-community/gpt2"
GPT2Model
This is the base model, i.e. the transformer decoder.
from transformers import GPT2Model

model = GPT2Model.from_pretrained(ckeckpoints)
model
GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
As you can see, the output of this model is a hidden state of dimension 768, which is the embedding size of the small model. If we had used the `openai-community/gpt2-xl` model, we would have obtained an output of 1600.
Depending on the task you want to do, you would now have to add more layers on top of this base model.
We can add them by hand, but if we use the Hugging Face classes that already include these heads, the pre-trained weights are loaded wherever they exist (for example the `lm_head` of `GPT2LMHeadModel`), and only the task-specific heads without pre-trained weights are initialized randomly, as the warnings further down show.
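As an illustration, here is a minimal sketch of what adding a classification head by hand could look like. The class name, the choice of pooling over the last token and the checkpoint are assumptions for the example, not the Hugging Face implementation.

```
import torch
from transformers import GPT2Model

class GPT2WithManualHead(torch.nn.Module):
    def __init__(self, checkpoint="openai-community/gpt2", num_labels=5):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(checkpoint)    # pre-trained weights
        hidden_size = self.backbone.config.hidden_size           # 768 for the small model
        self.head = torch.nn.Linear(hidden_size, num_labels)     # randomly initialized head

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden_states[:, -1, :])                # classify from the last token
```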
GPT2LMHeadModel
It is the one we have used before to generate text
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(ckeckpoints)
model
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
As you can see it is the same model as before, only that at the end a linear layer has been added with an input of 768 (the embeddings) and an output of 50257, which corresponds to the size of the vocabulary.
GPT2ForSequenceClassification
This option is for classifying text sequences; in this case we have to specify the number of classes we want to classify with `num_labels`.
from transformers import GPT2ForSequenceClassification
model = GPT2ForSequenceClassification.from_pretrained(ckeckpoints, num_labels=5)
model
Now, instead of having an output of 50257, we have an output of 5, which is the number we have set in `num_labels` and is the number of classes we want to classify.
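A quick way to check this is to pass a sentence through the model and look at the shape of the logits. This is only a sketch, reusing the tokenizer loaded earlier; since the new head is untrained, the values themselves are meaningless.

```
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # torch.Size([1, 5]): one logit per class
```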
GPT2ForQuestionAnswering
In the transformers post we explained that in this mode you pass a context to the model and a question about the context and it returns the answer.
from transformers import GPT2ForQuestionAnswering
model = GPT2ForQuestionAnswering.from_pretrained(ckeckpoints)
model
We see that at the output we get a linear layer (`qa_outputs`) with two outputs per token, which correspond to the logits for the start and the end of the answer span.
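As an illustration, here is a hedged sketch of how those two outputs would be used to extract an answer span. The question and context are made up for the example, and since this head is newly initialized and untrained, the extracted answer will be essentially random.

```
question = "What color is the sky?"
context = "The sky is blue on a clear day."

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

start = outputs.start_logits.argmax(-1).item()   # most likely start position
end = outputs.end_logits.argmax(-1).item()       # most likely end position
answer = tokenizer.decode(inputs.input_ids[0][start:end + 1])
print(answer)
```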
GPT2ForTokenClassification
Also in the transformers post we explained what token classification was: it classifies which category each token belongs to. We have to pass the number of classes we want to classify with `num_labels`.
from transformers import GPT2ForTokenClassification
model = GPT2ForTokenClassification.from_pretrained(ckeckpoints, num_labels=5)
model
At the output we obtain the 5 classes that we have specified with `num_labels`.
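Again, we can check the shape of the output with a small sketch: we get one vector of 5 logits per input token (the head is untrained, so the predictions are random).

```
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # torch.Size([1, 8, 5]): (batch, tokens, num_labels)
```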
Fine tuning GPT-2
Fine tuning for text generation
First, let's see how training would be done in pure PyTorch.
Loss calculation
Before we start fine-tuning GPT-2, let's look at one thing. Earlier, when we got the output of the model, we did this:
outputs = model(**input_tokens)
outputs
CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[ 6.6288,  5.1421, -0.8002,  ..., -6.3998, -4.4113,  1.8240],
         ...,
         [ 1.0255, -0.2201, -2.5484,  ..., -6.2137, -7.2322,  0.1665]]],
       device='cuda:0', grad_fn=<UnsafeViewBackward0>), past_key_values=(...), hidden_states=None, attentions=None, cross_attentions=None)
You can see that we get `loss=None`.
print(outputs.loss)
None
As we are going to need the loss to do the fine tuning, let's see how to obtain it.
If we go to the documentation of the `forward` method of `GPT2LMHeadModel`, we can see that it returns an object of type `transformers.modeling_outputs.CausalLMOutputWithCrossAttentions`, and if we go to the documentation of `CausalLMOutputWithCrossAttentions`, we can see that it returns `loss` only when `labels` is passed to the `forward` method.
If we go to the source code of the forward method, we see this code block
```
loss = None
if labels is not None:
    # move labels to correct device to enable model parallelism
    labels = labels.to(lm_logits.device)
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
In other words, the `loss` is calculated as follows
* Shift of logits and labels: first the logits (`lm_logits`) and labels (`labels`) are shifted so that tokens `< n` predict token `n`, i.e., the prediction at each position is compared with the token that actually comes next.
* CrossEntropyLoss: An instance of the `CrossEntropyLoss()` function is created.
* Flatten tokens: Logits and labels are then flattened using `view(-1, shift_logits.size(-1))` and `view(-1)`, respectively. This is done so that the logits and labels have the same shape for the loss function.
* Loss calculation: Finally, the loss is calculated using the `CrossEntropyLoss()` function with the flattened logits and flattened labels as inputs.
In summary, `loss` is calculated as the cross-entropy loss between shifted and flattened logits and shifted and flattened labels.
Therefore, if we pass the labels to the `forward` method, it will return the `loss`.
outputs = model(**input_tokens, labels=input_tokens.input_ids)
outputs.loss
tensor(3.8028, device='cuda:0', grad_fn=<NllLossBackward0>)
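To check our understanding, here is a small sketch that reproduces this calculation by hand from the logits returned above; it should give the same value as `outputs.loss`, up to numerical precision.

```
import torch.nn.functional as F

logits = outputs.logits                      # (1, seq_len, vocab_size)
labels = input_tokens.input_ids              # (1, seq_len)

shift_logits = logits[..., :-1, :].contiguous()   # predictions for positions 0..n-2
shift_labels = labels[..., 1:].contiguous()       # targets are the tokens at 1..n-1

manual_loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
)
print(manual_loss)   # should match outputs.loss
```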
Dataset
For the training we are going to use the short-jokes-dataset, a dataset of 231 thousand short English jokes.
Restart the notebook to avoid problems with the GPU memory.
Download the dataset
from datasets import load_dataset

jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes
DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})
Let's take a look at it
jokes["train"][0]
{'ID': 1,
 'Joke': '[me narrating a documentary about narrators] "I can\'t hear what they\'re saying cuz I\'m talking"'}
Model instance
In order to use the `xl` model, i.e. the one with 1.5B parameters, I convert it to FP16 to avoid running out of memory.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ckeckpoints = "openai-community/gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = GPT2LMHeadModel.from_pretrained(ckeckpoints)
model = model.half().to(device)
Pytorch dataset
We create a Pytorch dataset class
from torch.utils.data import Dataset

class JokesDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset["train"])

    def __getitem__(self, item):
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens
We instantiate it
dataset = JokesDataset(jokes, tokenizer=tokenizer)
Here is an example
sentence, tokens = dataset[5]
print(sentence)
tokens.input_ids.shape, tokens.attention_mask.shape
Dataloader
We now create a Pytorch dataloader
from torch.utils.data import DataLoader

BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)
We see a batch
sentences, tokens = next(iter(joke_dataloader))
len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape
JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>
(1, torch.Size([1, 1, 36]), torch.Size([1, 1, 36]))
Training
from transformers import AdamW, get_linear_schedule_with_warmup
import tqdm

BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 3e-6
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)

proc_seq_count = 0
batch_count = 0
tmp_jokes_tens = None

losses = []
lrs = []

for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")
    for sample in progress_bar:
        sentence, tokens = sample

        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)

        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue

        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke
            # as the start for next sequence
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:, 1:]], dim=1)
                continue
        ################## Sequence ready, process it through the model ##################

        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()

        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            model.zero_grad()

        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})
        losses.append(loss.item())
        lrs.append(scheduler.get_last_lr()[0])

        if batch_count == 10:
            batch_count = 0

import numpy as np
import matplotlib.pyplot as plt

losses_np = np.array(losses)
lrs_np = np.array(lrs)

plt.figure(figsize=(12,6))
plt.plot(losses_np, label='loss')
plt.plot(lrs_np, label='learning rate')
plt.yscale('log')
plt.legend()
plt.show()
Inference
Let's see how well the model makes jokes.
sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: \n{decoded_output_joke}")
You can see that if you pass it a sequence starting with the word `JOKE`, it returns a joke. But if you pass it another kind of sequence, it does not.
sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: \n{decoded_output_joke}")
Fine tuning GPT-2 for sentence classification
Now we are going to do a training with the Hugging Face libraries.
Dataset
Let's use the `imdb` dataset to classify reviews as positive or negative.
from datasets import load_dataset

dataset = load_dataset("imdb")
dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
Let's take a look at it
dataset["train"].info
DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='imdb', config_name='plain_text', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=33435948, num_examples=25000, shard_lengths=None, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32653810, num_examples=25000, shard_lengths=None, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67113044, num_examples=50000, shard_lengths=None, dataset_name='imdb')}, download_checksums={'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/train-00000-of-00001.parquet': {'num_bytes': 20979968, 'checksum': None}, 'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/test-00000-of-00001.parquet': {'num_bytes': 20470363, 'checksum': None}, 'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/unsupervised-00000-of-00001.parquet': {'num_bytes': 41996509, 'checksum': None}}, download_size=83446840, post_processing_size=None, dataset_size=133202802, size_in_bytes=216649642)
Let's take a look at the features of this dataset
dataset["train"].info.features
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}
The dataset contains texts and labels, and there are two classes, `pos` and `neg`. Let's create a variable with the number of classes.
num_clases = len(dataset["train"].unique("label"))
num_clases
2
Tokenizer
We create the tokenizer
from transformers import GPT2Tokenizer

checkpoints = "openai-community/gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoints, bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')
tokenizer.pad_token = tokenizer.eos_token
Now that we have a tokenizer we can tokenize the dataset, since the model only understands tokens.
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
Model
We instantiate the model
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained(checkpoints, num_labels=num_clases).half()
model.config.pad_token_id = model.config.eos_token_id
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluation
We create an evaluation metric
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
Trainer
We create the trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)
Training
We train
trainer.train()
Inference
We test the model after training.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def get_sentiment(sentence):
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    return "positive" if prediction == 1 else "negative"
sentence = "I hate this movie!"
print(get_sentiment(sentence))
negative