GPT1 - Improving Language Understanding by Generative Pre-Training
Paper
This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.
Improving Language Understanding by Generative Pre-Training is the GPT1 paper. Before reading the post you need some context: before GPT, language models were based on recurrent neural networks (RNNs), which worked relatively well for specific tasks, but whose pre-training could not be reused to fine-tune them for other tasks. In addition, they did not have much memory, so when you fed them very long sentences they did not remember the beginning of the sentence very well.
Architecture
Before we talk about the architecture of GPT1, let's remember what the architecture of the Transformers was like.
GPT1 is a model based on transformer decoders, so, since there is no encoder, the architecture of a single decoder is as follows
The attention mechanism between the encoder and the decoder (cross-attention) is removed.
In the GPT1 paper they propose the following architecture
This corresponds to the decoder of a transformer, as we have seen before, repeated 12 times.
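If you want to check the number of layers yourself, a quick sketch (assuming the `openai-community/openai-gpt` checkpoint from Hugging Face, which is the one used later in this post) is to inspect the model configuration:

```
# Quick check of the GPT1 hyperparameters from its configuration (sketch)
from transformers import AutoConfig

config = AutoConfig.from_pretrained("openai-community/openai-gpt")
print(config.n_layer, config.n_head, config.n_embd)  # expected: 12 layers, 12 heads, hidden size 768
```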
Paper abstract
The most interesting ideas in the paper are:
- The model is first trained on a large corpus of unlabeled text. This is used to create a high-capacity language model.
- Fine-tuning is then performed on supervised NLP tasks with labeled datasets. In addition, when the model is trained on the supervised task, it is not trained only on that task's objective, but also on how well it predicts the next token; this auxiliary language modeling objective helps to improve the generalization of the supervised model and makes it converge faster (see the sketch after this list).
- Although we have already mentioned it, the paper uses the transformer architecture, since up to that time RNNs were used for language models. This made what was learned in the first training (the training on the unlabeled text corpus) easier to transfer to supervised tasks. That is, thanks to the use of transformers, it was possible to train on a whole corpus of text and then fine-tune on supervised tasks.
- They evaluated the model in four types of language comprehension tasks:
- Natural language inference
- Question answering
- Semantic similarity
- Text classification
- The general model (the one pre-trained on the entire unlabeled text corpus) outperforms discriminatively trained RNN models that employ architectures designed specifically for each task, significantly improving the state of the art in 9 of the 12 tasks studied. They also analyzed the "zero-shot" behavior of the pre-trained model in four different settings and showed that it acquires useful linguistic knowledge for downstream tasks.
- In recent years, researchers had demonstrated the benefits of using embeddings, which are trained on unlabeled corpora, to improve performance on a variety of tasks. However, these approaches primarily transfer information at the word level, whereas the use of transformers trained on large unsupervised text corpora captures higher-level, sentence-level semantics.
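Regarding the auxiliary language modeling objective mentioned in the second point, the paper fine-tunes with the supervised task loss plus a weighted language modeling loss on the same data, L3(C) = L2(C) + λ·L1(C), with λ = 0.5. A minimal sketch of the idea (not the actual training code of the paper):

```
import torch

# Sketch of the GPT1 fine-tuning objective: supervised task loss plus a weighted
# auxiliary language modeling loss (the paper uses lambda = 0.5)
def fine_tuning_loss(task_loss: torch.Tensor, lm_loss: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    return task_loss + lam * lm_loss
```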
Text generation
Let's see how to generate text with a pre-trained GPT1
First you have to install ftfy and spacy via

```
pip install ftfy spacy
```

Once installed, you must download the spacy language model you wish to use. For example, to download the English model, you can run:

```
python -m spacy download en_core_web_sm
```
To generate text we will use the model from the GPT1 repository of Hugging Face.
We import the libraries
```
import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, AutoTokenizer
```
If you notice, we have imported `OpenAIGPTTokenizer` and `AutoTokenizer`. This is because the model card of GPT1 says to use `OpenAIGPTTokenizer`, but in the post on the transformers library we explain that you should use `AutoTokenizer` to load the tokenizer. So let's try both.
```
checkpoints = "openai-community/openai-gpt"

tokenizer = OpenAIGPTTokenizer.from_pretrained(checkpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(checkpoints)

input_tokens = tokenizer("Hello, my dog is cute and", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, my dog is cute and", return_tensors="pt")

print(f"input tokens: {input_tokens}")
print(f"input auto tokens: {input_auto_tokens}")
```
input tokens: {'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
input auto tokens: {'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
As you can see, with the two tokenizers you get the same tokens. So, to make the code more general, so that if you change the checkpoint you do not have to change the code, let's use `AutoTokenizer`.
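If you are curious about what was actually loaded, you can print the classes of both tokenizers. Note that `AutoTokenizer` usually resolves to the fast variant of the tokenizer when one is available, so the exact class name below is an assumption about your installation:

```
# Check which tokenizer classes were loaded (sketch)
print(type(tokenizer).__name__)       # OpenAIGPTTokenizer
print(type(auto_tokenizer).__name__)  # usually OpenAIGPTTokenizerFast (may be OpenAIGPTTokenizer)
```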
We then create the device, the tokenizer and the model
```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(checkpoints).to(device)
```
As we have instantiated the model let's see how many parameters it has
```
params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")
```
Number of parameters: 117M
In an era of models with billions of parameters, we can see that GPT1 had only 117 million parameters.
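As a rough sanity check (not in the original post), that number can be reconstructed from the standard GPT1 dimensions, which you can verify in `model.config`: a vocabulary of 40478 tokens, 512 positions, a hidden size of 768, 12 layers and a feed-forward size of 3072:

```
# Back-of-the-envelope parameter count for GPT1 (assumed standard dimensions)
vocab_size, n_positions, n_embd, n_layer, d_ff = 40478, 512, 768, 12, 3072

embeddings = vocab_size * n_embd + n_positions * n_embd   # token + position embeddings
per_layer = (
    n_embd * 3 * n_embd + 3 * n_embd   # attention q, k, v projections (+ bias)
    + n_embd * n_embd + n_embd         # attention output projection (+ bias)
    + n_embd * d_ff + d_ff             # feed-forward expansion (+ bias)
    + d_ff * n_embd + n_embd           # feed-forward contraction (+ bias)
    + 4 * n_embd                       # two layer norms (weight + bias)
)
total = embeddings + n_layer * per_layer
print(f"approx. {round(total/1e6)}M parameters")  # ~117M
```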
We create the input tokens for the model
```
input_sentence = "Hello, my dog is cute and"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)

input_tokens
```
{'input_ids': tensor([[3570, 240, 547, 2585, 544, 4957, 488]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
We pass them to the model to generate the output tokens.
```
output_tokens = model.generate(**input_tokens)

print(f"output tokens: \n{output_tokens}")
```
We decode the tokens to obtain the output sentence
```
output_tokens = model.generate(**input_tokens)
print(f"output tokens: {output_tokens}")

decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(f"decoded output: {decoded_output}")
```
output tokens:
tensor([[ 3570,  240,  547, 2585,  544, 4957,  488,  249,  719,  797,
          485,  921,  575,  562,  246, 1671,  239,  244, 40477,  244]], device='cuda:0')
decoded output: hello, my dog is cute and i'm going to take him for a walk. ""
We have now managed to generate text with GPT1.
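By default `generate` produces a short output (the sequence above has 20 tokens in total). If you want a longer continuation you can, for example, pass the standard `max_new_tokens` argument of `generate`; a quick sketch:

```
# Longer generation (sketch): max_new_tokens controls how many new tokens are generated
output_tokens = model.generate(**input_tokens, max_new_tokens=50)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```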
Generate text token by token
Greedy search
We have used `model.generate` to generate the output tokens all at once, but let's see how to generate them one by one. To do this, instead of using `model.generate` we are going to use `model`, which actually calls the `model.forward` method.
```
outputs = model(**input_tokens)

outputs
```
CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],...,[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)
We see that it returns a lot of data, so first let's look at the keys of the output
outputs.keys()
odict_keys(['logits'])
In this case we only have the logits of the model, let's see their size
```
logits = outputs.logits

logits.shape
```
torch.Size([1, 7, 40478])
Let's see how many tokens we had at the input.
input_tokens.input_ids.shape
torch.Size([1, 7])
Wow, at the output we have the same number of sets of logits as tokens at the input. This is normal: the model produces one vector of logits per input position, and to predict the next token we only need the logits of the last position.
We obtain the logits of the last position of the output
```
next_token_logits = logits[0, -1]

next_token_logits.shape
```
torch.Size([40478])
There are a total of 40478 logits, i.e. the vocabulary has 40478 tokens, and we have to see which token has the highest probability. To do this, we first calculate the softmax
```
softmax_logits = torch.softmax(next_token_logits, dim=0)

softmax_logits.shape
```
torch.Size([40478])
```
next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)

next_token_prob, next_token_id
```
(tensor(0.1898, device='cuda:0', grad_fn=<MaxBackward0>),tensor(249, device='cuda:0'))
We have obtained the next token; now we decode it
tokenizer.decode(next_token_id.item())
'i'
We have obtained the next token using the greedy method, i.e. the token with the highest probability. But we already saw in the post on the transformers library the other ways of generating text, such as sampling, top-k, top-p, etc.
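For example, a minimal sketch (not in the original post) of sampling the next token from the softmax distribution with a temperature, instead of always taking the most probable one:

```
# Sample the next token instead of taking the argmax (sketch)
temperature = 0.8
probs = torch.softmax(next_token_logits / temperature, dim=0)
sampled_token_id = torch.multinomial(probs, num_samples=1)[0]
print(tokenizer.decode(sampled_token_id.item()))
```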
Let's put everything into a function and see what comes out if we generate a few tokens
```
def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    # Tokenize the input sentence and get the logits of the last position
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    next_token_logits = logits[0, -1]
    # Softmax and greedy selection of the most probable token
    softmax_logits = torch.softmax(next_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id

def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text
```
Now we generate text
```
generate_greedy_text("Hello, my dog is cute and", tokenizer, model, device)
```
'Hello, my dog is cute andi." '
The output is quite repetitive, as we already saw in the post on the ways of generating text.
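One way to mitigate this repetition (a sketch, using standard `model.generate` arguments such as `do_sample`, `top_k`, `top_p` and `temperature`) is to sample instead of decoding greedily:

```
# Sampling-based generation (sketch): these are standard arguments of model.generate
sampled_tokens = model.generate(
    **input_tokens,
    do_sample=True,    # sample from the distribution instead of taking the argmax
    top_k=50,          # keep only the 50 most probable tokens
    top_p=0.95,        # nucleus sampling
    temperature=0.8,   # soften the distribution
    max_new_tokens=20,
)
print(tokenizer.decode(sampled_tokens[0], skip_special_tokens=True))
```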
Fine tuning GPT
Loss calculation
Before we start fine-tuning GPT1, let's look at one thing. Before, when we got the output of the model, we did this
```
outputs = model(**input_tokens)

outputs
```
CausalLMOutput(loss=None, logits=tensor([[[ -5.9486, -5.8697, -18.4258, ..., -9.7371, -10.4495, 0.8814],[ -6.1212, -4.8031, -14.3970, ..., -6.5411, -9.5051, -1.2015],[ -7.4231, -6.3615, -14.7297, ..., -10.4575, -8.4600, -1.5183],...,[ -5.4751, -5.8803, -13.7767, ..., -10.5048, -12.4167, -6.1584],[ -7.2052, -6.0198, -21.5040, ..., -16.2941, -14.0494, -1.2416],[ -7.7240, -7.3631, -17.3174, ..., -12.1546, -12.3327, -1.7169]]],device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)
You can see that we get `loss=None`.
print(outputs.loss)
None
As we are going to need the loss to do the fine tuning, let's see how to obtain it.
If we go to the documentation of the `forward` method of `OpenAIGPTLMHeadModel`, we can see that it says that it returns an object of type `transformers.modeling_outputs.CausalLMOutput`, and if we go to the documentation of `transformers.modeling_outputs.CausalLMOutput`, we can see that it returns `loss` if `labels` is passed to the `forward` method.
If we go to the source code of the forward method, we see this code block

```
loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
In other words, the `loss` is calculated as follows
* Shift of logits and labels: The first part shifts the logits (`lm_logits`) and the labels (`labels`) so that `tokens < n` predict `n`, i.e. the token at position `n` is predicted from the tokens that come before it.
* CrossEntropyLoss: An instance of the `CrossEntropyLoss()` function is created.
* Flatten tokens: Logits and labels are then flattened using `view(-1, shift_logits.size(-1))` and `view(-1)`, respectively. This is done so that logits and labels have compatible shapes for the loss function: the logits become a matrix with one row per token and one column per vocabulary entry, and the labels become a vector of token ids.
* Loss calculation: Finally, the loss is calculated using the `CrossEntropyLoss()` function with the flattened logits and flattened labels as inputs.
In summary, `loss` is calculated as the cross-entropy loss between shifted and flattened logits and shifted and flattened labels.
Therefore, if we pass the labels to the `forward` method, it will return the `loss`.
```
outputs = model(**input_tokens, labels=input_tokens.input_ids)

outputs.loss
```
tensor(4.2607, device='cuda:0', grad_fn=<NllLossBackward0>)
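To convince ourselves, a minimal sketch (not in the original post) that reproduces that value by applying the shift and the cross-entropy by hand:

```
import torch.nn.functional as F

# Reproduce the loss by hand: shift logits and labels and apply cross-entropy
logits = outputs.logits                 # [1, seq_len, vocab_size]
labels = input_tokens.input_ids         # [1, seq_len]
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
manual_loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
print(manual_loss)                      # should match outputs.loss (~4.26)
```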
Dataset
For the training we are going to use the short-jokes-dataset, a dataset with 231 thousand short jokes in English.
We download the dataset
```
from datasets import load_dataset

jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes
```
DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})
Let's take a look at it
jokes["train"][0]
{'ID': 1,'Joke': '[me narrating a documentary about narrators] "I can't hear what they're saying cuz I'm talking"'}
Pytorch training
First let's see how the pure Pytorch training would be done.
Restart the notebook to avoid problems with the GPU memory.
```
import torch
from transformers import OpenAIGPTLMHeadModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoints = "openai-community/openai-gpt"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(checkpoints)
model = model.to(device)
```
Pytorch dataset
We create a Pytorch dataset class
```
from torch.utils.data import Dataset

class JokesDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset["train"])

    def __getitem__(self, item):
        # Prepend the "JOKE: " prompt, append the end-of-text token and tokenize
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens
```
We instantiate it
```
dataset = JokesDataset(jokes, tokenizer=tokenizer)
```
Here is an example
```
sentence, tokens = dataset[5]
print(sentence)
tokens.input_ids.shape, tokens.attention_mask.shape
```
Dataloader
We now create a Pytorch dataloader. We use a batch size of 1 because each joke has a different length and we are not padding the sequences; instead, the training loop below will pack several jokes into a single sequence of up to MAX_SEQ_LEN tokens.
```
from torch.utils.data import DataLoader

BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)
```
We see a batch
```
sentences, tokens = next(iter(joke_dataloader))
len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape
```
JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>
(1, torch.Size([1, 1, 29]), torch.Size([1, 1, 29]))
Training
The training loop below packs as many jokes as fit into a sequence of at most MAX_SEQ_LEN tokens, accumulates the gradients of BATCH_SIZE of these sequences, and only then takes an optimizer step.
```
from transformers import AdamW, get_linear_schedule_with_warmup
import tqdm

BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)

proc_seq_count = 0
batch_count = 0
tmp_jokes_tens = None

for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")
    for sample in progress_bar:
        sentence, tokens = sample

        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)

        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue

        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke
            # as the start for next sequence
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:, 1:]], dim=1)
                continue

        ################## Sequence ready, process it through the model ##################
        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()

        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            model.zero_grad()

        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})
        if batch_count == 10:
            batch_count = 0
```
Inference
Let's see how well the model makes jokes.
```
sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")
```
decoded joke: joke : what do you call a group of people who are not afraid of the dark? a group
You can see that if you pass it a sequence starting with the word `JOKE:`, it returns a joke. But if you pass it another sequence, it does not
```
sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)
print(f"decoded joke: {decoded_output_joke}")
```
decoded joke: my dog is cute and i'm not sure if i should be offended or not. "