GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers


In the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, the authors present the need for a post-training quantization method that does not degrade the quality of the model. In this post we have already seen llm.int8(), which quantizes to INT8 the vectors of the weight matrices whose values do not exceed a threshold, which works well, but it does not quantize all the weights of the model. In this paper they propose a method that quantizes all the weights of the model to 4 and 3 bits without degrading its quality. This saves a considerable amount of memory, not only because all the weights are quantized, but also because it is done at 4 or 3 bits (and even 1 or 2 bits under certain conditions) instead of 8 bits.

This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.

Works on which it is based

Quantization by layers

On the one hand, they are based on the works Nagel et al., 2020; Wang et al., 2020; Hubara et al., 2021 and Frantar et al., 2022, which propose to quantize the weights of the layers of a neural network to 4 and 3 bits, without degrading the quality of the model.

Given a small set of m examples X drawn from a dataset, we feed them to each layer with full-precision weights W and obtain the layer output WX. We then look for new quantized weights Ŵ that minimize the squared error with respect to the full-precision layer output:

argmin_Ŵ ||WX - ŴX||₂²

The quantization grid for Ŵ is fixed before the process starts, and during the process each weight of Ŵ can change its value independently of the values of the other weights of Ŵ.
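
To make the objective concrete, here is a minimal sketch (my own illustration, not the GPTQ algorithm itself) that quantizes a weight matrix with simple round-to-nearest and measures the layer-wise reconstruction error; the shapes and the symmetric 4-bit scheme are assumptions for the example.

```python
import torch

def quantize_rtn(W: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of a weight matrix, one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    W_q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return W_q * scale  # dequantized weights Ŵ

# Toy layer: W is (out_features, in_features), X holds the calibration inputs
W = torch.randn(256, 512)
X = torch.randn(512, 128)

W_hat = quantize_rtn(W, bits=4)

# The layer-wise objective: ||W X - Ŵ X||²
error = torch.norm(W @ X - W_hat @ X) ** 2
print(f"Layer reconstruction error: {error.item():.2f}")
```

GPTQ and the works it builds on go beyond this naive round-to-nearest baseline by choosing Ŵ so that this reconstruction error is explicitly minimized.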

Optimal brain quantization (OBQ)

In the OBQ work (Frantar et al., 2022) they optimize the above layer-wise quantization process, making it up to 3 times faster. This helps with large models, since quantizing a large model can take a long time.

The OBQ method is an approach to solving the layer-wise quantization problem in language models. OBQ starts from the idea that the squared error can be decomposed into a sum of independent errors, one for each row of the weight matrix. The method then quantizes the weights of each row one at a time, always updating the not-yet-quantized weights to compensate for the error introduced by each quantization.

The method is capable of quantizing medium-sized models in reasonable times, but since it is an algorithm with cubic complexity it is extremely expensive to apply to models with billions of parameters.
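
For reference, the greedy selection and compensation rules that OBQ applies (as written in the GPTQ paper) are the following, where F is the set of weights not yet quantized and H_F = 2·X_F·X_Fᵀ is the Hessian of the layer-wise objective restricted to F:

w_q = argmin_{w_q} (quant(w_q) - w_q)² / [H_F⁻¹]_{qq}

δ_F = -((w_q - quant(w_q)) / [H_F⁻¹]_{qq}) · (H_F⁻¹)_{:,q}

That is, at each step OBQ picks the weight w_q whose quantization adds the least error and then updates all remaining weights in F by δ_F to compensate for it.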

GPTQ Algorithm

Step 1: Arbitrary order insight

In OBQ, the weights were quantized greedily, always picking next the one that added the least extra mean squared error. However, the authors realized that quantizing the weights in an arbitrary, fixed order barely increases the final mean squared error. So, instead of searching at every step for the weight that minimizes the error, which is what made the algorithm so expensive, all rows are quantized in the same fixed order. Thanks to this, the execution time of the quantization algorithm is greatly reduced.

Step 2: Lazy batch updates

In the original scheme the remaining weights are updated every time a single entry is quantized, which is slow and does not take full advantage of the hardware, since it has a very low compute-to-memory ratio. Therefore, they propose to apply the updates lazily, in blocks of B = 128 columns, and only propagate them to the rest of the matrix once a whole block has been quantized. This makes better use of the hardware and reduces the execution time of the algorithm.

Step 3: Cholesky reformulation

The problem with doing the updates in blocks is that, at the scale of these models, numerical errors can accumulate and affect the accuracy of the algorithm. Specifically, the inverse Hessian matrix can become indefinite, which causes the algorithm to update the remaining weights in the wrong directions, resulting in a very poor quantization.

To solve this, the authors of the paper propose to use a Cholesky reformulation, which is a more numerically stable method.
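
As a rough illustration of the idea (a sketch of my own, not the paper's exact implementation), the information needed for the weight updates can be precomputed from a Cholesky decomposition of the inverse Hessian, after adding a small dampening term to the diagonal to keep it positive definite; the matrix sizes and the 1% dampening factor are assumptions for the example.

```python
import torch

# Hessian of the layer-wise objective: H = 2 * X @ X.T,
# where X holds the calibration inputs of the layer (in_features x n_samples)
X = torch.randn(512, 128, dtype=torch.float64)
H = 2 * X @ X.T

# Dampening: add a small fraction of the mean diagonal so that H stays positive definite
damp = 0.01 * torch.mean(torch.diag(H))
H += damp * torch.eye(H.shape[0], dtype=torch.float64)

# Precompute the information needed for the weight updates from the Cholesky factor
# of the inverse Hessian, which is much more numerically stable than updating H^-1 directly
H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
H_inv_chol = torch.linalg.cholesky(H_inv, upper=True)
print(H_inv_chol.shape)  # upper-triangular (512, 512) matrix
```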

GPTQ Results

Below are two graphs with the perplexity on the WikiText2 dataset for all sizes of the OPT and BLOOM models. With the RTN quantization technique, the perplexity increases a lot at some sizes, while with GPTQ it remains similar to that obtained with the FP16 model.

GPTQ-figure1

Other plots are shown below, this time measuring accuracy on the LAMBADA dataset. The same thing happens: while GPTQ remains close to FP16, other quantization methods greatly degrade the quality of the model.

GPTQ-figure3

Extreme quantization

In the previous graphs we showed the results of quantizing the models to 3 and 4 bits, but we can also quantize them to 2 bits, and even to 1 bit.

By adjusting the group size used by the algorithm, that is, how many consecutive weights share the same quantization parameters, we can still obtain good results at these extreme bit widths, as the table below shows.

| Model | FP16 | 2-bit, g128 | 2-bit, g64 | 2-bit, g32 | 3-bit |
|----------|------|-------------|------------|------------|-------|
| OPT-175B | 8.34 | 9.58 | 9.18 | 8.94 | 8.68 |
| BLOOM | 8.11 | 9.55 | 9.17 | 8.83 | 8.64 |

In the table above you can see the perplexity on WikiText2 for the OPT-175B and BLOOM models quantized to 2 bits with different group sizes, together with full 3-bit quantization as a reference. The smaller the group size, the lower the perplexity, which means the quantized model is better. The drawback is that more quantization parameters have to be stored, one set per group, which eats into the memory savings, and the quantization takes somewhat longer to run.
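
To see what the group size means in practice, below is a minimal sketch (my own illustration, not the paper's code) of asymmetric quantization where each group of group_size consecutive weights shares one scale and one zero point; smaller groups follow the weights more closely, at the cost of storing more quantization parameters.

```python
import torch

def quantize_groups(w: torch.Tensor, bits: int = 2, group_size: int = 32) -> torch.Tensor:
    """Asymmetric quantization of a 1D weight vector in groups of `group_size`."""
    qmax = 2 ** bits - 1
    w_groups = w.reshape(-1, group_size)                    # one row per group
    w_min = w_groups.min(dim=1, keepdim=True).values
    w_max = w_groups.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax          # one scale per group
    zero = torch.round(-w_min / scale)                      # one zero point per group
    q = torch.clamp(torch.round(w_groups / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(-1)                 # dequantized weights

w = torch.randn(4096)
for g in (128, 64, 32):
    mse = torch.mean((w - quantize_groups(w, bits=2, group_size=g)) ** 2)
    print(f"group_size={g:3d}  MSE={mse.item():.5f}")
```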

Dynamic dequantization during inference

During inference, something called dynamic dequantization is performed: the weights of each layer are dequantized on the fly as the data passes through it.

To do this, they developed a kernel that dequantizes the matrices and performs the matrix products. Although the dequantization adds computation, the kernel has to access much less memory, which produces significant speedups.

Inference is performed in FP16: the weights are dequantized as the data goes through the layers, and the activations of each layer are also computed in FP16. Although this means doing extra work, because the weights have to be dequantized, the overall process is faster, because much less data has to be fetched from memory: the weights are stored with fewer bits, so for matrices with many parameters a lot of memory traffic is saved. Since the bottleneck is usually fetching the data from memory, even though more calculations have to be done, in the end the inference is faster.
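
A minimal sketch of the idea (not the actual CUDA kernel): the weights live in memory as low-bit integers plus their quantization parameters, and each matrix product dequantizes them on the fly just before multiplying. In the real kernel this happens in FP16 on the GPU; the example below uses float32 so it also runs on CPU.

```python
import torch

def dequant_matmul(q_weight: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor,
                   x: torch.Tensor) -> torch.Tensor:
    # q_weight: integer weights (out_features, in_features) stored with few bits in memory
    # scale, zero: per-row quantization parameters; x: activations (in_features, batch)
    w = (q_weight.float() - zero) * scale   # dequantize on the fly
    return w @ x                            # the matmul uses the dequantized weights

q_weight = torch.randint(0, 16, (256, 512))   # 4-bit integer values
scale = torch.rand(256, 1) * 0.01
zero = torch.full((256, 1), 8.0)
x = torch.randn(512, 4)
print(dequant_matmul(q_weight, scale, zero, x).shape)   # torch.Size([256, 4])
```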

Speed of inference

The authors of the paper ran a test quantizing the OPT-175B model to 3 bits, which occupied about 63 GB of VRAM, including the embeddings and the output layer, which are kept in FP16. In addition, keeping the keys and values for the full 2048-token context window consumes about another 9 GB, for a total of about 72 GB of VRAM. They quantized to 3 bits instead of 4 bits so that, for this experiment, the model would fit on a single Nvidia A100 GPU with 80 GB of VRAM.

For comparison, normal FP16 inference requires about 350 GB of VRAM memory, which is equivalent to 5 Nvidia A100 GPUs with 80 GB of VRAM memory. And 8-bit quantizing inference using llm.int8() requires 3 such GPUs.

Below is a table with the inference times of the model in FP16 and quantized to 3 bits, on Nvidia A100 GPUs with 80 GB of VRAM and Nvidia A6000 GPUs with 48 GB of VRAM.

| GPU (VRAM) | Average time per token in FP16 (ms) | Average time per token in 3-bit (ms) | Speedup | Reduction in GPUs required |
|---------------|-----|-----|-------|-------|
| A6000 (48 GB) | 589 | 130 | ×4.53 | 8 → 2 |
| A100 (80 GB) | 230 | 71 | ×3.24 | 5 → 1 |

For example, using the kernels, the 3-bit OPT-175B model runs on a single A100 (instead of 5) and is approximately 3.25 times faster than the FP16 version in terms of average time per token.

The NVIDIA A6000 GPU has a much lower memory bandwidth, making this strategy even more effective: running the 3-bit OPT-175B model on 2 A6000 GPUs (instead of 8) is approximately 4.53 times faster than the FP16 version.

Libraries

The authors of the paper released the GPTQ library. Other libraries were later created, such as GPTQ-for-LLaMa, exllama and llama.cpp. However, these libraries focus only on the Llama architecture, so the AutoGPTQ library is the one that gained the most popularity, because it has wider coverage of architectures.

Therefore, the AutoGPTQ library was integrated, by means of an API, into the transformers library. To use it, it is necessary to install it as indicated in the Installation section of its repository and to have the optimum library installed.

In addition to what is indicated in the Installation section of its repository, it is also advisable to do the following:

git clone https://github.com/PanQiWei/AutoGPTQ
      cd AutoGPTQ
      pip install .

This ensures that the GPU quantization kernels developed by the authors of the paper are installed.

Quantization of a model

Let's see how to quantize a model with the optimum library and the AutoGPTQ API.

Unquantized model inference

We are going to quantize the meta-llama/Meta-Llama-3-8B-Instruct model which, as its name indicates, is an 8B-parameter model, so in FP16 we would need about 16 GB of VRAM. First we run the model to see the memory it occupies and the output it generates.

Since using this model requires asking Meta for permission, we log in to Hugging Face to be able to download the tokenizer and the model.

	
from huggingface_hub import notebook_login
notebook_login()

Instantiate the tokenizer and the model

	
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().to(device)

Let's take a look at the memory occupied in FP16

	
model_memory = model.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_memory:.2f} GB")
Model memory: 14.96 GB

We see that it occupies almost 15 GB, a bit less than the 16 GB we estimated. Why the difference? On the one hand, the model has roughly 8.03B parameters rather than exactly 8B, and on the other, get_memory_footprint reports GiB (1024³ bytes): 8.03B parameters × 2 bytes ≈ 16.06 GB, which is about 14.96 GiB.
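
The same back-of-the-envelope calculation can be done for any precision. Here is a small helper (the 8.03B parameter count is an approximation, and per-group quantization parameters are ignored) that we can compare against the footprints reported in the rest of the post:

```python
def estimate_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Rough model size in GiB, ignoring quantization scales and zero points."""
    return n_params * bits_per_param / 8 / 1024**3

n_params = 8.03e9   # approximate parameter count of Llama-3-8B
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:2d} bits -> {estimate_memory_gib(n_params, bits):5.2f} GiB")
```

Note that the quantized footprints we will measure below are larger than these naive numbers, because the embeddings and the output layer typically stay in FP16 and each group of weights stores its own quantization parameters.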

We run an inference to see what it generates and how long it takes.

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = model.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineer at a startup in the Bay Area. I am passionate about building AI systems that can help humans make better decisions and improve their lives.
      
      I have a background in computer science and mathematics, and I have been working with machine learning for several years. I
      Inference time: 4.14 s
      

Quantization of the model to 4 bits

Let's quantize it to 4 bits. I restart the notebook to avoid memory problems, so we log into Hugging Face again.

	
from huggingface_hub import notebook_login
notebook_login()

First I create the tokenizer

	
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we create the quantization configuration. As we have said, this algorithm computes the error of the quantized weights with respect to the original weights using inputs from a dataset, so in the configuration we have to tell it which dataset we want to use to quantize the model.

The available defaults are wikitext2, c4, c4-new, ptb and ptb-new.

We can also create a dataset ourselves from a list of strings

dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
      

We also have to tell it the number of bits that the quantized model has by means of the bits parameter.
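
For example, if instead of one of the default datasets we wanted to use our own list of texts as calibration data, the configuration would look something like this (the second sentence is made up for the illustration, and the tokenizer is the one created above):

```python
from transformers import GPTQConfig

calibration_data = [
    "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.",
    "Quantization reduces the number of bits used to store each weight of the model.",
]
quantization_config = GPTQConfig(bits=4, dataset=calibration_data, tokenizer=tokenizer)
```

In what follows, though, we stick with the c4 default.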

	
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)

We quantize the model

from transformers import AutoModelForCausalLM
      import time
      
      t0 = time.time()
      model_4bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
      t_quantization = time.time() - t0
      print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
      
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<?, ?it/s]
Quantizing model.layers blocks : 100%|██████████|32/32 [00:00<?, ?it/s]
Quantizing layers inside the block: 100%|██████████| 7/7 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
        warnings.warn(
      
Quantization time: 1932.09 s = 32.20 min
      

Since the quantization process minimizes the error between the quantized and the original weights by passing inputs through each layer, quantizing takes a while. In this case it took about half an hour.

Let's see how much memory it occupies now

	
model_4bits_memory = model_4bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_4bits_memory:.2f} GB")
Model memory: 5.34 GB

Here we can see a benefit of quantization. While the original model occupied about 15 GB of VRAM, now the quantized model occupies about 5 GB, almost a third of the original size.
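
As a rough sanity check (an estimate on my part, using approximate Llama-3 dimensions and assuming the embeddings and output layer are kept in FP16 and that each group of 128 weights stores one FP16 scale), the 5.34 GB figure is consistent with 4-bit storage for the quantized part of the model:

```python
vocab, hidden = 128256, 4096            # approximate Llama-3 dimensions
embed_params = 2 * vocab * hidden       # input embeddings + output layer, kept in FP16
quant_params = 8.03e9 - embed_params    # parameters that actually get quantized
group_size, scale_bits = 128, 16        # one FP16 scale per group (zero points ignored)

bytes_fp16 = embed_params * 2
bytes_4bit = quant_params * 4 / 8
bytes_scales = quant_params / group_size * scale_bits / 8
print(f"~{(bytes_fp16 + bytes_4bit + bytes_scales) / 1024**3:.2f} GiB")   # ≈ 5.3 GiB
```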

We run the inference and see how long it takes

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_4bits.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = model_4bits.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineer. I have a strong background in computer science and mathematics, and I am passionate about developing innovative solutions that can positively impact society. I am excited to be a part of this community and to learn from and contribute to the discussions here. I am particularly
      Inference time: 2.34 s
      

The unquantized model took 4.14 seconds, while the model quantized to 4 bits took 2.34 seconds and still generated good text. We have managed to cut the inference time almost in half.

Since the size of the quantized model is almost a third of the FP16 model, we might expect the inference to be about three times faster. But we have to remember that in each layer the weights are dequantized and the computations are performed in FP16, so we only managed to cut the inference time in half, not to a third.

Now we save the model

	
save_folder = "./model_4bits/"
model_4bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
('./model_4bits/tokenizer_config.json',
'./model_4bits/special_tokens_map.json',
'./model_4bits/tokenizer.json')

And we upload it to the hub

repo_id = "Llama-3-8B-Instruct-GPTQ-4bits"
      commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
      model_4bits.push_to_hub(repo_id, commit_message=commit_message)
      
README.md: 100%|██████████| 5.17/5.17k [00:00<?, ?B/s]
CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-4bits/commit/44cfdcad78db260122943d3f57858c1b840bda17', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 4bits, gr128, desc_act=False', commit_description='', oid='44cfdcad78db260122943d3f57858c1b840bda17', pr_url=None, pr_revision=None, pr_num=None)

We also upload the tokenizer. Even though we have not changed it, we upload it because anyone who downloads our model from the Hub should not need to figure out which tokenizer we used; they will most likely want to download the model and the tokenizer together. We could indicate in the model card which tokenizer to use, but most likely people would not read it, would try to download the tokenizer, get an error and not know what to do. So we upload it to avoid that problem.

repo_id = "Llama-3-8B-Instruct-GPTQ-4bits"
      commit_message = f"Tokenizers for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
      tokenizer.push_to_hub(repo_id, commit_message=commit_message)
      
README.md: 100%|██████████| 0.00/5.17k [00:00<?, ?B/s]
Out[5]:
CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-4bits/commit/75600041ca6e38b5f1fb912ad1803b66656faae4', commit_message='Tokenizers for meta-llama/Meta-Llama-3-8B-Instruct: 4bits, gr128, desc_act=False', commit_description='', oid='75600041ca6e38b5f1fb912ad1803b66656faae4', pr_url=None, pr_revision=None, pr_num=None)

Quantization of the model to 3 bits

Let's quantize it to 3 bits. I restart the notebook to avoid memory problems and log in to Hugging Face again.

	
from huggingface_hub import notebook_login
notebook_login()

First I create the tokenizer

	
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We create the quantization configuration; this time we indicate that we want to quantize to 3 bits.

	
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=3, dataset = "c4", tokenizer=tokenizer)

We quantize the model

from transformers import AutoModelForCausalLM
      import time
      
      t0 = time.time()
      model_3bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
      t_quantization = time.time() - t0
      print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
      
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<?, ?it/s]
Quantizing model.layers blocks : 100%|██████████|32/32 [00:00<?, ?it/s]
Quantizing layers inside the block: 100%|██████████| 7/7 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
        warnings.warn(
      
Quantization time: 1912.69 s = 31.88 min
      

As before, it has taken about half an hour.

Let's see how much memory it occupies now

	
model_3bits_memory = model_3bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_3bits_memory:.2f} GB")
Model memory: 4.52 GB

The memory occupied by the 3-bit model is also almost 5 GB. The 4-bit model occupied 5.34 GB, while now in 3-bit it occupies 4.52 GB, so we have managed to reduce the size of the model a little more.

We run the inference and see how long it takes

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_3bits.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = model_3bits.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineer at Google. I am excited to be here today to talk about my work in the field of Machine Learning and to share some of the insights I have gained through my experiences.
      I am a Machine Learning Engineer at Google, and I am excited to be
      Inference time: 2.89 s
      

Although the output on 3 bits is good, now the inference time has been 2.89 seconds, while on 4 bits it was 2.34 seconds. More testing should be done to see if it always takes less time on 4 bits, or maybe the difference is so small that sometimes the 3-bit inference is faster and sometimes the 4-bit inference is faster.
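
To compare the 3-bit and 4-bit inference times more fairly, a quick benchmark along these lines (a sketch; a proper benchmark would also add a warm-up run and fix the generation settings) could average the time over several runs:

```python
import time
import torch

def benchmark(model, tokenizer, prompt, n_runs=10, max_new_tokens=50):
    """Average generation time in seconds over n_runs."""
    input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
    times = []
    for _ in range(n_runs):
        t0 = time.time()
        with torch.no_grad():
            model.generate(
                input_ids=input_tokens.input_ids,
                attention_mask=input_tokens.attention_mask,
                max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
            )
        times.append(time.time() - t0)
    return sum(times) / len(times)

# e.g. benchmark(model_3bits, tokenizer, "Hello my name is Maximo and I am a Machine Learning Engineer")
```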

In addition, although the output makes sense, it starts to become repetitive.

Save the model

	
save_folder = "./model_3bits/"
model_3bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
('./model_3bits/tokenizer_config.json',
'./model_3bits/special_tokens_map.json',
'./model_3bits/tokenizer.json')

And we upload it to the hub

repo_id = "Llama-3-8B-Instruct-GPTQ-3bits"
      commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
      model_3bits.push_to_hub(repo_id, commit_message=commit_message)
      
model.safetensors: 100%|██████████| 4.85/4.85G [00:00<?, ?B/s]
Out[14]:
CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-3bits/commit/422fd94a031234c10224ddbe09c0e029a5e9c01f', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 3bits, gr128, desc_act=False', commit_description='', oid='422fd94a031234c10224ddbe09c0e029a5e9c01f', pr_url=None, pr_revision=None, pr_num=None)

We also upload the tokenizer

repo_id = "Llama-3-8B-Instruct-GPTQ-3bits"
      commit_message = f"Tokenizers for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
      tokenizer.push_to_hub(repo_id, commit_message=commit_message)
      
README.md: 100%|██████████| 0.00/5.17k [00:00<?, ?B/s]
CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-4bits/commit/75600041ca6e38b5f1fb912ad1803b66656faae4', commit_message='Tokenizers for meta-llama/Meta-Llama-3-8B-Instruct: 4bits, gr128, desc_act=False', commit_description='', oid='75600041ca6e38b5f1fb912ad1803b66656faae4', pr_url=None, pr_revision=None, pr_num=None)

Quantization of the model to 2 bits

Let's quantize it to 2 bits. I restart the notebook to avoid memory problems and log in again in Hugging Face

	
from huggingface_hub import notebook_login
notebook_login()

First I create the tokenizer

	
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We create the quantization configuration. Now we tell it to quantize to 2 bits. In addition, we must indicate how many consecutive weights share the same quantization parameters by means of the group_size parameter. Before, it had the default value of 128 and we did not touch it, but now, when quantizing to 2 bits, we use a smaller value to reduce the error. If we left it at 128, the quantized model would perform very badly; in this case I am going to use a value of 16.

	
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=2, dataset = "c4", tokenizer=tokenizer, group_size=16)

from transformers import AutoModelForCausalLM
      import time
      
      t0 = time.time()
      model_2bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
      t_quantization = time.time() - t0
      print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
      
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<?, ?it/s]
Quantizing model.layers blocks : 100%|██████████|32/32 [00:00<?, ?it/s]
Quantizing layers inside the block: 100%|██████████| 7/7 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
        warnings.warn(
      
Quantization time: 1973.12 s = 32.89 min
      

We see that it has also taken about half an hour.

Let's see how much memory it occupies now

	
model_2bits_memory = model_2bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_2bits_memory:.2f} GB")
Model memory: 4.50 GB

While quantized at 4 bits it occupied 5.34 GB and at 3 bits it occupied 4.52 GB, now quantized at 2 bits it occupies 4.50 GB, so we have managed to reduce the size of the model a little more.

We run the inference and see how long it takes

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_2bits.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = model_2bits.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineer.  # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
      Inference time: 2.92 s
      

We see that the output is no longer good, and the inference time is 2.92 seconds, more or less the same as with 3 and 4 bits.

Save the model

	
save_folder = "./model_2bits/"
model_2bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
('./model_2bits/tokenizer_config.json',
'./model_2bits/special_tokens_map.json',
'./model_2bits/tokenizer.json')

We upload it to the hub

repo_id = "Llama-3-8B-Instruct-GPTQ-2bits"
      commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
      model_2bits.push_to_hub(repo_id, commit_message=commit_message)
      
model.safetensors: 100%|██████████| 4.83/4.83G [00:00<?, ?B/s]
Out[8]:
CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-2bits/commit/13ede006ce0dbbd8aca54212e960eff98ea5ec63', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 2bits, gr16, desc_act=False', commit_description='', oid='13ede006ce0dbbd8aca54212e960eff98ea5ec63', pr_url=None, pr_revision=None, pr_num=None)

Quantization of the model to 1 bit

Let's quantize it to 1 bit. I restart the notebook to avoid memory problems and log in again in Hugging Face

	
from huggingface_hub import notebook_login
notebook_login()

First I create the tokenizer

	
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We create the quantization configuration, now we tell it to quantize to only 1 bit and also to use a group_size of 8

	
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=2, dataset = "c4", tokenizer=tokenizer, group_size=8)

from transformers import AutoModelForCausalLM
      import time
      
      t0 = time.time()
      model_1bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
      t_quantization = time.time() - t0
      print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
      
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<?, ?it/s]
Quantizing model.layers blocks : 100%|██████████|32/32 [00:00<?, ?it/s]
Quantizing layers inside the block: 100%|██████████| 7/7 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
        warnings.warn(
      
Quantization time: 2030.38 s = 33.84 min
      

We see that it also takes about half an hour to quantize

Let's see how much memory it occupies now

	
model_1bits_memory = model_1bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_1bits_memory:.2f} GB")
Model memory: 5.42 GB

We see that in this case it occupies even more than the model quantized to 2 bits (4.50 GB). This is most likely because, with such a small group size, many more quantization parameters have to be stored.
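
A plausible explanation (an estimate on my part, assuming one FP16 scale and one bits-wide zero point are stored per group) is that with such small groups the quantization parameters dominate the storage, so the effective number of bits per weight grows well beyond the nominal value:

```python
def effective_bits(bits: int, group_size: int, scale_bits: int = 16) -> float:
    # nominal weight bits + (scale + zero point) shared by each group of weights
    return bits + (scale_bits + bits) / group_size

for bits, g in [(4, 128), (3, 128), (2, 16), (2, 8)]:
    print(f"{bits} bits, group_size={g:3d} -> ~{effective_bits(bits, g):.2f} bits per weight")
```

With group_size=8 the storage works out to roughly 4.25 effective bits per weight, which is why the footprint ends up close to, and even above, that of the 4-bit model quantized with group_size=128.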

We run the inference and see how long it takes

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_1bits.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = model_1bits.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineerimerszuimersimerspinsimersimersingoingoimersurosimersimersimersoleningoimersingopinsimersbirpinsimersimersimersorgeingoimersiringimersimersimersimersimersimersimersンディorge_REFERER ingest羊imersorgeimersimersendetingoШАhandsingo
      Inference time: 3.12 s
      

We see that the output is very bad and also takes longer than when we have quantized to 2 bits.

Save the model

	
save_folder = "./model_1bits/"
model_1bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
('./model_1bits/tokenizer_config.json',
'./model_1bits/special_tokens_map.json',
'./model_1bits/tokenizer.json')

We upload it to the hub

repo_id = "Llama-3-8B-Instruct-GPTQ-1bits"
      commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
      model_1bits.push_to_hub(repo_id, commit_message=commit_message)
      
README.md: 100%|██████████| 0.00/5.17k [00:00<?, ?B/s]
Upload 2 LFS files: 100%|██████████| 0/2 [00:00<?, ?it/s]
model-00002-of-00002.safetensors: 100%|██████████| 0.00/1.05G [00:00<?, ?B/s]
model-00001-of-00002.safetensors: 100%|██████████| 0.00/4.76G [00:00<?, ?B/s]
Out[8]:
CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-2bits/commit/e59ccffc03247e7dcc418f98b482cc02dc7a168d', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 2bits, gr8, desc_act=False', commit_description='', oid='e59ccffc03247e7dcc418f98b482cc02dc7a168d', pr_url=None, pr_revision=None, pr_num=None)

Quantization summary

Let's compare the quantizations to 4, 3, 2 and 1 bits

| Bits | Quantization time (min) | Memory (GB) | Inference time (s) | Output quality |
|------|-------------------------|-------------|--------------------|----------------|
| FP16 | 0 | 14.96 | 4.14 | Good |
| 4 | 32.20 | 5.34 | 2.34 | Good |
| 3 | 31.88 | 4.52 | 2.89 | Good |
| 2 | 32.89 | 4.50 | 2.92 | Poor |
| 1 | 33.84 | 5.42 | 3.12 | Poor |

Looking at this table we see that it does not make sense, in this example, to quantize to less than 4 bits.

Quantizing at 1 and 2 bits clearly does not make sense because the output quality is poor.

But although the output when quantizing to 3 bits is good, it started to become repetitive, so in the long run it would probably not be a good idea to use that model. Besides, neither the savings in quantization time, nor in VRAM, nor in inference time are significant compared to quantizing to 4 bits.

Loading of the saved model

Now that we have compared the different quantizations, let's see how to load the 4-bit model we saved, since, as we have seen, it is the best option.

First we load the tokenizer that we have used

	
from transformers import AutoTokenizer
path = "./model_4bits"
tokenizer = AutoTokenizer.from_pretrained(path)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we load the model we have saved

	
from transformers import AutoModelForCausalLM
load_model_4bits = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
	
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<?, ?it/s]

Let's see how much memory it occupies

	
load_model_4bits_memory = load_model_4bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {load_model_4bits_memory:.2f} GB")
	
Model memory: 5.34 GB

We see that it occupies the same memory as when we quantized it, which is logical.

We run the inference and see how long it takes

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(load_model_4bits.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = load_model_4bits.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineer. I have a strong background in computer science and mathematics, and I have been working with machine learning models for several years. I am excited to be a part of this community and to share my knowledge and experience with others. I am particularly interested in
      Inference time: 3.82 s
      

We see that the inference is good and it took 3.82 seconds, a little longer than when we quantized it. But as I said before, we would have to do this test many times and take an average.

Loading of the model uploaded to the hub

Now we see how to load the 4-bit model that we have uploaded to the hub

First we load the tokenizer that we have uploaded

	
from transformers import AutoTokenizer
checkpoint = "Maximofn/Llama-3-8B-Instruct-GPTQ-4bits"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we load the model we have saved

	
from transformers import AutoModelForCausalLM
load_model_4bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

Let's see how much memory it occupies

	
load_model_4bits_memory = load_model_4bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {load_model_4bits_memory:.2f} GB")
Model memory: 5.34 GB

It also occupies the same memory

We run the inference and see how long it takes

import time
      
      input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(load_model_4bits.device)
      
      t0 = time.time()
      max_new_tokens = 50
      outputs = load_model_4bits.generate(
          input_ids=input_tokens.input_ids,
          attention_mask=input_tokens.attention_mask,
          max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
      )
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      print(f"Inference time: {time.time() - t0:.2f} s")
      
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
      
Hello my name is Maximo and I am a Machine Learning Engineer with a passion for building innovative AI solutions. I have been working in the field of AI for over 5 years, and have gained extensive experience in developing and implementing machine learning models for various industries.
      
      In my free time, I enjoy reading books on
      Inference time: 3.81 s
      

We see that the inference is also good and took 3.81 seconds.
