In the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers the need to create a post-training quantization method that does not degrade the quality of the model is exposed. In this post we have seen the method llm.int8() that quantizes to INT8 some vectors of the weight matrices, as long as none of their values exceeds a threshold value, which is fine, but they do not quantize all the weights of the model. In this paper they propose a method that quantizes all the weights of the model to 4 and 3 bits, without degrading the quality of the model. This saves considerable memory, not only because all the weights are quantized, but also because it is done at 4, 3 bits (and even 1 and 2 bits under certain conditions), instead of 8 bits.
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
## Works on which it is based
On the one hand, they are based on the works Nagel et al., 2020
; Wang et al., 2020
; Hubara et al., 2021
and Frantar et al., 2022
, which propose to quantize the weights of the layers of a neural network to 4 and 3 bits, without degrading the quality of the model.
Having a data set m
coming from a dataset, to each layer l
you put the data and you get the output of the weights W
of that layer. So what we do is to look for new quantized Ŵ
weights that minimize the quadratic error in relation to the output of the total accuracy layer
argmin_Ŵ||WX- ŴX|||^2
The values of Ŵ
are set before performing the quantization process and during the process, each parameter of Ŵ
can change its value independently without depending on the value of the other parameters of Ŵ
.
In the OBQ
work of Frantar et al., 2022
they optimize the above layered quantization process, making it up to 3 times faster. This helps with large models, as quantizing a large model can take a long time.
The OBQ
method is an approach to solve the layered quantization problem in language models. OBQ
starts from the idea that the quadratic error can be decomposed into the sum of individual errors for each row of the weight matrix. The method then quantizes each weight independently, always updating the unquantized weights to compensate for the error incurred by the quantization.
The method is capable of quantifying medium-sized models in reasonable times, but as it is a cubic complexity algorithm it is extremely expensive to apply to models with billions of parameters.
In OBQ
they searched for the row of weights that created the lowest mean square error to quantize, but they realized that doing it randomly did not increase the final mean square error much. So instead of looking for the row that minimizes the mean square error, which created a cubic complexity in the algorithm, it is always done in the same order. Thanks to this, the execution time of the quantization algorithm is greatly reduced.
As the update of the weights is done row by row, it makes it a slow process and does not take full advantage of the hardware. Therefore, they propose to perform the updates in batches of B=128
rows. This makes better use of the hardware and reduces the execution time of the algorithm.
The problem with doing batch updates is that, due to the large scale of the models, numerical errors can occur that affect the accuracy of the algorithm. Specifically, undefined matrices can be obtained, causing the algorithm to update the remaining weights in the wrong directions, resulting in very poor quantization.
To solve this, the authors of the paper propose to use a Cholesky reformulation, which is a more numerically stable method.
Below are two graphs with the perplexity measure in the WikiText2
dataset for all sizes of the OPT and BLOOM models. It can be seen that with the RTN quantization technique, the perplexity at some sizes increases a lot, while with GPTQ it remains similar to that obtained with the FP16 model.
Other plots are shown below, but with the measurement of the accuracy in the LAMBADA
dataset. The same thing happens, while GPTQ remains similar to that obtained with FP16, other quantization methods greatly degrade the quality of the model.
In the previous graphs we have shown the results of quantizing the model to 3 and 4 bits, but we can quantize them to 2 bits, and even to 1 bit.
By modifying the size of the batches when using the algorithm we can obtain good results quantizing both the model
Modelo | FP16 | g128 | g64 | g32 | 3 bits |
---|---|---|---|---|---|
OPT-175B | 8.34 | 9.58 | 9.18 | 8.94 | 8.68 |
BLOOM | 8.11 | 9.55 | 9.17 | 8.83 | 8.64 |
In the table above you can see the result of the perplexity in the WikiText2
dataset for the OPT-175B
and BLOOM
models quantized to 3 bits. It can be seen that as smaller batches are used, the perplexity decreases, which means that the quality of the quantized model is better. But it has the problem that the algorithm takes longer to run.
During the inference, something called dynamic dequantization
is performed in order to perform the inference. Each layer is dequantized as it is passed through.
To do this, they developed a kernel that de-quantifies the matrices and performs the matrix products. Although the quantization consumes more computation, the kernel has to access much less memory, which generates significant speedups.
The inference is performed in FP16 by discounting the weights as you go through the layers and the activation function of each layer is also performed in FP16. Although this means that more calculations have to be done, because the weights have to be discounted, these calculations make the overall process faster, because less data has to be fetched from memory. The weights have to be fetched from memory in fewer bits, so in the end, in matrices of many parameters it saves a lot of data. The bottleneck is usually in fetching the data from memory, so even if you have to do more calculations, in the end the inference is faster.
The authors of the paper performed a test quantizing the BLOOM-175B model to 3 bits, which occupied about 63 GB of VRAM memory, including the embeddings and the output layer that are kept in FP16. In addition maintaining the 2048 token context window consumes about 9 GB of memory, totaling about 72 GB of VRAM memory. They quantized in 3 bits and not 4 bits to be able to perform this experiment and to be able to fit the model on a single Nvidia A100 GPU with 80 GB of VRAM memory.
For comparison, normal FP16 inference requires about 350 GB of VRAM memory, which is equivalent to 5 Nvidia A100 GPUs with 80 GB of VRAM memory. And 8-bit quantizing inference using llm.int8() requires 3 such GPUs.
Below is a table with the model inference in FP16 and quantized to 3 bits and Nvidia A100 GPUs with 80 GB of VRAM memory and Nvidia A6000 with 48 GB of VRAM memory.
GPU (VRAM) | Average time per token in FP16 (ms) | Average time per token in 3 bit (ms) | Acceleration | Reduction of GPUs required |
---|---|---|---|---|
A6000 (48GB) | 589 | 130 | ×4.53 | 8→ 2 |
A100 (80GB) | 230 | 71 | 71 ×3.24 | 5→ 1 |
For example, using the kernels, the 3-bit OPT-175B model runs on a single A100 (instead of 5) and is approximately 3.25 times faster than the FP16 version in terms of average time per token.
The NVIDIA A6000 GPU has a much lower memory bandwidth, making this strategy even more effective: running the 3-bit OPT-175B model on 2 A6000 GPUs (instead of 8) is approximately 4.53 times faster than the FP16 version.
The authors of the paper implemented the GPTQ library. Other libraries were created such as GPTQ-for-LLaMa, exllama and llama.cpp. However these libraries focus only on the llama architecture, so the AutoGPTQ library gained the most popularity because it has a wider coverage of architectures.
Therefore, this library AutoGPTQ was integrated by means of an API inside the transformers library. In order to use it, it is necessary to install it as indicated in the Installation section of its repository and to have the optimun library installed.
In addition to what is indicated in the Installation section of your repository, it is also advisable to do the following:
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .
For the GPU quantization kernels developed by the authors of the paper to be installed.
We are going to quantize the meta-flame/Meta-Flame-3-8B-Instruct model which as its name indicates is an 8B parameter model, so in FP16 we would need 16 GB of VRAM memory. First we run the model to see the memory it occupies and the output it generates
As to use this model we have to ask permission to Meta, we log in to HuggingFace to download the tokenizer and the model.
from huggingface_hub import notebook_login
notebook_login()
Instantiate the tokenizer and the model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().to(device)
Let’s take a look at the memory occupied in FP16
model_memory = model.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_memory:.2f} GB")
We see that it occupies almost 15 GB, more or less the 16 GB that we had said it should occupy, but why this difference? Surely this model does not have exactly 8B of parameters, but it has a little less, but when indicating the number of parameters it is rounded to 8B.
We make an inference to see how it does it and how long it takes.
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model.device)
t0 = time.time()
max_new_tokens = 50
outputs = model.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
Let’s quantize it to 4 bits. I restart the notebook to avoid memory problems, so we log into Hugging Face again.
from huggingface_hub import notebook_login
notebook_login()
First I create the tokenizer
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Now we create the quantization configuration. As we have said, this algorithm calculates the error of the quantized weights over the original weights based on inputs from a dataset, so in the configuration we have to pass it with which dataset we want to quantize the model.
The available defaults are wikitext2
,c4
,c4-new
,ptb
and ptb-new
.
We can also create a dataset ourselves from a list of strings
dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
We also have to tell it the number of bits that the quantized model has by means of the bits
parameter.
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)
We quantify the model
from transformers import AutoModelForCausalLM
import time
t0 = time.time()
model_4bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
t_quantization = time.time() - t0
print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
As the quantization process calculates the smallest error between the quantized weights and the original weights by passing inputs through each layer, the quantization process takes time. In this case it took about half an hour
Let’s see how much memory it occupies now
model_4bits_memory = model_4bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_4bits_memory:.2f} GB")
Here we can see a benefit of quantization. While the original model occupied about 15 GB of VRAM, now the quantized model occupies about 5 GB, almost a third of the original size.
We make the inference and see how long it takes
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_4bits.device)
t0 = time.time()
max_new_tokens = 50
outputs = model_4bits.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
The unquantized model took 4.14 seconds, while now quantized to 4 bits it has taken 2.34 seconds and also generated the text well. We have managed to reduce the inference by almost half.
As the size of the quantized model is almost one third of the model in FP16, we could think that the inference speed should be about three times faster with the quantized model. But we have to remember that in each layer the weights are quantized and the computations are performed in FP16, so we only managed to reduce the inference time by half and not by one third.
Now we save the model
save_folder = "./model_4bits/"
model_4bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
And we upload it to the hub
repo_id = "Llama-3-8B-Instruct-GPTQ-4bits"
commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
model_4bits.push_to_hub(repo_id, commit_message=commit_message)
We also upload the tokenizer. Although we have not changed the tokenizer, we upload it because if a person downloads our model from the hub, he/she does not need to know which tokenizer we have used, so he/she will probably want to download the model and the tokenizer together. We can indicate in the model card which tokenizer we have used to download it, but most likely they will not read the model card, try to download the tokenizer, get an error and not know what to do. So we upload it to save us that problem.
repo_id = "Llama-3-8B-Instruct-GPTQ-4bits"
commit_message = f"Tokenizers for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
tokenizer.push_to_hub(repo_id, commit_message=commit_message)
Let’s quantize it to 3 bits. I restart the notebook to avoid memory problems and log in to Hugging Face again.
from huggingface_hub import notebook_login
notebook_login()
First I create the tokenizer
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
We create the quantization configuration, now we indicate that we want to quantize to 3 bits
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=3, dataset = "c4", tokenizer=tokenizer)
We quantify the model
from transformers import AutoModelForCausalLM
import time
t0 = time.time()
model_3bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
t_quantization = time.time() - t0
print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
As before, it has taken about half an hour.
Let’s see how much memory it occupies now
model_3bits_memory = model_3bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_3bits_memory:.2f} GB")
The memory occupied by the 3-bit model is also almost 5 GB. The 4-bit model occupied 5.34 GB, while now in 3-bit it occupies 4.52 GB, so we have managed to reduce the size of the model a little more.
We make the inference and see how long it takes
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_3bits.device)
t0 = time.time()
max_new_tokens = 50
outputs = model_3bits.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
Although the output on 3 bits is good, now the inference time has been 2.89 seconds, while on 4 bits it was 2.34 seconds. More testing should be done to see if it always takes less time on 4 bits, or maybe the difference is so small that sometimes the 3-bit inference is faster and sometimes the 4-bit inference is faster.
In addition, although the output makes sense, it starts to become repetitive.
Save the model
save_folder = "./model_3bits/"
model_3bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
And we upload it to the hub
repo_id = "Llama-3-8B-Instruct-GPTQ-3bits"
commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
model_3bits.push_to_hub(repo_id, commit_message=commit_message)
We also upload the tokenizer
repo_id = "Llama-3-8B-Instruct-GPTQ-3bits"
commit_message = f"Tokenizers for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
tokenizer.push_to_hub(repo_id, commit_message=commit_message)
Let’s quantize it to 2 bits. I restart the notebook to avoid memory problems and log in again in Hugging Face
from huggingface_hub import notebook_login
notebook_login()
First I create the tokenizer
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
We create the quantization configuration. Now we tell it to quantize to 2 bits. In addition we must indicate how many vectors of the weight matrix quantize at a time by means of the parameter group_size
, before by default it had the value 128 and we did not touch it, but now when quantizing to 2 bits, to have less error we put a smaller value. If we leave it at 128, the quantized model would work very badly, in this case I am going to put a value of 16.
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=2, dataset = "c4", tokenizer=tokenizer, group_size=16)
from transformers import AutoModelForCausalLM
import time
t0 = time.time()
model_2bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
t_quantization = time.time() - t0
print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
We see that it has also taken about half an hour.
Let’s see how much memory it occupies now
model_2bits_memory = model_2bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_2bits_memory:.2f} GB")
While quantized at 4 bits it occupied 5.34 GB and at 3 bits it occupied 4.52 GB, now quantized at 2 bits it occupies 4.50 GB, so we have managed to reduce the size of the model a little more.
We make the inference and see how long it takes
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_2bits.device)
t0 = time.time()
max_new_tokens = 50
outputs = model_2bits.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
We see that the output is already not good, and the inference time is 2.92 seconds, more or less the same as with 3 and 4 bits.
Save the model
save_folder = "./model_2bits/"
model_2bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
We upload it to the hub
repo_id = "Llama-3-8B-Instruct-GPTQ-2bits"
commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
model_2bits.push_to_hub(repo_id, commit_message=commit_message)
Let’s quantize it to 1 bit. I restart the notebook to avoid memory problems and log in again in Hugging Face
from huggingface_hub import notebook_login
notebook_login()
First I create the tokenizer
from transformers import AutoTokenizer
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
We create the quantization configuration, now we tell it to quantize to only 1 bit and also to use a group_size
of 8
from transformers import GPTQConfig
quantization_config = GPTQConfig(bits=2, dataset = "c4", tokenizer=tokenizer, group_size=8)
from transformers import AutoModelForCausalLM
import time
t0 = time.time()
model_1bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)
t_quantization = time.time() - t0
print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")
We see that it also takes about half an hour to quantize
Let’s see how much memory it occupies now
model_1bits_memory = model_1bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {model_1bits_memory:.2f} GB")
We see that in this case it occupies even more than quantized to 2 bits, 4.52 GB.
We make the inference and see how long it takes
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_1bits.device)
t0 = time.time()
max_new_tokens = 50
outputs = model_1bits.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
We see that the output is very bad and also takes longer than when we have quantized to 2 bits.
Save the model
save_folder = "./model_1bits/"
model_1bits.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
We upload it to the hub
repo_id = "Llama-3-8B-Instruct-GPTQ-1bits"
commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"
model_1bits.push_to_hub(repo_id, commit_message=commit_message)
Let’s buy the quantization to 4, 3, 2 and 1 bits
Bits | Quantization time (min) | Memory (GB) | Inference time (s) | Quality of the output |
---|---|---|---|---|
FP16 | 0 | 14.96 | 4.14 | Good |
4 | 32.20 | 5.34 | 2.34 | Good |
3 | 31.88 | 4.52 | 2.89 | Good |
2 | 32.89 | 4.50 | 2.92 | Poor |
1 | 33.84 | 5.42 | 3.12 | Poor |
Looking at this table we see that it does not make sense, in this example, to quantize to less than 4 bits.
Quantizing at 1 and 2 bits clearly does not make sense because the output quality is poor.
But although the output when quantizing at 3 bits is good, it started to be repetitive, so in the long run, it would probably not be a good idea to use that model. Besides neither the savings in quantization time, the savings in VRAM and the savings in inference time is significant compared to quantizing to 4 bits.
Now that we have compared the quantization of models, let’s see how it would be done to load the 4-bit model that we have saved, since as we have seen, it is the best choice
First we load the tokenizer that we have used
from transformers import AutoTokenizer
path = "./model_4bits"
tokenizer = AutoTokenizer.from_pretrained(path)
Now we load the model we have saved
from transformers import AutoModelForCausalLM
load_model_4bits = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
We see the memory occupied by
load_model_4bits_memory = load_model_4bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {load_model_4bits_memory:.2f} GB")
We see that it occupies the same memory as when we quantized it, which is logical.
We make the inference and see how long it takes
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(load_model_4bits.device)
t0 = time.time()
max_new_tokens = 50
outputs = load_model_4bits.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
We see that the inference is good and it took 3.82 seconds, a little longer than when we quantized it. But as I said before, we would have to do this test many times and take an average.
Now we see how to load the 4-bit model that we have uploaded to the hub
First we load the tokenizer that we have uploaded
from transformers import AutoTokenizer
checkpoint = "Maximofn/Llama-3-8B-Instruct-GPTQ-4bits"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Now we load the model we have saved
from transformers import AutoModelForCausalLM
load_model_4bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
We see the memory occupied by
load_model_4bits_memory = load_model_4bits.get_memory_footprint()/(1024**3)
print(f"Model memory: {load_model_4bits_memory:.2f} GB")
It also occupies the same memory
We make the inference and see how long it takes
import time
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(load_model_4bits.device)
t0 = time.time()
max_new_tokens = 50
outputs = load_model_4bits.generate(
input_ids=input_tokens.input_ids,
attention_mask=input_tokens.attention_mask,
max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {time.time() - t0:.2f} s")
We see that the inference is also good and took 3.81 seconds.