GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Attention developers! 🚨 Do you have a language model that is too big and heavy for your application? 🤯 Don't worry, GPTQ is here to help you! 🤖 This quantization algorithm is like a wizard that makes unnecessary bits and bytes disappear, reducing the size of your model without losing too much precision. 🎩 It's like compressing a file without losing quality - it's a way to make your models more efficient and faster! 🚀

Open In Colab

The paper GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers presents a post-training quantization method that does not degrade the quality of the model. In a previous post we saw llm.int8(), which quantizes some vectors of the weight matrices to INT8 as long as none of their values exceeds a threshold; that works well, but it does not quantize all the weights of the model. This paper proposes a method that quantizes all the weights of the model to 4 and 3 bits without degrading its quality. This saves a considerable amount of memory, not only because all the weights are quantized, but also because it is done at 4 or 3 bits (and even 2 or 1 bits under certain conditions) instead of 8 bits.

This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

## Works on which it is based

The method builds on the works of Nagel et al., 2020; Wang et al., 2020; Hubara et al., 2021; and Frantar et al., 2022, which propose quantizing the weights of the layers of a neural network to 4 and 3 bits without degrading the quality of the model.

Given m input examples X drawn from a dataset, each layer l with weights W is fed those inputs and produces its full-precision output. The goal is then to find quantized weights Ŵ that minimize the quadratic error with respect to the output of the full-precision layer:

$$\operatorname*{argmin}_{\hat{W}} \lVert WX - \hat{W}X \rVert_2^2$$

The quantization grid for Ŵ is fixed before the process starts, and during quantization each parameter of Ŵ can change its value independently of the other parameters of Ŵ.
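
To make the objective concrete, here is a minimal sketch of what is being measured (my own toy example, not the paper's code), using plain round-to-nearest quantization as a stand-in for the quantizer:

```python
import torch

# Toy illustration of the layer-wise objective ||W X - Ŵ X||^2.
# The quantizer here is simple round-to-nearest with one scale per row;
# GPTQ chooses Ŵ far more carefully, this only shows what is being minimized.
torch.manual_seed(0)

W = torch.randn(16, 64)    # full-precision weights of one layer (out_features x in_features)
X = torch.randn(64, 256)   # m = 256 calibration inputs, one per column

def quantize_rtn(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

W_hat = quantize_rtn(W, bits=4)

# The quantity GPTQ minimizes, layer by layer
error = torch.norm(W @ X - W_hat @ X) ** 2
print(f"Layer-wise quadratic error: {error.item():.2f}")
```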

In the OBQ work, Frantar et al., 2022 optimize this layer-wise quantization process, making it up to three times faster. This helps with large models, since quantizing a large model can take a long time.

The OBQ method is an approach to solving this layer-wise quantization problem. OBQ starts from the observation that the quadratic error decomposes into a sum of independent errors, one per row of the weight matrix. The method then quantizes the weights of each row one at a time, always updating the not-yet-quantized weights to compensate for the error introduced by each quantization step.
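
Written roughly as in the OBQ and GPTQ papers (the notation is an approximation on my part), the greedy step picks the weight $w_q$ whose quantization adds the least error and then updates the remaining, not yet quantized, weights $F$ to compensate:

$$w_q = \operatorname*{argmin}_{w_q} \frac{\left(\operatorname{quant}(w_q) - w_q\right)^2}{\left[\mathbf{H}_F^{-1}\right]_{qq}}, \qquad \delta_F = -\,\frac{w_q - \operatorname{quant}(w_q)}{\left[\mathbf{H}_F^{-1}\right]_{qq}} \cdot \left(\mathbf{H}_F^{-1}\right)_{:,q}$$

where $\mathbf{H}_F = 2\,\mathbf{X}_F \mathbf{X}_F^\top$ is the Hessian of the layer-wise objective restricted to the remaining weights.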

The method can quantize medium-sized models in reasonable time, but since its complexity is cubic, it is extremely expensive to apply to models with billions of parameters.

OBQ always picks next the weight whose quantization adds the least mean squared error, but the GPTQ authors noticed that quantizing the weights in an arbitrary fixed order barely increases the final error. So, instead of searching for the best weight to quantize next, which is what makes the algorithm's complexity cubic, all rows are quantized in the same fixed order. Thanks to this, the execution time of the quantization algorithm is greatly reduced.

Even so, the compensation updates are applied one column at a time, which is slow and does not take full advantage of the hardware. The authors therefore propose applying the updates lazily, in blocks of B = 128 columns, which makes better use of the hardware and reduces the execution time of the algorithm.

The problem with these batched updates is that, at the scale of these models, numerical errors accumulate and affect the accuracy of the algorithm. Specifically, the Hessian matrix can become indefinite, causing the algorithm to update the remaining weights in the wrong directions and resulting in a very poor quantization.

To solve this, the authors of the paper propose to use a Cholesky reformulation, which is a more numerically stable method.
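
A rough sketch of that idea in PyTorch (the variable names and the 1% damping are my assumptions, not taken from the paper's code): GPTQ builds the Hessian H = 2XXᵀ of the layer-wise objective, adds a small damping term, and then works with a Cholesky factor of its inverse rather than updating H⁻¹ iteratively:

```python
import torch

torch.manual_seed(0)
in_features, n_samples = 64, 256
X = torch.randn(in_features, n_samples)   # calibration inputs for one layer

H = 2 * X @ X.T                                  # Hessian of the quadratic objective
damp = 0.01 * torch.mean(torch.diag(H))          # small damping, ~1% of the mean diagonal
H += damp * torch.eye(in_features)

# Numerically stable route: invert via Cholesky, then keep an upper Cholesky
# factor of the inverse to use during the column-by-column quantization.
H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
H_inv_chol = torch.linalg.cholesky(H_inv, upper=True)

print(H_inv_chol.shape)  # torch.Size([64, 64])
```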

Below are two graphs with the perplexity measured on the WikiText2 dataset for all sizes of the OPT and BLOOM models. With the RTN (round-to-nearest) quantization technique the perplexity shoots up at some sizes, while with GPTQ it stays close to that of the FP16 model.

Figure (GPTQ-figure1): WikiText2 perplexity of the OPT and BLOOM models at all sizes, FP16 vs RTN vs GPTQ.

Below are similar plots, but measuring accuracy on the LAMBADA dataset. The same thing happens: GPTQ stays close to FP16, while other quantization methods greatly degrade the quality of the model.

Figure (GPTQ-figure3): LAMBADA accuracy of the OPT and BLOOM models, FP16 vs GPTQ vs other quantization methods.

The previous graphs show results for quantizing the models to 3 and 4 bits, but we can also quantize them to 2 bits, and even to 1 bit.

By adjusting the group size used by the algorithm, we can still obtain good results when quantizing the models to these very low bit widths, as the following table shows.

| Model | FP16 | g128 | g64 | g32 | 3 bits |
|---|---|---|---|---|---|
| OPT-175B | 8.34 | 9.58 | 9.18 | 8.94 | 8.68 |
| BLOOM | 8.11 | 9.55 | 9.17 | 8.83 | 8.64 |

The table above shows the WikiText2 perplexity for the OPT-175B and BLOOM models quantized to very few bits with different group sizes. As smaller group sizes are used, the perplexity decreases, meaning the quality of the quantized model improves, at the cost of storing more quantization parameters (one scale per group) and a somewhat longer quantization run.
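
The effect of the group size can be seen with a minimal round-to-nearest sketch (this is not GPTQ itself, just grouped quantization): every group of consecutive weights in a row gets its own scale, so smaller groups follow the weights more closely at the cost of storing more scales.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)   # a weight matrix like those of a large model

def grouped_rtn_error(W: torch.Tensor, bits: int, group_size: int) -> float:
    """Mean squared error of symmetric round-to-nearest with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    Wg = W.reshape(-1, group_size)                     # one row per group
    scale = Wg.abs().amax(dim=1, keepdim=True) / qmax  # per-group scale
    W_hat = (Wg / scale).round().clamp(-qmax - 1, qmax) * scale
    return (W - W_hat.reshape(W.shape)).pow(2).mean().item()

for g in (128, 64, 32):
    print(f"group_size={g:3d}  MSE: {grouped_rtn_error(W, bits=3, group_size=g):.5f}")
```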

During inference, something called dynamic dequantization is performed: each layer is dequantized on the fly as the input passes through it.

To do this, the authors developed a kernel that dequantizes the matrices and performs the matrix products. Although dequantization requires extra computation, the kernel has to read far less data from memory, which produces significant speedups.

Inference is performed in FP16: the weights are dequantized as the input goes through each layer, and the activations of each layer are also computed in FP16. Although this means doing extra work to dequantize the weights, the overall process is faster, because much less data has to be fetched from memory: the weights are stored in far fewer bits, so for matrices with many parameters a lot of memory traffic is saved. Since the bottleneck is usually fetching data from memory, even though more computation is needed, inference ends up being faster.
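
Conceptually, a quantized linear layer at inference time looks something like the following sketch (made-up shapes and an unpacked integer tensor for readability; the real kernels pack the 4-bit values, run in FP16 and fuse dequantization with the matmul on the GPU):

```python
import torch

def dequantize(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor, group_size: int) -> torch.Tensor:
    """Rebuild approximate weights from integers plus one scale/zero point per group."""
    out_features, in_features = q.shape
    q = q.reshape(out_features, in_features // group_size, group_size)
    w = (q - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(out_features, in_features)

def quantized_linear(x: torch.Tensor, q, scale, zero, group_size: int = 128) -> torch.Tensor:
    w = dequantize(q, scale, zero, group_size)  # weights rebuilt on the fly at every forward pass
    return x @ w.T                              # on the GPU this product runs in FP16

# Tiny fake 4-bit layer, just to show the shapes involved
out_f, in_f, g = 8, 256, 128
q = torch.randint(0, 16, (out_f, in_f))       # 4-bit values (stored unpacked here)
scale = torch.rand(out_f, in_f // g)          # one scale per group
zero = torch.full((out_f, in_f // g), 8)      # one zero point per group
x = torch.randn(2, in_f)
print(quantized_linear(x, q, scale, zero, g).shape)  # torch.Size([2, 8])
```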

The authors of the paper ran a test quantizing the OPT-175B model to 3 bits, which took about 63 GB of VRAM, including the embeddings and the output layer, which are kept in FP16. In addition, maintaining the 2048-token context window consumes about 9 GB more, for a total of about 72 GB of VRAM. They quantized to 3 bits rather than 4 bits so that the experiment would fit on a single NVIDIA A100 GPU with 80 GB of VRAM.

For comparison, regular FP16 inference requires about 350 GB of VRAM, equivalent to 5 NVIDIA A100 GPUs with 80 GB each, and 8-bit inference with llm.int8() requires 3 such GPUs.

Below is a table with the inference speed of the model in FP16 and quantized to 3 bits, on NVIDIA A100 GPUs with 80 GB of VRAM and NVIDIA A6000 GPUs with 48 GB of VRAM.

| GPU (VRAM) | Average time per token in FP16 (ms) | Average time per token in 3 bits (ms) | Speedup | Reduction in GPUs required |
|---|---|---|---|---|
| A6000 (48 GB) | 589 | 130 | ×4.53 | 8 → 2 |
| A100 (80 GB) | 230 | 71 | ×3.24 | 5 → 1 |

For example, using the kernels, the 3-bit OPT-175B model runs on a single A100 (instead of 5) and is approximately 3.25 times faster than the FP16 version in terms of average time per token.

The NVIDIA A6000 GPU has a much lower memory bandwidth, making this strategy even more effective: running the 3-bit OPT-175B model on 2 A6000 GPUs (instead of 8) is approximately 4.53 times faster than the FP16 version.

The authors of the paper implemented the GPTQ library. Other libraries appeared later, such as GPTQ-for-LLaMa, exllama and llama.cpp; however, these focus only on the Llama architecture, so AutoGPTQ became the most popular library, since it covers a wider range of architectures.

This AutoGPTQ library was therefore integrated, through an API, into the transformers library. To use it, you need to install it as indicated in the Installation section of its repository and also have the optimum library installed.

In addition to what is indicated in the Installation section of its repository, it is also advisable to install it from source:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .

This way the GPU quantization kernels developed by the authors of the paper are built and installed.

Let’s see how to quantize a model with the optimum library and the AutoGPTQ API.

We are going to quantize the meta-llama/Meta-Llama-3-8B-Instruct model which, as its name indicates, is an 8B-parameter model, so in FP16 it would need about 16 GB of VRAM. First we run the model to see how much memory it occupies and what output it generates.

Since we have to request access from Meta to use this model, we log in to Hugging Face to be able to download the tokenizer and the model.

Code:

from huggingface_hub import notebook_login

notebook_login()

Instantiate the tokenizer and the model

Code:

from transformers import AutoModelForCausalLM, AutoTokenizer

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint).half().to(device)

Let’s take a look at the memory occupied in FP16

Code:

model_memory = model.get_memory_footprint()/(1024**3)

print(f"Model memory: {model_memory:.2f} GB")

Output:

Model memory: 14.96 GB

We see that it occupies almost 15 GB, slightly less than the 16 GB we estimated. Why the difference? Mainly because we divided by 1024³, so get_memory_footprint is reported in GiB: roughly 8B parameters at 2 bytes each is about 16 decimal gigabytes, which corresponds to about 14.9 GiB.
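
The arithmetic works out if we take the commonly reported figure of about 8.03B parameters for this model (an approximation on my part):

```python
n_params = 8.03e9                          # approximate parameter count of Meta-Llama-3-8B
bytes_fp16 = n_params * 2                  # 2 bytes per parameter in FP16
print(f"{bytes_fp16 / 1e9:.2f} GB")        # ≈ 16.06 decimal gigabytes
print(f"{bytes_fp16 / 1024**3:.2f} GiB")   # ≈ 14.96 GiB, what we computed above
```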

We run an inference to see what it generates and how long it takes.

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model.device)

t0 = time.time()

max_new_tokens = 50

outputs = model.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineer at a startup in the Bay Area. I am passionate about building AI systems that can help humans make better decisions and improve their lives.

I have a background in computer science and mathematics, and I have been working with machine learning for several years. I

Inference time: 4.14 s

Let’s quantize it to 4 bits. I restart the notebook to avoid memory problems, so we log into Hugging Face again.

Code:

from huggingface_hub import notebook_login

notebook_login()

First I create the tokenizer

Code:

from transformers import AutoTokenizer

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we create the quantization configuration. As we said, the algorithm computes the error of the quantized weights with respect to the original weights using inputs from a dataset, so in the configuration we have to specify which dataset to use for the quantization.

The available default options are wikitext2, c4, c4-new, ptb and ptb-new.

We can also create a dataset ourselves from a list of strings

dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
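
In that case we would pass the list directly as the dataset argument (a sketch, not run in this notebook):

```python
from transformers import GPTQConfig

# Hypothetical: calibrate on our own list of strings instead of a predefined dataset
quantization_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
```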

We also have to tell it the number of bits that the quantized model has by means of the bits parameter.

Code:

from transformers import GPTQConfig

quantization_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)

We quantize the model

Code:

from transformers import AutoModelForCausalLM

import time

t0 = time.time()

model_4bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)

t_quantization = time.time() - t0

print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")

Output:

Quantization time: 1932.09 s = 32.20 min

Since the quantization process searches for the quantized weights that minimize the error with respect to the original weights by passing inputs through each layer, quantizing takes a while; in this case it took about half an hour.

Let’s see how much memory it occupies now

Code:

model_4bits_memory = model_4bits.get_memory_footprint()/(1024**3)

print(f"Model memory: {model_4bits_memory:.2f} GB")

Output:

Model memory: 5.34 GB

Here we can see a benefit of quantization. While the original model occupied about 15 GB of VRAM, now the quantized model occupies about 5 GB, almost a third of the original size.
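
A rough back-of-envelope is consistent with the 5.34 GB. The vocabulary and hidden sizes below are the published Llama 3 figures, and I am assuming that the embeddings and the lm_head stay in FP16 while everything else is quantized with group_size=128; treat it as an estimate, not an exact breakdown:

```python
vocab, hidden = 128_256, 4_096             # Llama 3 vocabulary and hidden size
embed_params = 2 * vocab * hidden          # embed_tokens + lm_head, assumed kept in FP16
linear_params = 8.03e9 - embed_params      # everything that gets quantized to 4 bits

fp16_bytes = embed_params * 2              # 2 bytes per FP16 parameter
int4_bytes = linear_params * 0.5           # 4 bits = 0.5 bytes per parameter
group_bytes = (linear_params / 128) * 2.5  # ~1 FP16 scale + packed zero point per group of 128

print(f"{(fp16_bytes + int4_bytes + group_bytes) / 1024**3:.2f} GiB")  # ≈ 5.3 GiB
```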

We run inference and see how long it takes

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_4bits.device)

t0 = time.time()

max_new_tokens = 50

outputs = model_4bits.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineer. I have a strong background in computer science and mathematics, and I am passionate about developing innovative solutions that can positively impact society. I am excited to be a part of this community and to learn from and contribute to the discussions here. I am particularly

Inference time: 2.34 s

The unquantized model took 4.14 seconds, while quantized to 4 bits it took 2.34 seconds and still generated good text. We have cut the inference time almost in half.

Since the quantized model is almost a third of the size of the FP16 model, we might expect inference to be about three times faster. But remember that in each layer the weights are dequantized and the computations are performed in FP16, so we only managed to cut the inference time roughly in half, not to a third.

Now we save the model

Code:

save_folder = "./model_4bits/"

model_4bits.save_pretrained(save_folder)

tokenizer.save_pretrained(save_folder)

Output:

('./model_4bits/tokenizer_config.json',

'./model_4bits/special_tokens_map.json',

'./model_4bits/tokenizer.json')

And we upload it to the hub

Code:

repo_id = "Llama-3-8B-Instruct-GPTQ-4bits"

commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"

model_4bits.push_to_hub(repo_id, commit_message=commit_message)

Output:

CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-4bits/commit/44cfdcad78db260122943d3f57858c1b840bda17', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 4bits, gr128, desc_act=False', commit_description='', oid='44cfdcad78db260122943d3f57858c1b840bda17', pr_url=None, pr_revision=None, pr_num=None)

We also upload the tokenizer. Even though we have not changed it, we upload it so that anyone who downloads our model from the Hub does not have to figure out which tokenizer we used: most people will want to download the model and the tokenizer together. We could indicate the tokenizer in the model card, but most likely nobody would read it; they would try to download the tokenizer, get an error and not know what to do. So we upload it to avoid that problem.

Code:

repo_id = "Llama-3-8B-Instruct-GPTQ-4bits"

commit_message = f"Tokenizers for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"

tokenizer.push_to_hub(repo_id, commit_message=commit_message)

Output:

CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-4bits/commit/75600041ca6e38b5f1fb912ad1803b66656faae4', commit_message='Tokenizers for meta-llama/Meta-Llama-3-8B-Instruct: 4bits, gr128, desc_act=False', commit_description='', oid='75600041ca6e38b5f1fb912ad1803b66656faae4', pr_url=None, pr_revision=None, pr_num=None)

Let’s quantize it to 3 bits. I restart the notebook to avoid memory problems and log in to Hugging Face again.

Code:

from huggingface_hub import notebook_login

notebook_login()

First I create the tokenizer

Code:

from transformers import AutoTokenizer

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We create the quantization configuration, this time indicating that we want to quantize to 3 bits

Code:

from transformers import GPTQConfig

quantization_config = GPTQConfig(bits=3, dataset = "c4", tokenizer=tokenizer)

We quantize the model

Code:

from transformers import AutoModelForCausalLM

import time

t0 = time.time()

model_3bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)

t_quantization = time.time() - t0

print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")

Output:

Quantization time: 1912.69 s = 31.88 min

As before, it has taken about half an hour.

Let’s see how much memory it occupies now

Code:

model_3bits_memory = model_3bits.get_memory_footprint()/(1024**3)

print(f"Model memory: {model_3bits_memory:.2f} GB")

Output:

Model memory: 4.52 GB

The memory occupied by the 3-bit model is also almost 5 GB. The 4-bit model occupied 5.34 GB, while now in 3-bit it occupies 4.52 GB, so we have managed to reduce the size of the model a little more.

We run inference and see how long it takes

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_3bits.device)

t0 = time.time()

max_new_tokens = 50

outputs = model_3bits.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineer at Google. I am excited to be here today to talk about my work in the field of Machine Learning and to share some of the insights I have gained through my experiences.

I am a Machine Learning Engineer at Google, and I am excited to be

Inference time: 2.89 s

Although the 3-bit output is good, the inference time was now 2.89 seconds, whereas at 4 bits it was 2.34 seconds. More tests would be needed to see whether 4 bits is always faster, or whether the difference is so small that sometimes 3-bit inference is faster and sometimes 4-bit inference is.

In addition, although the output makes sense, it starts to become repetitive.

Save the model

Code:

save_folder = "./model_3bits/"

model_3bits.save_pretrained(save_folder)

tokenizer.save_pretrained(save_folder)

Output:

('./model_3bits/tokenizer_config.json',

'./model_3bits/special_tokens_map.json',

'./model_3bits/tokenizer.json')

And we upload it to the hub

Code:

repo_id = "Llama-3-8B-Instruct-GPTQ-3bits"

commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"

model_3bits.push_to_hub(repo_id, commit_message=commit_message)

Output:

CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-3bits/commit/422fd94a031234c10224ddbe09c0e029a5e9c01f', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 3bits, gr128, desc_act=False', commit_description='', oid='422fd94a031234c10224ddbe09c0e029a5e9c01f', pr_url=None, pr_revision=None, pr_num=None)

We also upload the tokenizer

Code:

repo_id = "Llama-3-8B-Instruct-GPTQ-3bits"

commit_message = f"Tokenizers for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"

tokenizer.push_to_hub(repo_id, commit_message=commit_message)

Output:

CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-4bits/commit/75600041ca6e38b5f1fb912ad1803b66656faae4', commit_message='Tokenizers for meta-llama/Meta-Llama-3-8B-Instruct: 4bits, gr128, desc_act=False', commit_description='', oid='75600041ca6e38b5f1fb912ad1803b66656faae4', pr_url=None, pr_revision=None, pr_num=None)

Let’s quantize it to 2 bits. I restart the notebook to avoid memory problems and log in again in Hugging Face

Code:

from huggingface_hub import notebook_login

notebook_login()

First I create the tokenizer

Code:

from transformers import AutoTokenizer

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We create the quantization configuration. Now we tell it to quantize to 2 bits. We also have to indicate how many consecutive weights share each quantization scale via the group_size parameter; until now we left it at its default value of 128, but when quantizing to 2 bits we use a smaller value to keep the error down. If we left it at 128, the quantized model would work very badly, so in this case I use a value of 16.

Code:

from transformers import GPTQConfig

quantization_config = GPTQConfig(bits=2, dataset = "c4", tokenizer=tokenizer, group_size=16)

Code:

from transformers import AutoModelForCausalLM

import time

t0 = time.time()

model_2bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)

t_quantization = time.time() - t0

print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")

Output:

Quantization time: 1973.12 s = 32.89 min

We see that it has also taken about half an hour.

Let’s see how much memory it occupies now

Code:

model_2bits_memory = model_2bits.get_memory_footprint()/(1024**3)

print(f"Model memory: {model_2bits_memory:.2f} GB")

Output:

Model memory: 4.50 GB

While quantized at 4 bits it occupied 5.34 GB and at 3 bits it occupied 4.52 GB, now quantized at 2 bits it occupies 4.50 GB, so we have managed to reduce the size of the model a little more.

We run inference and see how long it takes

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_2bits.device)

t0 = time.time()

max_new_tokens = 50

outputs = model_2bits.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineer. # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

Inference time: 2.92 s

We see that the output is no longer good, and the inference time is 2.92 seconds, more or less the same as with 3 and 4 bits.

Save the model

Code:

save_folder = "./model_2bits/"

model_2bits.save_pretrained(save_folder)

tokenizer.save_pretrained(save_folder)

Output:

('./model_2bits/tokenizer_config.json',

'./model_2bits/special_tokens_map.json',

'./model_2bits/tokenizer.json')

We upload it to the hub

Code:

repo_id = "Llama-3-8B-Instruct-GPTQ-2bits"

commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"

model_2bits.push_to_hub(repo_id, commit_message=commit_message)

Output:

CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-2bits/commit/13ede006ce0dbbd8aca54212e960eff98ea5ec63', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 2bits, gr16, desc_act=False', commit_description='', oid='13ede006ce0dbbd8aca54212e960eff98ea5ec63', pr_url=None, pr_revision=None, pr_num=None)

Let’s quantize it to 1 bit. I restart the notebook to avoid memory problems and log in again in Hugging Face

Code:

from huggingface_hub import notebook_login

notebook_login()

First I create the tokenizer

Code:

from transformers import AutoTokenizer

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We create the quantization configuration, aiming this time for just 1 bit and an even smaller group_size of 8. Note, however, that GPTQConfig only supports 2, 3, 4 and 8 bits, so the code below actually keeps bits=2 and simply reduces group_size to 8; it is the closest we can get to a "1-bit" setting.

Code:

from transformers import GPTQConfig

quantization_config = GPTQConfig(bits=2, dataset = "c4", tokenizer=tokenizer, group_size=8)

Code:

from transformers import AutoModelForCausalLM

import time

t0 = time.time()

model_1bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)

t_quantization = time.time() - t0

print(f"Quantization time: {t_quantization:.2f} s = {t_quantization/60:.2f} min")

Output:

Quantization time: 2030.38 s = 33.84 min

We see that it also takes about half an hour to quantize

Let’s see how much memory it occupies now

Code:

model_1bits_memory = model_1bits.get_memory_footprint()/(1024**3)

print(f"Model memory: {model_1bits_memory:.2f} GB")

Output:

Model memory: 5.42 GB

We see that in this case it occupies even more than the 2-bit model with group_size=16 (4.50 GB). This is because with a group_size of 8 far more quantization parameters (one scale per group of just 8 weights) have to be stored.

We run inference and see how long it takes

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(model_1bits.device)

t0 = time.time()

max_new_tokens = 50

outputs = model_1bits.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineerimerszuimersimerspinsimersimersingoingoimersurosimersimersimersoleningoimersingopinsimersbirpinsimersimersimersorgeingoimersiringimersimersimersimersimersimersimersンディorge_REFERER ingest羊imersorgeimersimersendetingoШАhandsingo

Inference time: 3.12 s

We see that the output is very bad, and it also takes longer than when we quantized to 2 bits.

Save the model

Code:

save_folder = "./model_1bits/"

model_1bits.save_pretrained(save_folder)

tokenizer.save_pretrained(save_folder)

Output:

('./model_1bits/tokenizer_config.json',

'./model_1bits/special_tokens_map.json',

'./model_1bits/tokenizer.json')

We upload it to the hub

Code:

repo_id = "Llama-3-8B-Instruct-GPTQ-1bits"

commit_message = f"AutoGPTQ model for {checkpoint}: {quantization_config.bits}bits, gr{quantization_config.group_size}, desc_act={quantization_config.desc_act}"

model_1bits.push_to_hub(repo_id, commit_message=commit_message)

Output:

CommitInfo(commit_url='https://huggingface.co/Maximofn/Llama-3-8B-Instruct-GPTQ-2bits/commit/e59ccffc03247e7dcc418f98b482cc02dc7a168d', commit_message='AutoGPTQ model for meta-llama/Meta-Llama-3-8B-Instruct: 2bits, gr8, desc_act=False', commit_description='', oid='e59ccffc03247e7dcc418f98b482cc02dc7a168d', pr_url=None, pr_revision=None, pr_num=None)

Let’s compare the quantization at 4, 3, 2 and 1 bits

| Bits | Quantization time (min) | Memory (GB) | Inference time (s) | Output quality |
|---|---|---|---|---|
| FP16 | 0 | 14.96 | 4.14 | Good |
| 4 | 32.20 | 5.34 | 2.34 | Good |
| 3 | 31.88 | 4.52 | 2.89 | Good |
| 2 | 32.89 | 4.50 | 2.92 | Poor |
| 1 | 33.84 | 5.42 | 3.12 | Poor |

Looking at this table we see that it does not make sense, in this example, to quantize to less than 4 bits.

Quantizing at 1 and 2 bits clearly does not make sense because the output quality is poor.

And although the output when quantizing to 3 bits is good, it started to become repetitive, so in the long run it would probably not be a good idea to use that model. Besides, neither the quantization time, nor the VRAM savings, nor the inference time improve significantly compared to quantizing to 4 bits.

Now that we have compared the different quantizations, let’s see how to load the 4-bit model we saved, since, as we have seen, it is the best choice.

First we load the tokenizer that we have used

Code:

from transformers import AutoTokenizer

path = "./model_4bits"

tokenizer = AutoTokenizer.from_pretrained(path)

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we load the model we have saved

Code:

from transformers import AutoModelForCausalLM

load_model_4bits = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

Output:

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<?, ?it/s]

Let’s see how much memory it occupies

Code:

load_model_4bits_memory = load_model_4bits.get_memory_footprint()/(1024**3)

print(f"Model memory: {load_model_4bits_memory:.2f} GB")

Output:

Model memory: 5.34 GB

We see that it occupies the same memory as when we quantized it, which is logical.

We run inference and see how long it takes

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(load_model_4bits.device)

t0 = time.time()

max_new_tokens = 50

outputs = load_model_4bits.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineer. I have a strong background in computer science and mathematics, and I have been working with machine learning models for several years. I am excited to be a part of this community and to share my knowledge and experience with others. I am particularly interested in

Inference time: 3.82 s

We see that the inference is good and that it took 3.82 seconds, a bit longer than when we quantized it. But, as mentioned before, we would have to run this test many times and take an average.
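
A minimal sketch of such an averaged measurement could look like this (benchmark is a hypothetical helper of mine, reusing the model and inputs defined above):

```python
import time

import torch

def benchmark(model, input_tokens, max_new_tokens=50, n_runs=5):
    """Average the generation time over several runs to smooth out the noise."""
    times = []
    for _ in range(n_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.time()
        model.generate(
            input_ids=input_tokens.input_ids,
            attention_mask=input_tokens.attention_mask,
            max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
        )
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.time() - t0)
    return sum(times) / len(times)

# print(f"Average inference time: {benchmark(load_model_4bits, input_tokens):.2f} s")
```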

Now we see how to load the 4-bit model that we have uploaded to the hub

First we load the tokenizer that we have uploaded

Code:

from transformers import AutoTokenizer

checkpoint = "Maximofn/Llama-3-8B-Instruct-GPTQ-4bits"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we load the model we uploaded to the Hub

Code:

from transformers import AutoModelForCausalLM

load_model_4bits = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

Let’s see how much memory it occupies

Code:

load_model_4bits_memory = load_model_4bits.get_memory_footprint()/(1024**3)

print(f"Model memory: {load_model_4bits_memory:.2f} GB")

Output:

Model memory: 5.34 GB

It also occupies the same memory

We run inference and see how long it takes

Code:

import time

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(load_model_4bits.device)

t0 = time.time()

max_new_tokens = 50

outputs = load_model_4bits.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"Inference time: {time.time() - t0:.2f} s")

Output:

Hello my name is Maximo and I am a Machine Learning Engineer with a passion for building innovative AI solutions. I have been working in the field of AI for over 5 years, and have gained extensive experience in developing and implementing machine learning models for various industries.

In my free time, I enjoy reading books on

Inference time: 3.81 s

We see that the inference is also good and took 3.81 seconds.
