This notebook has been automatically translated to make it accessible to more people. Please let me know if you see any typos.

In the post LLMs quantization we explained why quantizing LLMs matters for saving memory. We also covered zero-point quantization, which transforms the weight values linearly onto the integer range, and noted its main problem: language models start to degrade once they exceed 2.7B parameters.
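As a quick reminder of why a linear mapping struggles, here is a minimal NumPy sketch of zero-point quantization. It uses one common formulation of the scale and zero point, which may differ in detail from the one in the earlier post, and the function names `zeropoint_quantize` and `zeropoint_dequantize` are made up for this illustration. It compares the round-trip error of a small weight vector with and without a single large value.

```python
import numpy as np

def zeropoint_quantize(x: np.ndarray):
    """Map float values linearly onto the signed 8-bit range [-128, 127]."""
    x_min, x_max = float(x.min()), float(x.max())
    value_range = x_max - x_min
    scale = 255.0 / value_range if value_range > 0 else 1.0
    zero_point = float(np.round(-scale * x_min)) - 128.0
    q = np.clip(np.round(scale * x + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def zeropoint_dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Invert the linear mapping (up to rounding error)."""
    return (q.astype(np.float32) - zero_point) / scale

# Small weights only: the 256 integer levels cover a narrow range
weights = np.array([-0.9, -0.3, 0.0, 0.2, 0.7], dtype=np.float32)
q, scale, zp = zeropoint_quantize(weights)
max_error = float(np.abs(weights - zeropoint_dequantize(q, scale, zp)).max())

# Same weights plus one large value: the levels are now spread over a much wider range
with_outlier = np.append(weights, np.float32(40.0))
q_o, scale_o, zp_o = zeropoint_quantize(with_outlier)
max_error_outlier = float(np.abs(with_outlier - zeropoint_dequantize(q_o, scale_o, zp_o)).max())

print(f"max error without outlier: {max_error:.4f}")
print(f"max error with outlier:    {max_error_outlier:.4f}")
```

A single large value stretches the quantization grid, so the small values are reconstructed much less precisely. That is the situation the rest of this post tries to avoid.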

Since quantizing every parameter of the model introduces errors in large language models, the llm.int8() paper proposes vector-wise quantization: the weight matrices are split into vectors, some of which can be quantized to 8 bits while others cannot. The vectors that can be quantized are converted to 8 bits and multiplied in INT8 format, while the vectors that cannot be quantized are kept in FP16 format and multiplied in FP16 format.

Let’s see it with an example

Suppose we have the matrix

and we want to multiply it by the matrix

We set a threshold value, and every column of the first matrix that contains a value greater than that threshold (in absolute value) is left in FP16 format. The rows of the second matrix that correspond to those columns of the first matrix are also left in FP16 format.

To put it more concretely: since the second and fourth columns of the first matrix (the yellow columns) contain values greater than the threshold, the second and fourth rows of the second matrix (the yellow rows) are left in FP16 format.

The same would apply if the large values were in the second matrix: if one of its rows contained a value greater than the threshold, that row would be left in FP16 format, and so would the corresponding column of the first matrix.

The remaining rows and columns, which are not left in FP16 format, are quantized to 8 bits and their multiplications are performed in INT8 format.

So we separate the first matrix into the two matrices

And the second matrix into the two matrices

We multiply the matrices in INT8 on one side

and the matrices in FP16 format on the other

As you can see, multiplying the matrices in INT8 format gives us a matrix of size 3×2, and multiplying the matrices in FP16 format gives us another matrix of size 3×2, so if we add them together

Interestingly, it gives us the same result as if we had multiplied the original matrices
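This can be checked numerically. The sketch below uses made-up 3×4 and 4×2 integer matrices (not the ones from the example above) in which the second and fourth columns of the first matrix contain values above a hypothetical threshold of 6. No quantization is applied here; the point is only that splitting and adding gives exactly the full product.

```python
import numpy as np

threshold = 6  # threshold used for this toy example

# Hypothetical matrices: columns 2 and 4 of A hold values above the threshold
A = np.array([[1,  9, -2,  12],
              [0, -8,  3,   7],
              [2, 10,  1, -11]])
B = np.array([[ 1, -1],
              [ 2,  0],
              [ 0,  3],
              [-2,  1]])

# A column of A is an outlier column if any of its entries exceeds the threshold in magnitude
outlier = (np.abs(A) > threshold).any(axis=0)

# Outlier columns of A go with the matching rows of B, and likewise for the rest
A_out, B_out = A[:, outlier], B[outlier, :]
A_reg, B_reg = A[:, ~outlier], B[~outlier, :]

partial_out = A_out @ B_out   # the part that would stay in FP16
partial_reg = A_reg @ B_reg   # the part that would be done in INT8
total = partial_out + partial_reg
full = A @ B

print(outlier)
print(np.array_equal(total, full))
```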

To see why this happens, let's expand the matrix product of the two original matrices

We see that the separation we have made does not cause problems.
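In general form, the argument can be written as follows. The notation is introduced here for illustration: $A$ and $B$ are the two matrices, $n$ is their shared dimension, and $O$ is the set of indices of the columns of $A$ (and rows of $B$) that are kept in FP16.

```latex
(AB)_{ij} \;=\; \sum_{k=1}^{n} a_{ik}\, b_{kj}
\;=\; \underbrace{\sum_{k \in O} a_{ik}\, b_{kj}}_{\text{FP16 part}}
\;+\; \underbrace{\sum_{k \notin O} a_{ik}\, b_{kj}}_{\text{INT8 part}}
```

Each element of the product is a sum over $k$, and a sum can be split into two groups of terms in any way we like, so the two partial products always add up to the full product.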

Therefore, we can conclude that matrices can be split by rows and columns to perform matrix multiplications. The split is made whenever any element of a row or column is greater than a threshold value: the rows or columns with no value above the threshold are encoded in INT8, taking only one byte per element, while the rows or columns with at least one element above the threshold are kept in FP16, taking 2 bytes per element. This keeps rounding problems under control, since the INT8 calculations only involve values within a small range that 8 bits can represent with reasonable precision.
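The whole procedure can be sketched in NumPy as follows. This is a simplified illustration under several assumptions, not how bitsandbytes implements it internally. The names `absmax_quantize` and `int8_mixed_matmul` are hypothetical, and the INT8 part uses absmax scaling with one scale per row of the first matrix and one per column of the second.

```python
import numpy as np

def absmax_quantize(x: np.ndarray, axis: int):
    """Vector-wise absmax quantization to INT8: one scale per vector along `axis`."""
    absmax = np.abs(x).max(axis=axis, keepdims=True, initial=0.0)
    absmax = np.where(absmax == 0, 1.0, absmax)  # avoid dividing by zero
    scale = 127.0 / absmax
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def int8_mixed_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Multiply X @ W, keeping outlier columns of X (and matching rows of W) in FP16."""
    outlier = (np.abs(X) > threshold).any(axis=0)

    # Outlier part: multiplied in FP16
    out_fp16 = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)

    # Regular part: quantized to INT8, accumulated in INT32, then rescaled
    Xq, sx = absmax_quantize(X[:, ~outlier], axis=1)  # one scale per row of X
    Wq, sw = absmax_quantize(W[~outlier, :], axis=0)  # one scale per column of W
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)
    out_int8 = acc / (sx * sw)

    return out_int8 + out_fp16.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16)).astype(np.float32)
W = rng.normal(size=(16, 8)).astype(np.float32)
X[0, 3] = 40.0  # inject one value above the threshold so column 3 is treated as an outlier

result = int8_mixed_matmul(X, W)
reference = X @ W
print("max abs difference:", float(np.abs(result - reference).max()))
```

The difference from the full-precision product should be small, because the large value never passes through the 8-bit grid and the remaining values all lie within the threshold.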

As we have said, we split off the rows and columns that contain some element greater than a threshold value, but which threshold value should we choose? The authors of the paper experimented with several values and determined that this threshold should be α=6. Above that value they started to see degradation in the language models.

Let’s see how to quantize a model with llm.int8() using the transformers library. To do this you must have **bitsandbytes** installed.

`pip install bitsandbytes`

We load a 1B parameter model twice: once normally, and once quantized with llm.int8().

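from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Reference model, loaded without quantization
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# llm.int8() options go through BitsAndBytesConfig (passing load_in_8bit directly to
# from_pretrained is deprecated in recent transformers releases); 6.0 is the default threshold
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

model_8bit = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=quantization_config)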

Let’s see how much memory each model takes up

model.get_memory_footprint()/(1024**3), model_8bit.get_memory_footprint()/(1024**3)

As can be seen, the quantized model occupies much less memory.
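A back-of-the-envelope estimate shows roughly what to expect. The sketch below assumes about 1.1 billion parameters (taken from the checkpoint name), 4 bytes per parameter in FP32 and 1 byte per parameter in INT8. The helper `estimated_footprint_gib` is hypothetical and only does the arithmetic.

```python
GIB = 1024**3

def estimated_footprint_gib(num_params: float, bytes_per_param: float) -> float:
    """Rough memory estimate: parameter count times bytes per parameter, in GiB."""
    return num_params * bytes_per_param / GIB

num_params = 1.1e9  # assumption based on the checkpoint name, not a measured value

fp32_gib = estimated_footprint_gib(num_params, 4)  # FP32: 4 bytes per parameter
int8_gib = estimated_footprint_gib(num_params, 1)  # INT8: 1 byte per parameter

print(f"FP32: ~{fp32_gib:.2f} GiB, INT8: ~{int8_gib:.2f} GiB")
# → FP32: ~4.10 GiB, INT8: ~1.02 GiB
```

The value reported by `get_memory_footprint()` for the 8-bit model will probably be somewhat higher than this estimate, since parts of the model that are not quantized stay in a higher precision.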

Let’s now do a text generation test with the two models

input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(device)

input_tokens.input_ids

Let’s look at the output of the normal model

import time

t0 = time.time()

max_new_tokens = 50

outputs = model.generate(

input_ids=input_tokens.input_ids,

attention_mask=input_tokens.attention_mask,

max_length=input_tokens.input_ids.shape[1] + max_new_tokens,

)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(time.time() - t0)

And now with the quantized model

t0 = time.time()

max_new_tokens = 50

outputs = model_8bit.generate(

input_ids=input_tokens.input_ids,

attention_mask=input_tokens.attention_mask,

max_length=input_tokens.input_ids.shape[1] + max_new_tokens,

)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(time.time() - t0)

We see two things. On the one hand, we get the same text at the output, so a much smaller model can produce the same result. On the other hand, the quantized model takes much longer to run, so it would not be advisable if you need to use the model in real time.

This seems counterintuitive, because we might expect a smaller model to run faster. In reality, though, the two models, the normal one and the quantized one, perform the same operations; one simply does them all in FP32 while the other does them in INT8 and FP16. On top of that, the quantized model has to find the rows and columns with values greater than the threshold, separate them, perform the operations in INT8 and FP16, and then put the results back together, which is why it takes longer to run.
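The timings above come from a single call per model, so they also include one-off costs such as the first GPU call. A slightly more robust comparison could use a small helper like the one below, which runs a warm-up call and then averages several runs. The helper `average_runtime` is hypothetical and a stand-in workload is timed here so the snippet runs on its own.

```python
import time

def average_runtime(fn, warmup: int = 1, repeats: int = 3) -> float:
    """Call fn a few times to warm up, then return the mean duration of `repeats` calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Stand-in workload; with the models from this post one could pass, for example,
# lambda: model.generate(**input_tokens, max_new_tokens=50)
elapsed = average_runtime(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s per call")
```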