llm.int8() - 8-bit Matrix Multiplication for Transformers at Scale
This notebook has been automatically translated to make it accessible to more people; please let me know if you spot any typos.
In the post LLMs quantization we explained why quantizing LLMs matters for saving memory. We also covered zero-point quantization, which maps the weight values linearly to a lower-precision range, but this approach degrades language models once they exceed 2.7B parameters.
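As a quick reminder (the full explanation is in that post, and the notation here is mine), zero-point quantization maps the values to 8 bits with an affine transformation along these lines:

$$
s = \frac{\max(x) - \min(x)}{255}, \qquad z = -\text{round}\left(\frac{\min(x)}{s}\right), \qquad x_q = \text{round}\left(\frac{x}{s}\right) + z
$$

so that the original values can be approximately recovered as $\hat{x} = s\,(x_q - z)$; the rounding step is where the quantization error comes from.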
Vector quantization
Since quantizing all of a model's parameters introduces errors in large language models, the llm.int8() paper proposes vector quantization: the weight matrices are split into vectors, some of which can be quantized to 8 bits and some of which cannot. The vectors that can be quantized are converted to INT8 and their matrix multiplications are performed in INT8 format, while the vectors that cannot are kept in FP16 and their multiplications are performed in FP16 format.
Let's see it with an example
Suppose we have the matrix
and we want to multiply it by the matrix
We set a threshold value: every column of the first matrix that contains a value greater than that threshold is kept in FP16 format, and the corresponding rows of the second matrix are also kept in FP16 format.

To put it more clearly: since the second and fourth columns of the first matrix (yellow columns) contain values greater than a certain threshold, the second and fourth rows of the second matrix (yellow rows) are kept in FP16 format.

The same is done if there are values above the threshold in the second matrix: for example, if a row of the second matrix had a value greater than the threshold, that row would be kept in FP16 format, and the corresponding column of the first matrix would be kept in FP16 format as well.

The remaining rows and columns that are not kept in FP16 format are quantized to 8 bits and their multiplications are performed in INT8 format.
So we separate the first matrix into the two matrices
And the second matrix into the two matrices
We multiply the matrices in INT8 on one side
and those in FP16 format on the other
As you can see, multiplying the matrices in INT8 format gives us a matrix of size 3x2, and multiplying the matrices in FP16 format gives us another matrix of size 3x2, so if we add them together
Interestingly, it gives us the same result as if we had multiplied the original matrices
To see why this happens, we can expand the product of the two original matrices
We see that the separation we have made does not cause problems.
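In general (using my own notation here, not necessarily that of the figures above): if $A$ is the first matrix and $B$ the second, each entry of the product is a sum over the shared dimension $k$,

$$
(AB)_{ij} = \sum_{k} A_{ik} B_{kj} = \sum_{k \in \bar{O}} A_{ik} B_{kj} \;+\; \sum_{k \in O} A_{ik} B_{kj},
$$

where $O$ is the set of outlier columns of $A$ (and the matching rows of $B$) and $\bar{O}$ is the rest. Splitting by columns of $A$ and the corresponding rows of $B$ simply partitions that sum into two groups, one computed in INT8 and one in FP16, so adding the two partial results recovers the full product.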
Therefore, we can conclude that we can split the rows and columns of the matrices and still perform the matrix multiplication. The split is made whenever any element of a row or column is greater than a threshold value: rows and columns with no value above the threshold are encoded in INT8, occupying only one byte per element, while rows and columns with at least one element above the threshold are kept in FP16, occupying two bytes per element. This way we avoid rounding problems, since the calculations done in INT8 only involve values that stay within the 8-bit range.
Threshold value α
As we have said, we separate out the rows and columns that have some element greater than a threshold value, but what threshold value should we choose? The authors of the paper ran experiments with several values and determined that this threshold should be α=6; above that value they started to see degradation in the language models.
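To make the idea concrete, here is a minimal NumPy sketch of the decomposition described above. This is only an illustration of the technique, not the actual bitsandbytes implementation, and the matrices and threshold used here are made up for the example.

```python
import numpy as np

def mixed_precision_matmul(A, B, threshold=6.0):
    """Illustrative llm.int8()-style matmul: outlier columns of A (and the
    matching rows of B) stay in FP16, the rest is absmax-quantized to INT8."""
    # Columns of A that contain at least one value above the threshold
    outlier_cols = np.any(np.abs(A) > threshold, axis=0)
    regular_cols = ~outlier_cols

    # FP16 path: outlier columns of A times the matching rows of B
    out_fp16 = A[:, outlier_cols].astype(np.float16) @ B[outlier_cols, :].astype(np.float16)

    # INT8 path: absmax quantization, one scale per row of A and per column of B
    A_reg, B_reg = A[:, regular_cols], B[regular_cols, :]
    scale_A = np.abs(A_reg).max(axis=1, keepdims=True) / 127.0
    scale_B = np.abs(B_reg).max(axis=0, keepdims=True) / 127.0
    A_int8 = np.round(A_reg / scale_A).astype(np.int8)
    B_int8 = np.round(B_reg / scale_B).astype(np.int8)
    # Accumulate in INT32 and dequantize back to float
    out_int8 = (A_int8.astype(np.int32) @ B_int8.astype(np.int32)) * scale_A * scale_B

    return out_fp16.astype(np.float32) + out_int8.astype(np.float32)

# Made-up example: the second column of A contains an outlier (45.1 > 6)
A = np.array([[ 1.2, 45.1, -0.8],
              [-0.5,  0.3,  2.1],
              [ 0.7, -0.2, -1.4]], dtype=np.float32)
B = np.random.randn(3, 2).astype(np.float32)

# The mixed-precision result matches the full-precision product up to a small quantization error
print(np.abs(mixed_precision_matmul(A, B) - A @ B).max())
```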
Use of llm.int8()
Let's see how to quantize a model with llm.int8() using the transformers library. To do this you need to have bitsandbytes installed:
```
pip install bitsandbytes
```
We load a 1B parameter model twice, once normally and the second time quantizing it with llm.int8().
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
model_8bit = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
```
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
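As the warning says, newer versions of transformers prefer a `BitsAndBytesConfig` object over the `load_in_8bit` argument. If you run into this, the equivalent call should look roughly like this (note that `llm_int8_threshold`, which defaults to 6.0, is where the threshold α discussed above is exposed):

```python
from transformers import BitsAndBytesConfig

# Equivalent 8-bit load using the quantization_config API
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)
model_8bit = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=bnb_config)
```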
Let's see how much memory each model takes up
```python
model.get_memory_footprint()/(1024**3), model_8bit.get_memory_footprint()/(1024**3)
```
(4.098002195358276, 1.1466586589813232)
As can be seen, the quantized model occupies much less memory.
Let's now do a text generation test with the two models
```python
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(device)
input_tokens.input_ids
```
tensor([[ 1, 15043, 590, 1024, 338, 5918, 4200, 322, 306, 626, 263, 6189, 29257, 10863, 261]], device='cuda:0')
We see the output with the normal model
```python
import time

t0 = time.time()
max_new_tokens = 50
outputs = model.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t0)
```
Hello my name is Maximo and I am a Machine Learning Engineer. I am currently working at [Company Name] as a Machine Learning Engineer. I have a Bachelor's degree in Computer Science from [University Name] and a Master's degree in Computer Science from [University Name]. I
1.7616662979125977
And now with the quantized model
```python
t0 = time.time()
max_new_tokens = 50
outputs = model_8bit.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t0)
```
Hello my name is Maximo and I am a Machine Learning Engineer. I am currently working at [Company Name] as a Machine Learning Engineer. I have a Bachelor's degree in Computer Science from [University Name] and a Master's degree in Computer Science from [University Name]. I
9.100712776184082
We see two things. On the one hand, we get the same text as output, so a model that occupies much less memory can produce the same result. On the other hand, the quantized model takes much longer to run, so it would not be advisable if you need to use the model in real time.
This may seem contradictory, because we might expect the model that takes up less memory to run faster. But in reality both models, the normal one and the quantized one, perform the same operations; one simply does them all in FP32 while the other does them in INT8 and FP16. However, the quantized model also has to find the rows and columns with values above the threshold, separate them, perform the operations in INT8 and FP16, and then put the results back together, which is why it takes longer to run.