llm.int8() – 8-bit Matrix Multiplication for Transformers at Scale


This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

In the post LLMs quantization we explained the importance of quantizing LLMs to save memory. We also covered zero-point quantization, which transforms the weight values linearly, but this approach degrades language models once they exceed 2.7B parameters.

llm.int8()-degradation
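As a reminder, here is a minimal sketch of zero-point quantization (the function names are just illustrative, not from any library): the range of the tensor is mapped linearly onto the INT8 range, which is exactly the kind of rounding that starts to hurt once large outlier values appear.

```python
import torch

def zero_point_quantize(x: torch.Tensor):
    # Map the range [x.min(), x.max()] linearly onto the INT8 range [-128, 127]
    scale = 255 / (x.max() - x.min())
    zero_point = (-128 - scale * x.min()).round()
    x_q = torch.clamp((scale * x + zero_point).round(), -128, 127).to(torch.int8)
    return x_q, scale, zero_point

def zero_point_dequantize(x_q: torch.Tensor, scale, zero_point):
    # Approximate recovery of the original values
    return (x_q.float() - zero_point) / scale

x = torch.randn(4)
x_q, scale, zp = zero_point_quantize(x)
print(x)
print(zero_point_dequantize(x_q, scale, zp))  # close, but not identical
```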

Vector quantization

Since quantizing all of a model's parameters introduces errors in large language models, the llm.int8() paper proposes vector-wise quantization: the weight matrices are split into vectors, some of which can be quantized to 8 bits and some of which cannot. The vectors that can be quantized are converted to INT8 and their matrix multiplications are performed in INT8, while the vectors that cannot are kept in FP16 and their multiplications are performed in FP16.

Let's see it with an example

Suppose we have the matrix

llm.int8()-A

and we want to multiply it by the matrix

llm.int8()-B

We set a threshold value: every column of the first matrix that contains a value greater than that threshold is kept in FP16, and the corresponding rows of the second matrix are also kept in FP16.

To put it more concretely: since the second and fourth columns of the first matrix (yellow columns) contain values greater than the threshold, the second and fourth rows of the second matrix (yellow rows) are kept in FP16.

The same would be done if there were values above the threshold in the second matrix: if a row of the second matrix had a value greater than the threshold, it would be kept in FP16, and the corresponding column of the first matrix would be kept in FP16 as well.

The remaining rows and columns, which are not kept in FP16, are quantized to 8 bits, and their multiplications are performed in INT8.

So we separate the first matrix into these two matrices

llm.int8()-A_separated

And the second matrix into these two matrices

llm.int8()-B_separated

We multiply the matrices in INT8 on one side

llm.int8()-AxB-int8

and those in FP16 format on the other

llm.int8()-AxB-fp16

As you can see, multiplying the matrices in INT8 format gives us a matrix of size 3x2, and multiplying the matrices in FP16 format gives us another matrix of size 3x2, so if we add them together

llm.int8()-fp16+int8

Interestingly, it gives us the same result as if we had multiplied the original matrices

llm.int8()-AxB

To see why this happens, let's expand the vector product of the two original matrices

llm.int8()-AxB-explained

We see that the separation we have made does not cause problems.

Therefore, we can conclude that we can split the rows and columns of the matrices before performing the matrix multiplications. The split is made whenever some element of a row or column is greater than a threshold value: rows or columns with no value above the threshold are encoded in INT8, occupying only one byte, while rows or columns with an element above the threshold are kept in FP16, occupying two bytes. This way we avoid rounding problems, since the calculations done in INT8 only involve values that do not exceed the range of 8 bits.
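Below is a minimal sketch of this decomposition in torch, with made-up numbers and an illustrative threshold. It is not the bitsandbytes kernel (it does not actually quantize anything); it only shows that splitting by outlier columns, multiplying the two parts separately and adding the results reproduces the full product.

```python
import torch

threshold = 6.0

# Toy activation matrix X and weight matrix W
X = torch.tensor([[ 1.0,  8.0,  2.0, -1.0],
                  [ 2.0,  1.0, -3.0,  7.0],
                  [-1.0,  2.0,  1.0,  1.0]])
W = torch.tensor([[ 1.0, -1.0],
                  [ 2.0,  1.0],
                  [ 1.0,  2.0],
                  [-1.0,  1.0]])

# Columns of X that contain at least one value above the threshold (outliers)
outlier_cols = (X.abs() > threshold).any(dim=0)

# Outlier part: kept and multiplied in higher precision (FP16 in llm.int8())
Y_outliers = X[:, outlier_cols] @ W[outlier_cols]

# Regular part: this is the part llm.int8() would quantize and multiply in INT8;
# here we multiply it in float just to show that the decomposition is exact
Y_regular = X[:, ~outlier_cols] @ W[~outlier_cols]

# The sum of both partial products equals the full product
print(torch.allclose(Y_regular + Y_outliers, X @ W))  # True
```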

Threshold value α

As we have said, we separate out the rows and columns that have some element greater than a threshold value, but which threshold should we choose? The authors of the paper experimented with several values and settled on a threshold of α = 6; above that value they started to see degradation in the language models.

Use of llm.int8()

Let's see how to quantize a model with llm.int8() using the transformers library. To do this, you must have bitsandbytes installed.

```
pip install bitsandbytes
```

We load a 1B-parameter model twice: once normally, and a second time quantizing it with llm.int8().

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
model_8bit = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
```

```
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
```
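As the warning suggests, the non-deprecated way is to pass a BitsAndBytesConfig object. A sketch of the equivalent call is shown below; the llm_int8_threshold argument is optional, since 6.0 is already the default, matching the α value from the paper.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,       # enable llm.int8() quantization
    llm_int8_threshold=6.0,  # outlier threshold α (6.0 is the default)
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    quantization_config=quantization_config,
)
```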

Let's see how much memory each model takes up:

```python
model.get_memory_footprint()/(1024**3), model_8bit.get_memory_footprint()/(1024**3)
```

```
(4.098002195358276, 1.1466586589813232)
```

As can be seen, the quantized model occupies much less memory.
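These numbers are roughly what a back-of-the-envelope calculation predicts: TinyLlama has about 1.1B parameters, FP32 takes 4 bytes per parameter and INT8 takes 1 byte (the outlier parts and the layers that stay in higher precision explain why the 8-bit model is a bit above 1 GB).

```python
params = 1.1e9  # approximate parameter count of TinyLlama-1.1B

print(f"FP32: ~{params * 4 / 1024**3:.1f} GB")  # ~4.1 GB
print(f"INT8: ~{params * 1 / 1024**3:.1f} GB")  # ~1.0 GB
```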

Let's now do a text generation test with the two models.

```python
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(device)
input_tokens.input_ids
```

```
tensor([[    1, 15043,   590,  1024,   338,  5918,  4200,   322,   306,   626,
           263,  6189, 29257, 10863,   261]], device='cuda:0')
```

We see the output with the normal model:

```python
import time

t0 = time.time()
max_new_tokens = 50
outputs = model.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t0)
```

```
Hello my name is Maximo and I am a Machine Learning Engineer. I am currently working at [Company Name] as a Machine Learning Engineer. I have a Bachelor's degree in Computer Science from [University Name] and a Master's degree in Computer Science from [University Name]. I
1.7616662979125977
```

And now with the quantized model:

```python
t0 = time.time()
max_new_tokens = 50
outputs = model_8bit.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t0)
```

```
Hello my name is Maximo and I am a Machine Learning Engineer. I am currently working at [Company Name] as a Machine Learning Engineer. I have a Bachelor's degree in Computer Science from [University Name] and a Master's degree in Computer Science from [University Name]. I
9.100712776184082
```

We see two things. On the one hand, the output text is the same, so with a much smaller model we can get the same result; on the other hand, the quantized model takes much longer to run, so it would not be advisable if you need to use this model in real time.

This may seem contradictory, because we might expect a smaller model to run faster. But in reality both models, the normal one and the quantized one, perform the same operations; the only difference is that one performs everything in FP32 while the other does it in INT8 and FP16. However, the quantized model also has to find the rows and columns with values greater than the threshold, separate them, perform the operations in INT8 and FP16, and then put the results back together, which is why it takes longer to run.
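A rough way to see this overhead, outside of bitsandbytes and with made-up matrices, is to compare a single matmul with the decomposed version (outlier search, split, two multiplications and a sum). This only illustrates the splitting and recombination cost; the real kernels also pay for quantizing and dequantizing on the fly.

```python
import time
import torch

torch.manual_seed(0)
X = torch.randn(4096, 4096)
W = torch.randn(4096, 4096)
threshold = 6.0

# Single matmul, as a regular FP32/FP16 model would do
t0 = time.time()
Y_full = X @ W
t_full = time.time() - t0

# Decomposed matmul, as in the llm.int8() scheme: find outlier columns,
# split the matrices, do two multiplications and add the results
t0 = time.time()
outlier_cols = (X.abs() > threshold).any(dim=0)
Y_mixed = X[:, ~outlier_cols] @ W[~outlier_cols] + X[:, outlier_cols] @ W[outlier_cols]
t_mixed = time.time() - t0

print(f"single matmul: {t_full:.4f} s, decomposed matmul: {t_mixed:.4f} s")
```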
