Tokens
Now that LLMs are on the rise, we keep hearing about the number of tokens supported by each model, but what are tokens? They are the minimum units of representation of words.
This notebook has been automatically translated to make it accessible to more people; please let me know if you spot any typos.
To explain what tokens are, let's first look at a practical example using the OpenAI tokenizer, called tiktoken.
So, first we install the package:
```
pip install tiktoken
```
Once installed, we create a tokenizer using the cl100k_base encoding, which, as the example notebook How to count tokens with tiktoken explains, is the one used by the gpt-4, gpt-3.5-turbo and text-embedding-ada-002 models.
```
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
```
Now we create a sample word to tokenize
```
example_word = "breakdown"
```
And we tokenize it
```
tokens = encoder.encode(example_word)
tokens
```
[9137, 2996]
The word has been divided into 2 tokens, 9137 and 2996. Let's see which words they correspond to.
```
word1 = encoder.decode([tokens[0]])
word2 = encoder.decode([tokens[1]])
word1, word2
```
('break', 'down')
The OpenAI tokenizer has split the word breakdown into the words break and down. That is, it has split the word into 2 simpler words.
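By the way, if we decode the two tokens together instead of one by one, we get the original word back. A quick check, reusing the tokens list from before:

```
# Decoding the full token list should reconstruct the original word exactly
reconstructed = encoder.decode(tokens)
print(reconstructed)                  # should print 'breakdown'
print(reconstructed == example_word)  # should print True
```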
This is important, because when we say that an LLM supports x tokens, it does not mean that it supports x words, but that it supports x minimum units of word representation.
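As a quick illustration (the sentence below is just a made-up example), the number of words in a text and the number of tokens it occupies usually don't match:

```
# A made-up sentence to compare word count vs. token count
sentence = "Artificial intelligence models process tokens, not words."
num_words = len(sentence.split())
num_tokens = len(encoder.encode(sentence))
print(f"words: {num_words}, tokens: {num_tokens}")
```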
If you have a text and want to see how many tokens it has for the OpenAI tokenizer, you can check it on the Tokenizer page, which shows each token in a different color.
We have seen the OpenAI tokenizer, but each LLM may use a different one.
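In fact, even among OpenAI models the encoding can differ. tiktoken can look up the encoding for a given model name with tiktoken.encoding_for_model; the model-to-encoding mapping noted in the comments is the one tiktoken ships with, so it may change between versions:

```
import tiktoken

# tiktoken maps each model name to the encoding it uses
print(tiktoken.encoding_for_model("gpt-4").name)  # expected: cl100k_base
print(tiktoken.encoding_for_model("gpt2").name)   # expected: gpt2, an older and smaller vocabulary
```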
As we have said, tokens are the minimum units of representation of words, so let's see how many different tokens tiktoken has.
```
n_vocab = encoder.n_vocab
print(f"Vocab size: {n_vocab}")
```
Vocab size: 100277
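Each of those roughly 100k entries is just an id that maps to a chunk of text. We can peek at a few of them directly; the ids below are picked arbitrarily, so the decoded strings are simply whatever happens to sit at those positions:

```
# Decode a few arbitrary token ids to see the text fragments they represent
for token_id in [100, 1000, 10000, 100000]:
    print(token_id, repr(encoder.decode([token_id])))
```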
Let's see how other types of words are tokenized.
```
def encode_decode(word):
    tokens = encoder.encode(word)
    decode_tokens = []
    for token in tokens:
        decode_tokens.append(encoder.decode([token]))
    return tokens, decode_tokens
```
```
word = "dog"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")

word = "tomorrow..."
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")

word = "artificial intelligence"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")

word = "Python"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")

word = "12/25/2023"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")

word = "😊"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
```
```
Word: dog ==> tokens: [18964], decode_tokens: ['dog']
Word: tomorrow... ==> tokens: [38501, 7924, 1131], decode_tokens: ['tom', 'orrow', '...']
Word: artificial intelligence ==> tokens: [472, 16895, 11478], decode_tokens: ['art', 'ificial', ' intelligence']
Word: Python ==> tokens: [31380], decode_tokens: ['Python']
Word: 12/25/2023 ==> tokens: [717, 14, 914, 14, 2366, 18], decode_tokens: ['12', '/', '25', '/', '202', '3']
Word: 😊 ==> tokens: [76460, 232], decode_tokens: ['�', '�']
```
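The '�' characters for the emoji appear because its UTF-8 bytes are split across two tokens, and each token on its own is not valid UTF-8; decoding the tokens together recovers the emoji. A minimal check using tiktoken's byte-level helper:

```
word = "😊"
tokens = encoder.encode(word)

# Each token maps to a fragment of the emoji's UTF-8 bytes,
# which is why decoding them one by one prints the replacement character
print([encoder.decode_single_token_bytes(t) for t in tokens])

# Decoding all the tokens together reassembles the full character
print(encoder.decode(tokens))
```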
Finally, let's look at words in another language.
word = "perro"tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")word = "perra"tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")word = "mañana..."tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")word = "inteligencia artificial"tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")word = "Python"tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")word = "12/25/2023"tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")word = "😊"tokens, decode_tokens = encode_decode(word)print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
```
Word: perro ==> tokens: [716, 299], decode_tokens: ['per', 'ro']
Word: perra ==> tokens: [79, 14210], decode_tokens: ['p', 'erra']
Word: mañana... ==> tokens: [1764, 88184, 1131], decode_tokens: ['ma', 'ñana', '...']
Word: inteligencia artificial ==> tokens: [396, 39567, 8968, 21075], decode_tokens: ['int', 'elig', 'encia', ' artificial']
Word: Python ==> tokens: [31380], decode_tokens: ['Python']
Word: 12/25/2023 ==> tokens: [717, 14, 914, 14, 2366, 18], decode_tokens: ['12', '/', '25', '/', '202', '3']
Word: 😊 ==> tokens: [76460, 232], decode_tokens: ['�', '�']
```
We can see that similar words generate more tokens in Spanish than in English, so for the same text, with a similar number of words, the token count will be higher in Spanish than in English.
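As a rough sanity check (the sentence pair below is just an example, and the exact counts depend on the text), you can compare the token count of a sentence and its Spanish translation:

```
# Example sentence pair; token counts will vary with the chosen text
english = "Tomorrow we will talk about artificial intelligence."
spanish = "Mañana hablaremos sobre inteligencia artificial."
print(f"English: {len(encoder.encode(english))} tokens")
print(f"Spanish: {len(encoder.encode(spanish))} tokens")
```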