BPE tokenizer
This notebook has been automatically translated to make it accessible to more people; please let me know if you spot any mistakes.
The BPE (Byte Pair Encoding) tokenizer is a data compression algorithm used to create a vocabulary of subwords from a corpus of text. The algorithm is based on the frequency of byte (or character) pairs in the text. It became popular because it is used as the tokenizer of LLMs such as GPT, GPT-2, RoBERTa, BART and DeBERTa.
Training algorithm
Suppose we have a text corpus containing only the words hug, pug, pun, bun and hugs. The first step is to create a vocabulary with all the characters present in the corpus; in our case that is b, g, h, n, p, s, u.
corpus_words = ["hug", "pug", "pun", "bun", "hugs"]# Concatenate all the words in the corpusinitial_corpus_tokens = ""for word in corpus_words:initial_corpus_tokens += word# Convert the concatenated string to a set of tokens to get unique tokensinitial_corpus_tokens = set(initial_corpus_tokens)print(f"Corpus words: {corpus_words}")print(f"Initial corpus tokens: {initial_corpus_tokens}")print(f"Number of initial corpus tokens: {len(initial_corpus_tokens)}")
Corpus words: ['hug', 'pug', 'pun', 'bun', 'hugs']
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
Number of initial corpus tokens: 7
Now let's suppose that this is our corpus of sentences. It is a made-up corpus and does not mean anything.
corpus = ["hug hug hug pun pun bun hugs","hug hug pug pug pun pun hugs","hug hug pug pug pun pun pun pun hugs","pug pun pun pun bun hugs","hug hug hug pun bun bun hugs",]
Let's count the number of times each word appears in the corpus, so we can use those counts later.
corpus = ["hug hug hug pun pun bun hugs","hug hug pug pug pun pun hugs","hug hug pug pug pun pun pun pun hugs","pug pun pun pun bun hugs","hug hug hug pun bun bun hugs",]num_hug = 0num_pug = 0num_pun = 0num_bun = 0num_hugs = 0for sentence in corpus:words = sentence.split(" ")for word in words:if word == "hug":num_hug += 1elif word == "pug":num_pug += 1elif word == "pun":num_pun += 1elif word == "bun":num_bun += 1elif word == "hugs":num_hugs += 1print(f"Number of hug: {num_hug}")print(f"Number of pug: {num_pug}")print(f"Number of pun: {num_pun}")print(f"Number of bun: {num_bun}")print(f"Number of hugs: {num_hugs}")
Number of hug: 10
Number of pug: 5
Number of pun: 12
Number of bun: 4
Number of hugs: 5
The counts check out, so we can move on.
We will create a dictionary with the tokens of each word and the number of times it appears in the corpus.
dict_tokens_by_word_appearance = {"hug":{"count": num_hug,"tokens": [character for character in "hug"],},"pug":{"count": num_pug,"tokens": [character for character in "pug"],},"pun":{"count": num_pun,"tokens": [character for character in "pun"],},"bun":{"count": num_bun,"tokens": [character for character in "bun"],},"hugs":{"count": num_hugs,"tokens": [character for character in "hugs"],},}dict_tokens_by_word_appearance
{'hug': {'count': 10, 'tokens': ['h', 'u', 'g']},
 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']},
 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}
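As an aside, the same counts and dictionary could be built programmatically instead of writing each word by hand. This is just a sketch using Python's collections.Counter; build_token_dict is a helper name introduced here and is not used in the rest of the notebook.

from collections import Counter

def build_token_dict(corpus):
    # Count how many times each word appears in the corpus
    word_counts = Counter(word for sentence in corpus for word in sentence.split(" "))
    # For each word, store its count and its initial character-level tokens
    return {
        word: {"count": count, "tokens": list(word)}
        for word, count in word_counts.items()
    }

build_token_dict(corpus)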
Now let's look for the pair of consecutive tokens that appears most often in the dictionary
dict_keys = dict_tokens_by_word_appearance.keys()

list_consecutive_tokens = []
for i, key in enumerate(dict_keys):
    # Get the number of tokens of the word
    number_of_tokens_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])

    # Get consecutive token pairs
    for j in range(number_of_tokens_of_word-1):
        # Get consecutive tokens
        consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]

        # Append the consecutive tokens to the list the number of times the word appears
        for _ in range(dict_tokens_by_word_appearance[key]["count"]):
            list_consecutive_tokens.append(consecutive_tokens)

# Print the list of consecutive tokens
print(f"List of consecutive tokens: {list_consecutive_tokens}")

# Get consecutive tokens with maximum frequency
dict_consecutive_tokens = {}
for token in list_consecutive_tokens:
    # Check if the token is already in the dictionary
    if token in dict_consecutive_tokens:
        # Increment the count of the token
        dict_consecutive_tokens[token] += 1
    # If the token is not in the dictionary
    else:
        # Add the token to the dictionary
        dict_consecutive_tokens[token] = 1

# Print the dictionary of consecutive tokens
print(f"Dictionary of consecutive tokens: {dict_consecutive_tokens}")

# Get the consecutive token with maximum frequency
max_consecutive_token = None
while True:
    # Get the token with maximum frequency
    consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)

    # Check if the token is already in the list of tokens
    if consecutive_token in initial_corpus_tokens:
        # Remove token from the dictionary
        dict_consecutive_tokens.pop(consecutive_token)
    # If the token is not in the list of tokens
    else:
        # Assign the token to max_consecutive_token
        max_consecutive_token = consecutive_token
        break

# Print the consecutive token with maximum frequency
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")
List of consecutive tokens: ['hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'pu', 'pu', 'pu', 'pu', 'pu', 'ug', 'ug', 'ug', 'ug', 'ug', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'bu', 'bu', 'bu', 'bu', 'un', 'un', 'un', 'un', 'hu', 'hu', 'hu', 'hu', 'hu', 'ug', 'ug', 'ug', 'ug', 'ug', 'gs', 'gs', 'gs', 'gs', 'gs']
Dictionary of consecutive tokens: {'hu': 15, 'ug': 20, 'pu': 17, 'un': 16, 'bu': 4, 'gs': 5}
Consecutive token with maximum frequency: ug
We have obtained the pair of tokens that appears most often. Let's wrap this in a function, since we will use it several more times.
def get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, list_corpus_tokens):
    dict_keys = dict_tokens_by_word_appearance.keys()

    list_consecutive_tokens = []
    for i, key in enumerate(dict_keys):
        # Get the number of tokens of the word
        number_of_tokens_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])

        # Get consecutive token pairs
        for j in range(number_of_tokens_of_word-1):
            # Get consecutive tokens
            consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]

            # Append the consecutive tokens to the list the number of times the word appears
            for _ in range(dict_tokens_by_word_appearance[key]["count"]):
                list_consecutive_tokens.append(consecutive_tokens)

    # Get consecutive tokens with maximum frequency
    dict_consecutive_tokens = {}
    for token in list_consecutive_tokens:
        # Check if the token is already in the dictionary
        if token in dict_consecutive_tokens:
            # Increment the count of the token
            dict_consecutive_tokens[token] += 1
        # If the token is not in the dictionary
        else:
            # Add the token to the dictionary
            dict_consecutive_tokens[token] = 1

    # Get the consecutive token with maximum frequency
    max_consecutive_token = None
    while True:
        # If there are no candidate pairs left, return None
        if len(dict_consecutive_tokens) == 0:
            break

        # Get the token with maximum frequency
        consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)

        # Check if the token is already in the list of tokens
        if consecutive_token in list_corpus_tokens:
            # Remove token from the dictionary
            dict_consecutive_tokens.pop(consecutive_token)
        # If the token is not in the list of tokens
        else:
            # Assign the token to max_consecutive_token
            max_consecutive_token = consecutive_token
            break

    return max_consecutive_token
We check that we get the same as before
max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, initial_corpus_tokens)
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")
Consecutive token with maximum frequency: ug
And indeed we do.
Now our token corpus can be modified by adding the token ug.
# new_corpus_tokens = initial_corpus_tokens + max_consecutive_token
new_corpus_tokens = initial_corpus_tokens.copy()
new_corpus_tokens.add(max_consecutive_token)

print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"New corpus tokens: {new_corpus_tokens}")
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}
We also put this in a function
def get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens):
    new_corpus_tokens = initial_corpus_tokens.copy()
    new_corpus_tokens.add(max_consecutive_token)
    return new_corpus_tokens
We will check again that we obtain the same as before
new_corpus_tokens = get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens)
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"New corpus tokens: {new_corpus_tokens}")
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}
Again, we get the same result.
Now we update the dictionary of words, their tokens and their counts with the new token.
import copy

dict_keys = dict_tokens_by_word_appearance.keys()

dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)

for key in dict_keys:
    # Check if the new token is in the word
    if max_consecutive_token in key:
        print(f"Token {max_consecutive_token} is in the word {key}")

        # Add the new token to the word tokens
        dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
        print(f"New tokens of the word {key}: {dict_tokens_by_word_appearance_tmp[key]['tokens']}")

print(f"Initial tokens by word appearance: {dict_tokens_by_word_appearance}")
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance_tmp
We put this in a function
def update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token):
    dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
    dict_keys = dict_tokens_by_word_appearance_tmp.keys()

    for key in dict_keys:
        # Check if the new token is in the word
        if max_consecutive_token in key:
            # Add the new token to the word tokens
            dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)

    return dict_tokens_by_word_appearance_tmp
We check that it is OK
dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance
In summary, in a first iteration we have gone from the token corpus s, g, h, u, n, p, b to the new token corpus h, u, n, p, s, g, b, ug.
We now perform a second iteration, obtaining the pair of consecutive tokens that appears most often in the dictionary.
max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, new_corpus_tokens)
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")
Consecutive token with maximum frequency: pu
We obtain the new corpus of tokens
corpus_tokens = get_new_corpus_tokens(max_consecutive_token, new_corpus_tokens)
print(f"Initial corpus tokens: {new_corpus_tokens}")
print(f"New corpus tokens: {corpus_tokens}")
Initial corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}
New corpus tokens: {'p', 'n', 'pu', 'u', 's', 'h', 'ug', 'g', 'b'}
And we obtain the new dictionary with the words, their tokens and their counts.
dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance
Now we can continue until we have a token corpus of the size we want. Let's create a corpus of 15 tokens.
len_corpus_tokens = 15

while len(corpus_tokens) < len_corpus_tokens:
    max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)
    print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

    # If there are no more consecutive tokens break the loop
    if max_consecutive_token is None:
        break

    corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)
    print(f"New corpus tokens: {corpus_tokens}")

    dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
    print(f"New tokens by word appearance: {dict_tokens_by_word_appearance}\n")
Consecutive token with maximum frequency: un
New corpus tokens: {'p', 'n', 'pu', 'un', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug']}}

Consecutive token with maximum frequency: hu
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}

Consecutive token with maximum frequency: gug
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'gug', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}

Consecutive token with maximum frequency: ughu
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'gug', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}

Consecutive token with maximum frequency: npu
New corpus tokens: {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}

Consecutive token with maximum frequency: puun
New corpus tokens: {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'puun', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Now that we have seen how to train the BPE tokenizer, let's train it from scratch to consolidate our knowledge.
corpus_words = ["hug", "pug", "pun", "bun", "hugs"]# Concatenate all the words in the corpusinitial_corpus_tokens = ""for word in corpus_words:initial_corpus_tokens += word# Convert the concatenated string to a set of tokens to get unique tokenscorpus_tokens = set(initial_corpus_tokens)print(f"Corpus words: {corpus_words}")print(f"Initial corpus tokens: {corpus_tokens}")print(f"Number of initial corpus tokens: {len(corpus_tokens)}")
Corpus words: ['hug', 'pug', 'pun', 'bun', 'hugs']
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
Number of initial corpus tokens: 7
corpus = ["hug hug hug pun pun bun hugs","hug hug pug pug pun pun hugs","hug hug pug pug pun pun pun pun hugs","pug pun pun pun bun hugs","hug hug hug pun bun bun hugs",]num_hug = 0num_pug = 0num_pun = 0num_bun = 0num_hugs = 0for sentence in corpus:words = sentence.split(" ")for word in words:if word == "hug":num_hug += 1elif word == "pug":num_pug += 1elif word == "pun":num_pun += 1elif word == "bun":num_bun += 1elif word == "hugs":num_hugs += 1dict_tokens_by_word_appearance = {"hug":{"count": num_hug,"tokens": [character for character in "hug"],},"pug":{"count": num_pug,"tokens": [character for character in "pug"],},"pun":{"count": num_pun,"tokens": [character for character in "pun"],},"bun":{"count": num_bun,"tokens": [character for character in "bun"],},"hugs":{"count": num_hugs,"tokens": [character for character in "hugs"],},}dict_tokens_by_word_appearance
{'hug': {'count': 10, 'tokens': ['h', 'u', 'g']},
 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']},
 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}
We train it from scratch to obtain a corpus of 15 tokens.
len_corpus_tokens = 15

print(f"Initial corpus tokens: ({len(corpus_tokens)}) {corpus_tokens}")

while len(corpus_tokens) < len_corpus_tokens:
    max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)

    # If there are no more consecutive tokens break the loop
    if max_consecutive_token is None:
        break

    corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)

    dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)

print(f"New corpus tokens: ({len(corpus_tokens)}) {corpus_tokens}")
Initial corpus tokens: (7) {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: (15) {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'puun', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
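If you like, the whole training loop can also be wrapped into a single helper. This is just a sketch that reuses the three functions defined above; train_bpe is a name introduced here, not something used elsewhere in the notebook.

def train_bpe(dict_tokens_by_word_appearance, corpus_tokens, vocab_size):
    # Keep merging the most frequent pair of consecutive tokens
    # until the token corpus reaches the desired size
    while len(corpus_tokens) < vocab_size:
        max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)

        # Stop if there are no more pairs to merge
        if max_consecutive_token is None:
            break

        corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)
        dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)

    return corpus_tokens, dict_tokens_by_word_appearance

# Example usage (would reproduce the result above):
# corpus_tokens, dict_tokens_by_word_appearance = train_bpe(dict_tokens_by_word_appearance, corpus_tokens, 15)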
Tokenization
If we now wanted to tokenize, we would first have to create a vocabulary, that is, assign an ID to each token.
vocab = {}
for i, token in enumerate(corpus_tokens):
    vocab[token] = i

print(f"Vocabulary: ")
vocab
We put it in a function
def get_vocabulary(corpus_tokens):
    vocab = {}
    for i, token in enumerate(corpus_tokens):
        vocab[token] = i
    return vocab
We check that it is OK
vocab = get_vocabulary(corpus_tokens)
print(f"Vocabulary: ")
vocab
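As a side note, to go back from IDs to tokens (for example when decoding) it is useful to also keep the inverse mapping. A minimal sketch; id_to_token is a name introduced here:

# Invert the vocabulary: map each ID back to its token
id_to_token = {token_id: token for token, token_id in vocab.items()}

# Example: recover the token with ID 0
print(id_to_token[0])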
If we now want to tokenize the word bug, we can do the following
word = 'bug'

# Get the maximum length of tokens
max_len = max(len(token) for token in vocab)
print(f"Maximum length of tokens: {max_len}")

# Create an empty list of tokens
tokens = []

while len(word) > 0:
    # Flag to check if the token is found
    found = False

    # Iterate over the maximum length of tokens from max_len to 0
    for i in range(max_len, 0, -1):
        # Get the prefix of the word
        prefix = word[:i]
        print(f"Prefix: {prefix}")

        # Check if the prefix is in the vocabulary
        if prefix in vocab:
            print(f"prefix {prefix} is in the vocabulary")
            tokens.append(prefix)
            word = word[i:]
            found = True
            break

    # if not found:
    #     tokens.append('<UNK>')
    #     word = word[1:]

print(f"Tokens: {tokens}")
Maximum length of tokens: 4
Prefix: bug
Prefix: bug
Prefix: bu
Prefix: b
prefix b is in the vocabulary
Prefix: ug
prefix ug is in the vocabulary
Tokens: ['b', 'ug']
But if we now want to tokenize the word mug, we cannot, because the character m is not in the vocabulary, so we tokenize it with the token <UNK>.
word = 'mug'

# Get the maximum length of tokens
max_len = max(len(token) for token in vocab)
print(f"Maximum length of tokens: {max_len}")

# Create an empty list of tokens
tokens = []

while len(word) > 0:
    # Flag to check if the token is found
    found = False

    # Iterate over the maximum length of tokens from max_len to 0
    for i in range(max_len, 0, -1):
        # Get the prefix of the word
        prefix = word[:i]
        print(f"Prefix: {prefix}")

        # Check if the prefix is in the vocabulary
        if prefix in vocab:
            print(f"prefix {prefix} is in the vocabulary")
            tokens.append(prefix)
            word = word[i:]
            found = True
            break

    # If no prefix matched, emit the unknown token and skip one character
    if not found:
        tokens.append('<UNK>')
        word = word[1:]

print(f"Tokens: {tokens}")
Maximum length of tokens: 4
Prefix: mug
Prefix: mug
Prefix: mu
Prefix: m
Prefix: ug
prefix ug is in the vocabulary
Tokens: ['<UNK>', 'ug']
We put it in a function
def tokenize_word(word, vocab):
    # Get the maximum length of tokens
    max_len = max(len(token) for token in vocab)

    # Create an empty list of tokens
    tokens = []

    while len(word) > 0:
        # Flag to check if the token is found
        found = False

        # Iterate over the maximum length of tokens from max_len to 0
        for i in range(max_len, 0, -1):
            # Get the prefix of the word
            prefix = word[:i]

            # Check if the prefix is in the vocabulary
            if prefix in vocab:
                tokens.append(prefix)
                word = word[i:]
                found = True
                break

        # If no prefix matched, emit the unknown token and skip one character
        if not found:
            tokens.append('<UNK>')
            word = word[1:]

    return tokens
We check that it is OK
print(f"Tokenization of the word 'bug': {tokenize_word('bug', vocab)}")
print(f"Tokenization of the word 'mug': {tokenize_word('mug', vocab)}")
Tokenization of the word 'bug': ['b', 'ug']
Tokenization of the word 'mug': ['<UNK>', 'ug']
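We can also tokenize a whole sentence by splitting it into words, tokenizing each one, and mapping the resulting tokens to their IDs with the vocabulary. A small sketch using the functions defined above; tokenize_sentence is a name introduced here, and note that <UNK> has no ID in our vocabulary, so unknown tokens are mapped to None:

def tokenize_sentence(sentence, vocab):
    # Tokenize each word of the sentence and flatten the result
    tokens = []
    for word in sentence.split(" "):
        tokens.extend(tokenize_word(word, vocab))

    # Map each token to its ID (None if the token is not in the vocabulary, e.g. <UNK>)
    ids = [vocab.get(token) for token in tokens]
    return tokens, ids

tokens, ids = tokenize_sentence("hug a pug", vocab)
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")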
Token viewer
Now that we know how a BPE tokenizer works, let's use the the-tokenizer-playground visualizer to see what the tokens of any sentence look like.
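If you prefer to inspect tokens from code instead of from the web visualizer, something like the following sketch works, assuming the transformers library is installed and the GPT-2 tokenizer (which uses BPE) can be downloaded:

from transformers import AutoTokenizer

# Load the GPT-2 tokenizer, which is a BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "I love hugging my pug"
tokens = tokenizer.tokenize(sentence)
ids = tokenizer.encode(sentence)

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")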