BPE tokenizer

04 de junho de 2024

Tokenizador BPE

Este caderno foi traduzido automaticamente para torná-lo acessível a mais pessoas, por favor me avise se você vir algum erro de digitação..

O tokenizador BPE (Byte Pair Encoding) é um algoritmo de compactação de dados usado para criar um vocabulário de subpalavras a partir de um corpus de texto. Esse algoritmo é baseado na frequência de pares de bytes no texto. Ele se tornou popular porque foi usado como um tokenizador por LLMs como GPT, GPT-2, RoBERTa, BART e DeBERTa.

Algoritmo de treinamento

Suponha que tenhamos um corpus de texto contendo apenas as seguintes palavras hug, pug, pun, bun e hugs, a primeira etapa é criar um vocabulário com todos os caracteres presentes no corpus, que no nosso caso será b, g, h, n, p, s, u.

	
		corpus_words = ["hug", "pug", "pun", "bun", "hugs"]
# Concatenate all the words in the corpus
initial_corpus_tokens = ""
for word in corpus_words:
    initial_corpus_tokens += word
# Convert the concatenated string to a set of tokens to get unique tokens
initial_corpus_tokens = set(initial_corpus_tokens)
print(f"Corpus words: {corpus_words}")
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"Number of initial corpus tokens: {len(initial_corpus_tokens)}")

	
		Corpus words: ['hug', 'pug', 'pun', 'bun', 'hugs']
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
Number of initial corpus tokens: 7

Agora, suponha que esse seja o nosso corpus de sentenças, é um corpus inventado, não faz sentido.

	
		corpus = [
          "hug hug hug pun pun bun hugs",
          "hug hug pug pug pun pun hugs",
          "hug hug pug pug pun pun pun pun hugs",
          "pug pun pun pun bun hugs",
          "hug hug hug pun bun bun hugs",
      ]

Vamos contar o número de vezes que cada palavra aparece no corpus, para verificar se o que colocamos antes está correto.

	
		corpus = [
    "hug hug hug pun pun bun hugs",
    "hug hug pug pug pun pun hugs",
    "hug hug pug pug pun pun pun pun hugs",
    "pug pun pun pun bun hugs",
    "hug hug hug pun bun bun hugs",
]
num_hug = 0
num_pug = 0
num_pun = 0
num_bun = 0
num_hugs = 0
for sentence in corpus:
    words = sentence.split(" ")
    for word in words:
        if word == "hug":
            num_hug += 1
        elif word == "pug":
            num_pug += 1
        elif word == "pun":
            num_pun += 1
        elif word == "bun":
            num_bun += 1
        elif word == "hugs":
            num_hugs += 1
print(f"Number of hug: {num_hug}")
print(f"Number of pug: {num_pug}")
print(f"Number of pun: {num_pun}")
print(f"Number of bun: {num_bun}")
print(f"Number of hugs: {num_hugs}")

	
		Number of hug: 10
Number of pug: 5
Number of pun: 12
Number of bun: 4
Number of hugs: 5

Tudo o que havíamos contado está bem, podemos continuar.

Criaremos um dicionário com os tokens de cada palavra e o número de vezes que ela aparece no corpus.

	
		dict_tokens_by_word_appearance = {
    "hug":
        {
            "count": num_hug,
            "tokens": [character for character in "hug"],
        },
    "pug":
        {
            "count": num_pug,
            "tokens": [character for character in "pug"],
        },
    "pun":
        {
            "count": num_pun,
            "tokens": [character for character in "pun"],
        },
    "bun":
        {
            "count": num_bun,
            "tokens": [character for character in "bun"],
        },
    "hugs":
        {
            "count": num_hugs,
            "tokens": [character for character in "hugs"],
        },
}
dict_tokens_by_word_appearance

	
		{'hug': {'count': 10, 'tokens': ['h', 'u', 'g']},
 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']},
 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}

Agora vamos procurar o par de tokens consecutivos que aparece com mais frequência no dicionário.

	
		dict_keys = dict_tokens_by_word_appearance.keys()
list_consecutive_tokens = []
for i, key in enumerate(dict_keys):
    # Get the tokens of the word
    number_of_toneks_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])
    # Get consecituve tokens
    for j in range(number_of_toneks_of_word-1):
        # Get consecutive tokens
        consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]
        # Append the consecutive tokens to the list the number of times the word appears
        for _ in range(dict_tokens_by_word_appearance[key]["count"]):
            list_consecutive_tokens.append(consecutive_tokens)
# Print the list of consecutive tokens
print(f"List of consecutive tokens: {list_consecutive_tokens}")
# Get consecutive tokens with maximum frequency
dict_consecutive_tokens = {}
for token in list_consecutive_tokens:
    # Check if the token is already in the dictionary
    if token in dict_consecutive_tokens:
        # Increment the count of the token
        dict_consecutive_tokens[token] += 1
    
    # If the token is not in the dictionary
    else:
        # Add the token to the dictionary
        dict_consecutive_tokens[token] = 1
# Print the dictionary of consecutive tokens
print(f"Dictionary of consecutive tokens: {dict_consecutive_tokens}")
# Get the consecutive token with maximum frequency
max_consecutive_token = None
while True:
    # Get the token with maximum frequency
    consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)
    # Check if the token is already in the list of tokens
    if consecutive_token in initial_corpus_tokens:
        # Remove token from the dictionary
        dict_consecutive_tokens.pop(consecutive_token)
    # If the token is not in the list of tokens
    else:
        # Assign the token to the max_consecutive_token
        max_consecutive_token = consecutive_token
        break
# Print the consecutive token with maximum frequency
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

	
		List of consecutive tokens: ['hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'pu', 'pu', 'pu', 'pu', 'pu', 'ug', 'ug', 'ug', 'ug', 'ug', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'bu', 'bu', 'bu', 'bu', 'un', 'un', 'un', 'un', 'hu', 'hu', 'hu', 'hu', 'hu', 'ug', 'ug', 'ug', 'ug', 'ug', 'gs', 'gs', 'gs', 'gs', 'gs']
Dictionary of consecutive tokens: {'hu': 15, 'ug': 20, 'pu': 17, 'un': 16, 'bu': 4, 'gs': 5}
Consecutive token with maximum frequency: ug

Obtivemos o par de tokens que aparece com mais frequência. Vamos encapsular isso em uma função porque vamos usá-la com mais frequência

	
		def get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, list_corpus_tokens):
          dict_keys = dict_tokens_by_word_appearance.keys()
      
          list_consecutive_tokens = []
          for i, key in enumerate(dict_keys):
              # Get the tokens of the word
              number_of_toneks_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])
      
              # Get consecituve tokens
              for j in range(number_of_toneks_of_word-1):
                  # Get consecutive tokens
                  consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]
                  # Append the consecutive tokens to the list
                  for _ in range(dict_tokens_by_word_appearance[key]["count"]):
                      list_consecutive_tokens.append(consecutive_tokens)
      
          # Get consecutive tokens with maximum frequency
          dict_consecutive_tokens = {}
          for token in list_consecutive_tokens:
              # Check if the token is already in the dictionary
              if token in dict_consecutive_tokens:
                  # Increment the count of the token
                  dict_consecutive_tokens[token] += 1
              
              # If the token is not in the dictionary
              else:
                  # Add the token to the dictionary
                  dict_consecutive_tokens[token] = 1
      
          # Get the consecutive token with maximum frequency
          max_consecutive_token = None
          while True:
              # Get the token with maximum frequency
              consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)
      
              # Check if the token is already in the list of tokens
              if consecutive_token in list_corpus_tokens:
                  # Remove token from the dictionary
                  dict_consecutive_tokens.pop(consecutive_token)
      
              # If the token is not in the list of tokens
              else:
                  # Assign the token to the max_consecutive_token
                  max_consecutive_token = consecutive_token
                  break
      
          return max_consecutive_token

Verificamos se o resultado é o mesmo de antes

	
		def get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, list_corpus_tokens):
    dict_keys = dict_tokens_by_word_appearance.keys()
    list_consecutive_tokens = []
    for i, key in enumerate(dict_keys):
        # Get the tokens of the word
        number_of_toneks_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])
        # Get consecituve tokens
        for j in range(number_of_toneks_of_word-1):
            # Get consecutive tokens
            consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]
            # Append the consecutive tokens to the list
            for _ in range(dict_tokens_by_word_appearance[key]["count"]):
                list_consecutive_tokens.append(consecutive_tokens)
    # Get consecutive tokens with maximum frequency
    dict_consecutive_tokens = {}
    for token in list_consecutive_tokens:
        # Check if the token is already in the dictionary
        if token in dict_consecutive_tokens:
            # Increment the count of the token
            dict_consecutive_tokens[token] += 1
        
        # If the token is not in the dictionary
        else:
            # Add the token to the dictionary
            dict_consecutive_tokens[token] = 1
    # Get the consecutive token with maximum frequency
    max_consecutive_token = None
    while True:
        # Get the token with maximum frequency
        consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)
        # Check if the token is already in the list of tokens
        if consecutive_token in list_corpus_tokens:
            # Remove token from the dictionary
            dict_consecutive_tokens.pop(consecutive_token)
        # If the token is not in the list of tokens
        else:
            # Assign the token to the max_consecutive_token
            max_consecutive_token = consecutive_token
            break
    return max_consecutive_token
max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, initial_corpus_tokens)
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

	
		Consecutive token with maximum frequency: ug

Vemos que sim

Agora, nosso corpus de token pode ser modificado com a adição do token ug.

	
		# new_corpus_tokens = initial_corpus_tokens + max_consecutive_token
new_corpus_tokens = initial_corpus_tokens.copy()
new_corpus_tokens.add(max_consecutive_token)
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"New corpus tokens: {new_corpus_tokens}")

	
		Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}

Colocamos isso também em uma função

	
		def get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens):
          new_corpus_tokens = initial_corpus_tokens.copy()
          new_corpus_tokens.add(max_consecutive_token)
          return new_corpus_tokens

Verificaremos novamente se estamos recebendo o mesmo que antes.

	
		def get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens):
    new_corpus_tokens = initial_corpus_tokens.copy()
    new_corpus_tokens.add(max_consecutive_token)
    return new_corpus_tokens
new_corpus_tokens = get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens)
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"New corpus tokens: {new_corpus_tokens}")

	
		Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}

Vemos que sim

Agora vamos modificar o dicionário no qual aparecem as palavras, os tokens e o número de vezes que eles aparecem com o novo token.

import copy
      
      dict_keys = dict_tokens_by_word_appearance.keys()
      dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
      
      for key in dict_keys:
          # Check if the new token is in the word
          if max_consecutive_token in key:
              print(f"Token {max_consecutive_token} is in the word {key}")
      
              # Add the new token to the word tokens
              dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
      
              print(f"New tokens of the word {key}: {dict_tokens_by_word_appearance_tmp[key]['tokens']}")
      
      print(f"Initial tokens by word appearance: {dict_tokens_by_word_appearance}")
      print(f"New tokens by word appearance: ")
      dict_tokens_by_word_appearance_tmp

Token ug is in the word hug
      New tokens of the word hug: ['h', 'u', 'g', 'ug']
      Token ug is in the word pug
      New tokens of the word pug: ['p', 'u', 'g', 'ug']
      Token ug is in the word hugs
      New tokens of the word hugs: ['h', 'u', 'g', 's', 'ug']
      Initial tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}
      New tokens by word appearance:

Out[32]:

{'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']},
       'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug']},
       'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
       'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
       'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug]']}},

Colocamos isso em uma função

	
		import copy
      
      dict_keys = dict_tokens_by_word_appearance.keys()
      dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
      
      for key in dict_keys:
          # Check if the new token is in the word
          if max_consecutive_token in key:
              print(f"Token {max_consecutive_token} is in the word {key}")
      
              # Add the new token to the word tokens
              dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
      
              print(f"New tokens of the word {key}: {dict_tokens_by_word_appearance_tmp[key]['tokens']}")
      
      print(f"Initial tokens by word appearance: {dict_tokens_by_word_appearance}")
      print(f"New tokens by word appearance: ")
      dict_tokens_by_word_appearance_tmp
def update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token):
          dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
          dict_keys = dict_tokens_by_word_appearance_tmp.keys()
      
          for key in dict_keys:
              # Check if the new token is in the word
              if max_consecutive_token in key:
                  # Add the new token to the word tokens
                  dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
      
          return dict_tokens_by_word_appearance_tmp

Verificamos se está tudo bem

dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
      print(f"New tokens by word appearance: ")
      dict_tokens_by_word_appearance

New tokens by word appearance:

Out[34]:

{'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']},
       'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug']},
       'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
       'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
       'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug]']}},

Em resumo, em uma primeira iteração, passamos de um corpus de tokens s, g, h, u, n, p, b para o novo corpus de tokens h, u, n, p, s, g, b, ug.

Agora realizamos uma segunda iteração, obtendo o par de tokens consecutivos que aparecem com mais frequência no dicionário.

	
		import copy
dict_keys = dict_tokens_by_word_appearance.keys()
dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
for key in dict_keys:
    # Check if the new token is in the word
    if max_consecutive_token in key:
        print(f"Token {max_consecutive_token} is in the word {key}")
        # Add the new token to the word tokens
        dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
        print(f"New tokens of the word {key}: {dict_tokens_by_word_appearance_tmp[key]['tokens']}")
print(f"Initial tokens by word appearance: {dict_tokens_by_word_appearance}")
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance_tmp
def update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token):
    dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
    dict_keys = dict_tokens_by_word_appearance_tmp.keys()
    for key in dict_keys:
        # Check if the new token is in the word
        if max_consecutive_token in key:
            # Add the new token to the word tokens
            dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
    return dict_tokens_by_word_appearance_tmp
dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance
max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, new_corpus_tokens)
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

	
		Token ug is in the word hug
New tokens of the word hug: ['h', 'u', 'g', 'ug']
Token ug is in the word pug
New tokens of the word pug: ['p', 'u', 'g', 'ug']
Token ug is in the word hugs
New tokens of the word hugs: ['h', 'u', 'g', 's', 'ug']
Initial tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}
New tokens by word appearance: 
New tokens by word appearance: 
Consecutive token with maximum frequency: pu

Obtemos o novo corpus de tokens

	
		corpus_tokens = get_new_corpus_tokens(max_consecutive_token, new_corpus_tokens)
print(f"Initial corpus tokens: {new_corpus_tokens}")
print(f"New corpus tokens: {corpus_tokens}")

	
		Initial corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}
New corpus tokens: {'p', 'n', 'pu', 'u', 's', 'h', 'ug', 'g', 'b'}

E temos o novo dicionário mostrando as palavras, os tokens e o número de vezes que eles aparecem.

dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
      print(f"New tokens by word appearance: ")
      dict_tokens_by_word_appearance

New tokens by word appearance:

Out[37]:

{'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']},
       'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']},
       'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu']},
       'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
       'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug]']}},

Agora, podemos continuar até termos um corpus de tokens com o tamanho desejado. Vamos criar um corpus de 15 tokens.

	
		dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance
len_corpus_tokens = 15
while len(corpus_tokens) < len_corpus_tokens:
    max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)
    print(f"Consecutive token with maximum frequency: {max_consecutive_token}")
    # If there are no more consecutive tokens break the loop
    if max_consecutive_token is None:
        break
    corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)
    print(f"New corpus tokens: {corpus_tokens}")
    dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
    print(f"New tokens by word appearance: {dict_tokens_by_word_appearance}\n")

	
		New tokens by word appearance: 
Consecutive token with maximum frequency: un
New corpus tokens: {'p', 'n', 'pu', 'un', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug']}}
Consecutive token with maximum frequency: hu
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: gug
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'gug', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: ughu
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'gug', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: npu
New corpus tokens: {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: puun
New corpus tokens: {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'puun', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}

Agora que já vimos como treinar o tokenizador BPE, vamos treiná-lo do zero para consolidar nosso conhecimento.

	
		corpus_words = ["hug", "pug", "pun", "bun", "hugs"]
# Concatenate all the words in the corpus
initial_corpus_tokens = ""
for word in corpus_words:
    initial_corpus_tokens += word
# Convert the concatenated string to a set of tokens to get unique tokens
corpus_tokens = set(initial_corpus_tokens)
print(f"Corpus words: {corpus_words}")
print(f"Initial corpus tokens: {corpus_tokens}")
print(f"Number of initial corpus tokens: {len(corpus_tokens)}")

	
		Corpus words: ['hug', 'pug', 'pun', 'bun', 'hugs']
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
Number of initial corpus tokens: 7

	
		corpus = [
    "hug hug hug pun pun bun hugs",
    "hug hug pug pug pun pun hugs",
    "hug hug pug pug pun pun pun pun hugs",
    "pug pun pun pun bun hugs",
    "hug hug hug pun bun bun hugs",
]
num_hug = 0
num_pug = 0
num_pun = 0
num_bun = 0
num_hugs = 0
for sentence in corpus:
    words = sentence.split(" ")
    for word in words:
        if word == "hug":
            num_hug += 1
        elif word == "pug":
            num_pug += 1
        elif word == "pun":
            num_pun += 1
        elif word == "bun":
            num_bun += 1
        elif word == "hugs":
            num_hugs += 1
dict_tokens_by_word_appearance = {
    "hug":
        {
            "count": num_hug,
            "tokens": [character for character in "hug"],
        },
    "pug":
        {
            "count": num_pug,
            "tokens": [character for character in "pug"],
        },
    "pun":
        {
            "count": num_pun,
            "tokens": [character for character in "pun"],
        },
    "bun":
        {
            "count": num_bun,
            "tokens": [character for character in "bun"],
        },
    "hugs":
        {
            "count": num_hugs,
            "tokens": [character for character in "hugs"],
        },
}
dict_tokens_by_word_appearance

	
		{'hug': {'count': 10, 'tokens': ['h', 'u', 'g']},
 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']},
 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}

Nós o treinamos do zero para obter um corpus de 15 tokens.

	
		len_corpus_tokens = 15
print(f"Initial corpus tokens: ({len(corpus_tokens)}) {corpus_tokens}")
while len(corpus_tokens) < len_corpus_tokens:
    max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)
    # If there are no more consecutive tokens break the loop
    if max_consecutive_token is None:
        break
    corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)
    dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New corpus tokens: ({len(corpus_tokens)}) {corpus_tokens}")

	
		Initial corpus tokens: (7) {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: (15) {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'puun', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}

Tokenização

Se agora quiséssemos tokenizar, primeiro teríamos que criar um vocabulário, ou seja, atribuir um ID a cada token.

vocab = {}
      for i, token in enumerate(corpus_tokens):
          vocab[token] = i
      
      print(f"Vocabulary: ")
      vocab

Vocabulary:

Out[54]:

{'p': 0,
       'hu': 1,
       'sug': 2,
       'npu': 3,
       'ugpu': 4,
       'gug': 5,
       'u': 6,
       'ug': 7,
       'ughu': 8,
       'n': 9,
       'pu': 10,
       'un': 11,
       'puun': 12,
       's': 13,
       'h': 14,
       'gs': 15,
       'g': 16,
       'b': 17}

Nós o colocamos em uma função

	
		vocab = {}
      for i, token in enumerate(corpus_tokens):
          vocab[token] = i
      
      print(f"Vocabulary: ")
      vocab
def get_vocabulary(corpus_tokens):
          vocab = {}
          for i, token in enumerate(corpus_tokens):
              vocab[token] = i
          return vocab

Compromissos corretos

vocab = get_vocabulary(corpus_tokens)
      print(f"Vocabulary: ")
      vocab

Vocabulary:

Out[56]:

{'p': 0,
       'hu': 1,
       'sug': 2,
       'npu': 3,
       'ugpu': 4,
       'gug': 5,
       'u': 6,
       'ug': 7,
       'ughu': 8,
       'n': 9,
       'pu': 10,
       'un': 11,
       'puun': 12,
       's': 13,
       'h': 14,
       'gs': 15,
       'g': 16,
       'b': 17}

Se agora quisermos tokenizar a palavra bug, podemos fazer o seguinte

	
		vocab = {}
for i, token in enumerate(corpus_tokens):
    vocab[token] = i
print(f"Vocabulary: ")
vocab
def get_vocabulary(corpus_tokens):
    vocab = {}
    for i, token in enumerate(corpus_tokens):
        vocab[token] = i
    return vocab
vocab = get_vocabulary(corpus_tokens)
print(f"Vocabulary: ")
vocab
word = 'bug'
# Get the maximum length of tokens
max_len = max(len(token) for token in vocab)
print(f"Maximum length of tokens: {max_len}")
# Create a empty list of tokens
tokens = []
while len(word) > 0:
    # Flag to check if the token is found
    found = False
    # Iterate over the maximum length of tokens from max_len to 0
    for i in range(max_len, 0, -1):
        # Get the prefix of the word
        prefix = word[:i]
        print(f"Prefix: {prefix}")
        # Check if the prefix is in the vocabulary
        if prefix in vocab:
            print(f"prefix {prefix} is in the vocabulary")
            tokens.append(prefix)
            word = word[i:]
            found = True
            break
    
    # if not found:
    #     tokens.append('<UNK>')
    #     word = word[1:]
print(f"Tokens: {tokens}")

	
		Vocabulary: 
Vocabulary: 
Maximum length of tokens: 4
Prefix: bug
Prefix: bug
Prefix: bu
Prefix: b
prefix b is in the vocabulary
Prefix: ug
prefix ug is in the vocabulary
Tokens: ['b', 'ug']

Mas se agora quisermos tokenizar a palavra mug, não poderemos, porque o caractere m não está no vocabulário, então o tokenizaremos com o token <UNK>.

	
		word = 'mug'
# Get the maximum length of tokens
max_len = max(len(token) for token in vocab)
print(f"Maximum length of tokens: {max_len}")
# Create a empty list of tokens
tokens = []
while len(word) > 0:
    # Flag to check if the token is found
    found = False
    # Iterate over the maximum length of tokens from max_len to 0
    for i in range(max_len, 0, -1):
        # Get the prefix of the word
        prefix = word[:i]
        print(f"Prefix: {prefix}")
        # Check if the prefix is in the vocabulary
        if prefix in vocab:
            print(f"prefix {prefix} is in the vocabulary")
            tokens.append(prefix)
            word = word[i:]
            found = True
            break
    if not found:
        tokens.append('<UNK>')
        word = word[1:]
print(f"Tokens: {tokens}")

	
		Maximum length of tokens: 4
Prefix: mug
Prefix: mug
Prefix: mu
Prefix: m
Prefix: ug
prefix ug is in the vocabulary
Tokens: ['<UNK>', 'ug']

Nós o colocamos em uma função

	
		def tokenize_word(word, vocab):
          # Get the maximum length of tokens
          max_len = max(len(token) for token in vocab)
      
          # Create a empty list of tokens
          tokens = []
          while len(word) > 0:
              # Flag to check if the token is found
              found = False
      
              # Iterate over the maximum length of tokens from max_len to 0
              for i in range(max_len, 0, -1):
                  # Get the prefix of the word
                  prefix = word[:i]
      
                  # Check if the prefix is in the vocabulary
                  if prefix in vocab:
                      tokens.append(prefix)
                      word = word[i:]
                      found = True
                      break
      
              if not found:
                  tokens.append('<UNK>')
                  word = word[1:]
      
          return tokens

Verificamos se está tudo bem

	
		def tokenize_word(word, vocab):
    # Get the maximum length of tokens
    max_len = max(len(token) for token in vocab)
    # Create a empty list of tokens
    tokens = []
    while len(word) > 0:
        # Flag to check if the token is found
        found = False
        # Iterate over the maximum length of tokens from max_len to 0
        for i in range(max_len, 0, -1):
            # Get the prefix of the word
            prefix = word[:i]
            # Check if the prefix is in the vocabulary
            if prefix in vocab:
                tokens.append(prefix)
                word = word[i:]
                found = True
                break
        if not found:
            tokens.append('<UNK>')
            word = word[1:]
    return tokens
print(f"Tokenization of the word 'bug': {tokenize_word('bug', vocab)}")
print(f"Tokenization of the word 'mug': {tokenize_word('mug', vocab)}")

	
		Tokenization of the word 'bug': ['b', 'ug']
Tokenization of the word 'mug': ['<UNK>', 'ug']

Visualizador de tokens

Agora que sabemos como funciona um tokenizador BPE, vamos ver como o visualizador the-tokenizer-playground se pareceria com os tokens de qualquer frase

Continuar lendo

Fundamentos de RAG

Esqueça o Ctrl+F! 🤯 Com RAG, seus documentos responderão às suas perguntas diretamente. 😎 Tutorial passo a passo com Hugging Face e ChromaDB. Liberte o poder da IA (e impressione seus amigos)! 💪

Conventional commits

😠 Seus commits parecem escritos em uma língua alienígena? 👽 Junte-se ao clube! 😅 Aprenda Conventional Commits em Python e pare de tortura sua equipe com mensagens enigmáticas. git-changelog e commitizen serão seus novos melhores amigos. 🤝

DoLa – Decoding by Contrasting Layers Improves Factuality in Large Language Models

Você já conversou com um LLM e ele lhe respondeu algo que parece ter bebido café de máquina a noite toda? 😂 Isso é o que chamamos de alucinação no mundo dos LLMs! Mas não se preocupe, pois não é que seu modelo de linguagem esteja louco (embora às vezes possa parecer isso 🤪). A verdade é que os LLMs podem ser um pouco... criativos quando se trata de gerar texto. Mas graças ao DoLa, um método que usa camadas de contraste para melhorar a viabilidade dos LLMs, podemos evitar que nossos modelos de linguagem se transformem em escritores de ficção científica 😂. Nesta publicação, explicarei como o DoLa funciona e mostrarei um exemplo de código para que você possa entender melhor como tornar seus LLMs mais confiáveis e menos propensos a inventar histórias. Vamos salvar nossos LLMs da loucura e torná-los mais úteis! 🚀

Últimos posts -->

Você viu esses projetos?

Subtify

GPU monitor

Alfred

Ver todos os projetos -->

Quer aplicar IA no seu projeto? Entre em contato!

Quer melhorar com essas dicas?

Memory Calculator

Espaço para calcular a memória necessária para executar um modelo

Formatador de código Black

Seu colega Filipao está escrevendo código que é difícil de ler? Compartilhe com ele este formatador de código que eu mostro neste post! Venha e aprenda a formatar o código para torná-lo mais compreensível. Não vamos resolver os problemas de Filipao, mas pelo menos você não sofrerá ao lê-lo

Declare redes neurais de forma clara

Declare redes neurais de forma clara no Pytorch

Últimos tips -->

Use isso localmente

Os espaços do Hugging Face nos permitem executar modelos com demos muito simples, mas e se a demo quebrar? Ou se o usuário a deletar? Por isso, criei contêineres docker com alguns espaços interessantes, para poder usá-los localmente, aconteça o que acontecer. Na verdade, se você clicar em qualquer botão de visualização de projeto, ele pode levá-lo a um espaço que não funciona.