Embeddings
In a previous post about tokens, we already saw the minimal representation of each word, which consists of assigning a number to the smallest division of each word.
However, transformers, and therefore LLMs, do not represent the information of words in this way; instead, they do so by means of embeddings.
This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.
We will first look at two ways of representing words, ordinal encoding and one hot encoding. By seeing the problems of these two types of representations, we will be able to get to word embeddings and sentence embeddings.
We will also see an example of how to train a word embeddings model with the gensim library.
And finally we will see how to use pre-trained embeddings models with the transformers library from HuggingFace.
Ordinal encoding
This is the most basic way to represent the words inside the transformers. It consists of giving a number to each word, or keeping the numbers already assigned to the tokens.
However, this type of representation has two problems:
- Let us imagine that table corresponds to token 3, cat to token 1 and dog to token 2. One could assume that table = cat + dog, but it is not so; there is no such relationship between these words. We might even think that by assigning the right tokens this type of relationship could occur, but this idea falls apart with words that have more than one meaning, such as the word bank.
- The second problem is that neural networks internally do a lot of numerical calculations, so it could be the case that, because table has token 3, it internally carries more weight than the word cat, which has token 1.
So this type of word representation can be discarded very quickly.
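To make the problem concrete, here is a minimal sketch (with a tiny, made-up vocabulary) of why ordinal encoding implies arithmetic relationships that do not really exist:

# Hypothetical ordinal encoding: each word gets an arbitrary integer
vocab = {"cat": 1, "dog": 2, "table": 3}

# The numbers suggest a relationship that has no linguistic meaning
print(vocab["cat"] + vocab["dog"] == vocab["table"])  # True, but "cat + dog = table" means nothing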
One hot encoding
Here what is done is to use vectors of N dimensions. For example, we saw that OpenAI has a vocabulary of 100277 distinct tokens, so if we use one hot encoding, each word would be represented by a vector of 100277 dimensions.
However, one hot encoding has two other major problems:
- It does not take into account the relationship between words. So if we have two words that are synonyms, such as cat and feline, we would have two different vectors to represent them. In language the relationship between words is very important, and not taking this relationship into account is a big problem.
- The second problem is that the vectors are very large. If we have a vocabulary of 100277 tokens, each word would be represented by a vector of 100277 dimensions. This makes the vectors very large and computationally very expensive. In addition, these vectors are all zeros except in the position corresponding to the word's token, so most of the calculations are multiplications by zero, which add nothing. We end up with a lot of memory allocated to vectors that only have a 1 at a single position, as the sketch below shows.
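As a minimal sketch (using a tiny hypothetical vocabulary instead of the 100277 OpenAI tokens), this is what one hot encoding looks like and why the vectors are mostly wasted zeros:

import torch

# Tiny hypothetical vocabulary; a real one would have ~100277 entries
vocab = ["cat", "feline", "table", "chair"]

def one_hot(word):
    # Vector of zeros with a single 1 at the position of the word's token
    vector = torch.zeros(len(vocab))
    vector[vocab.index(word)] = 1.0
    return vector

print(one_hot("cat"))     # tensor([1., 0., 0., 0.])
print(one_hot("feline"))  # tensor([0., 1., 0., 0.]) -> nothing relates it to "cat"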
Word embeddings
With word embeddings we try to solve the problems of the two previous types of representations. To do this, vectors of N dimensions are used, but not of 100277 dimensions: vectors with far fewer dimensions are used. For example, we will see that OpenAI uses 1536 dimensions.
Each of the dimensions of these vectors represents a characteristic of the word. For example one of the dimensions could represent whether the word is a verb or a noun. Another dimension might represent whether the word is an animal or not. Another dimension might represent whether the word is a proper noun or not. And so on.
However, these features are not defined by hand, but are learned automatically. During the training of the transformers, the values of each of the dimensions of the vectors are adjusted, so that the characteristics of each of the words are learned.
By making each dimension of the vector represent a characteristic of the word, words that have similar characteristics will have similar vectors. For example, the words cat and feline will have very similar vectors, since they are both animals, and the words table and chair will have similar vectors, since both are furniture.
In the following image we can see a 3-dimensional representation of words, and we can see that all words related to school are close together, all words related to food are close together, and all words related to ball are close together.
Having each dimension of the vectors represent a characteristic of the word allows us to perform operations with words. For example, if we subtract the word man from the word king and add the word woman, we get a word very similar to the word queen. We will check this later with an example.
Similarity between words
As each word is represented by a vector of N dimensions, we can calculate the similarity between two words. The cosine similarity function is used for this purpose.
If two words are close in vector space, it means that the angle between their vectors is small, so their cosine is close to 1. If there is an angle of 90 degrees between the vectors, the cosine is 0, meaning that there is no similarity between the words. And if there is an angle of 180 degrees between the vectors, the cosine is -1, that is, the words are opposites.
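As a minimal sketch (not the implementation we use later, which is PyTorch's cosine_similarity), the cosine similarity of two vectors is their dot product divided by the product of their norms:

import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # cos(theta) = (a · b) / (||a|| * ||b||)
    return (a @ b / (a.norm() * b.norm())).item()

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # same direction
c = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction

print(cosine_sim(a, b))  # 1.0
print(cosine_sim(a, c))  # -1.0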
Example with OpenAI embeddings
Now that we know what embeddings are, let's see some examples with the embeddings provided by the OpenAI API. To do this we first need to have the openai package installed.
pip install openai
We import the necessary libraries
from openai import OpenAI
import torch
from torch.nn.functional import cosine_similarity
We need an OpenAI API key. To get one, go to the OpenAI page and register. Once registered, go to the API Keys section and create a new API Key.
api_key = "Put your API key here"
We select which embeddings model we want to use. In this case we are going to use text-embedding-ada-002, which is the one recommended by OpenAI in its embeddings documentation.
model_openai = "text-embedding-ada-002"
We create an API client.
client_openai = OpenAI(api_key=api_key, organization=None)
Let's see what the embeddings of the word Rey (king) look like.
word = "Rey"
embedding_openai = torch.Tensor(client_openai.embeddings.create(input=word, model=model_openai).data[0].embedding)
embedding_openai.shape, embedding_openai
(torch.Size([1536]),tensor([-0.0103, -0.0005, -0.0189, ..., -0.0009, -0.0226, 0.0045]))
As we can see, we obtain a vector of 1536 dimensions.
Operations with words
Let's get the embeddings of the words king (rey), man (hombre), woman (mujer) and queen (reina).
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="rey", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="hombre", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="mujer", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="reina", model=model_openai).data[0].embedding)
embedding_openai_reina.shape, embedding_openai_reina
(torch.Size([1536]),tensor([-0.0110, -0.0084, -0.0115, ..., 0.0082, -0.0096, -0.0024]))
Let's obtain the embedding resulting from subtracting the embedding of man from that of king and adding the embedding of woman.
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer
embedding_openai.shape, embedding_openai
(torch.Size([1536]),tensor([-0.0226, -0.0323, 0.0017, ..., 0.0014, -0.0290, -0.0188]))
Finally we compare the result obtained with the embedding of reina. For this we use the cosine_similarity function provided by the pytorch library.
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0)).item()
print(f"similarity_openai: {similarity_openai}")
similarity_openai: 0.7564167976379395
As we can see, it is a value very close to 1, so we can say that the result obtained is very similar to the embedding of reina.
If we use English words, we get a result closer to 1.
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="king", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="man", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="woman", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="queen", model=model_openai).data[0].embedding)
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0))
print(f"similarity_openai: {similarity_openai}")
similarity_openai: tensor([0.8849])
This is normal, since the OpenAI model has been trained with more texts in English than in Spanish.
Types of word embeddings
There are several types of word embeddings, and each of them has its advantages and disadvantages. Let's take a look at the most important ones:
- Word2Vec
- GloVe
- FastText
- BERT
- GPT-2
Word2Vec
Word2Vec is an algorithm used to create word embeddings. This algorithm was created by Google in 2013, and it is one of the most used algorithms to create word embeddings.
It has two variants, CBOW and Skip-gram. CBOW is faster to train, while Skip-gram is more accurate. Let's see how each of them works.
CBOW
CBOW or Continuous Bag of Words is an algorithm used to predict a word from the surrounding words. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the word cat from the surrounding words, in this case The, is, an and animal.
In this architecture, the model predicts which is the most likely word in the given context. Therefore, words that have the same probability of appearing are considered similar and are therefore closer in dimensional space.
Suppose that in a sentence we replace a word with a synonym (for example boat with ship); the model then predicts the probability for both, and if it turns out to be similar we can consider that the words are similar.
Skip-gram
Skip-gram or Skip-gram with Negative Sampling is an algorithm used to predict the words surrounding a word. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the words The, is, an and animal from the word cat.
This architecture is similar to that of CBOW, but instead the model works backwards. The model predicts the context using the given word. Therefore, words that have the same context are considered similar and are therefore closer in dimensional space.
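As a minimal sketch (an illustration of the training pairs, not the actual Word2Vec implementation), this is the difference between what CBOW and Skip-gram try to predict, using a context window of 1 word on each side:

sentence = ["the", "cat", "is", "an", "animal"]
window = 1

for i, target in enumerate(sentence):
    # Words within `window` positions on each side of the target word
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"CBOW:      predict '{target}' from the context {context}")
    print(f"Skip-gram: predict the context {context} from '{target}'")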
GloVe
GloVe or Global Vectors for Word Representation is an algorithm used to create word embeddings. This algorithm was created by Stanford University in 2014.
Word2Vec ignores the fact that some context words occur more frequently than others, and it also only takes into account the local context, so it does not capture the global context.
This algorithm uses a co-occurrence matrix to create the word embeddings. This co-occurrence matrix is a matrix that contains the number of times each word appears next to each of the other words in the vocabulary.
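As a minimal sketch (with a tiny hypothetical corpus, not the actual GloVe code), a co-occurrence matrix simply counts how many times each word appears within a window of another word:

from collections import defaultdict

corpus = [["the", "cat", "eats"], ["the", "dog", "eats"]]
window = 1

cooccurrence = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count the words within `window` positions of `word`
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccurrence[(word, sentence[j])] += 1

print(dict(cooccurrence))
# e.g. {('the', 'cat'): 1, ('cat', 'the'): 1, ('cat', 'eats'): 1, ...}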
FastText
FastText is an algorithm used to create word embeddings. This algorithm was created by Facebook in 2016.
One of the main disadvantages of Word2Vec and GloVe is that they cannot encode unknown or out-of-vocabulary words.
So, to deal with this problem, Facebook proposed the FastText model. It is an extension of Word2Vec and follows the same Skip-gram and CBOW models. But unlike Word2Vec, which feeds whole words into the neural network, FastText first splits words into several subwords (or n-grams) and then feeds them to the neural network.
For example, if the value of n is 3 and the word is manzana (apple), its tri-grams will be <ma, man, anz, nza, zan, ana and na>, and its word embedding will be the sum of the vector representations of these tri-grams. Here, the hyperparameters min_n and max_n are set to 3, and the characters < and > represent the beginning and end of the word.
Therefore, using this methodology, unknown words can be represented in vector form, since there is a high probability that their n-grams are also present in other words.
This algorithm is an improvement over Word2Vec, since in addition to taking into account the words surrounding a word, it also takes into account its n-grams. For example, for the word gato (cat) and n = 2, it also takes into account the n-grams of the word, in this case ga, at and to. A small sketch of how these n-grams can be obtained is shown below.
Limitations of word embeddings
Word embedding techniques have given a decent result, but the problem is that the approach is not precise enough. They do not take into account the order of the words in which they appear, which leads to loss of syntactic and semantic understanding of the sentence.
For example, You go there to teach, not to play and You go there to play, not to teach would have the same representation in vector space, but they do not mean the same thing.
In addition, the word embedding model cannot give satisfactory results on a large amount of text data, since the same word may have a different meaning in a different sentence depending on the context of the sentence.
For example, in I am going to sit on the bank and I am going to do business in the bank, the word bank has different meanings.
Therefore, we require a type of representation that can retain the contextual meaning of the word present in a sentence.
Sentence embeddings
Sentence embedding is similar to word embedding, but instead of words, it encodes the whole sentence into a vector representation.
A simple way to obtain a sentence embedding is to average the word embeddings of all the words present in the sentence, but this is not accurate enough. A minimal sketch of this averaging approach is shown below.
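As a minimal sketch of the averaging approach (the word vectors here are made up for the example):

import torch

# Hypothetical 3-dimensional word embeddings
word_embeddings = {
    "cat":  torch.tensor([0.9, 0.1, 0.0]),
    "eats": torch.tensor([0.1, 0.8, 0.2]),
    "fish": torch.tensor([0.7, 0.3, 0.1]),
}

sentence = ["cat", "eats", "fish"]

# The sentence embedding is the mean of the word embeddings
sentence_embedding = torch.stack([word_embeddings[word] for word in sentence]).mean(dim=0)
print(sentence_embedding)  # tensor([0.5667, 0.4000, 0.1000])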
Some of the most advanced models for sentence embedding are ELMo, InferSent and Sentence-BERT.
ELMo
ELMo or Embeddings from Language Models is a sentence embedding model that was created by the Allen Institute for AI in 2018. It uses a deep bidirectional LSTM network to produce the vector representation. ELMo can represent unknown or out-of-vocabulary words in vector form since it is character-based.
InferSent
InferSent is a sentence embedding model that was created by Facebook in 2017. It uses a deep bidirectional LSTM network to produce the vector representation. InferSent can represent unknown or out-of-vocabulary words in vector form as it is character-based. Sentences are encoded into a 4096-dimensional vector representation.

The training of the model is performed on the Stanford Natural Language Inference (SNLI) dataset. This dataset is labeled and written by humans for about 500K sentence pairs.
Sentence-BERT
Sentence-BERT is a sentence embedding model created in 2019 by researchers at TU Darmstadt. Instead of an LSTM, it uses siamese BERT networks to produce the vector representation, and it can handle unknown or out-of-vocabulary words because BERT uses subword (WordPiece) tokenization. Sentences are encoded into a 768-dimensional vector representation.
The state-of-the-art NLP model BERT
is excellent in Semantic Textual Similarity tasks, but the problem is that it would take a long time for a huge corpus (65 hours for 10,000 sentences), as it requires both sentences to be entered into the network and this increases the computation by a huge factor.
Therefore, Sentence-BERT is a modification of the BERT model, as sketched in the example below.
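As a hedged usage sketch, if the sentence-transformers package is installed (pip install sentence-transformers), a Sentence-BERT style model can be used roughly like this; the model name is just one common choice, not necessarily the one from the original paper:

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence-BERT style model (assumed model name)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = ["I am going to sit on the bank", "I am going to do business in the bank"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))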
Training of a word2vec model with gensim
To download the dataset we are going to use, the datasets library from huggingface must be installed:
pip install datasets
To train the embeddings model we are going to use the gensim library. To install it with conda we use
conda install -c conda-forge gensim
And to install it with pip we use
pip install gensim
To clean the dataset we have downloaded we are going to use regular expressions, which are usually already installed in Python, and nltk, which is a natural language processing library. To install it with conda we use
conda install -c anaconda nltk
And to install it with pip we use
pip install nltk
Now that we have everything installed we can import the libraries we are going to use:
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import strip_punctuation, strip_numeric, strip_short
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
Download dataset
We are going to download a dataset of texts from Wikipedia in Spanish. To do this, we run the following:
from datasets import load_dataset

dataset_corpus = load_dataset('large_spanish_corpus', name='all_wikis')
Let's see what it looks like
dataset_corpus
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 28109484
    })
})
As we can see, the dataset has more than 28 million texts. Let's take a look at some of them:
dataset_corpus['train']['text'][0:10]
['¡Bienvenidos!',
 'Ir a los contenidos»',
 '= Contenidos =',
 '',
 'Portada',
 'Tercera Lengua más hablada en el mundo.',
 'La segunda en número de habitantes en el mundo occidental.',
 'La de mayor proyección y crecimiento día a día.',
 'El español es, hoy en día, nombrado en cada vez más contextos, tomando realce internacional como lengua de cultura y civilización siempre de mayor envergadura.',
 'Ejemplo de ello es que la comunidad minoritaria más hablada en los Estados Unidos es precisamente la que habla idioma español.']
As there are many examples we will create a subset of 10 million examples to work faster:
subset = dataset_corpus['train'].select(range(10000000))
Dataset cleaning
Now we download the stopwords from nltk, which are words that do not provide information and that we are going to remove from the texts.
import nltk
nltk.download('stopwords')
Now we are going to download the punkt tokenizer from nltk, which will allow us to split the texts into sentences.
nltk.download('punkt')
We create a function to clean the data. This function is going to:
- Change the text to lower case
- Remove URLs
- Remove social network mentions such as @twitter and hashtags such as #hashtag
- Remove punctuation marks
- Remove numbers
- Remove short words
- Remove stopwords
As we are using a huggingface dataset, the texts are in dict format, so the function returns a dictionary.
def clean_text(sentence_batch):
    # Extract the text from the input
    text_list = sentence_batch['text']
    cleaned_text_list = []
    for text in text_list:
        # Convert the text to lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        # Remove @ mentions and '#' hashtags from social networks
        text = re.sub(r'@\w+|#\w+', '', text)
        # Remove punctuation characters
        text = strip_punctuation(text)
        # Remove numbers
        text = strip_numeric(text)
        # Remove short words
        text = strip_short(text, minsize=2)
        # Remove common words (stop words)
        stop_words = set(stopwords.words('spanish'))
        word_tokens = word_tokenize(text)
        filtered_text = [word for word in word_tokens if word not in stop_words]
        cleaned_text_list.append(filtered_text)
    # Return the clean text
    return {'text': cleaned_text_list}
We apply the function to the data
sentences_corpus = subset.map(clean_text, batched=True)
[nltk_data] Downloading package stopwords to /home/wallabot/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/wallabot/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Map:   0%|          | 0/10000000 [00:00<?, ? examples/s]
Let's save the filtered dataset in a file to avoid having to run the cleaning process again.
sentences_corpus.save_to_disk("sentences_corpus")
Saving the dataset (0/4 shards): 0%| | 0/15000000 [00:00<?, ? examples/s]
To load it we can do
from datasets import load_from_disk

sentences_corpus = load_from_disk('sentences_corpus')
Now what we are going to have is a list of lists, where each list is a tokenized phrase and without stopwords. That is, we have a list of phrases, and each phrase is a list of words. Let's see what it looks like:
for i in range(10):
    print(f'La frase "{subset["text"][i]}" se convierte en la lista de palabras "{sentences_corpus["text"][i]}"')
La frase "¡Bienvenidos!" se convierte en la lista de palabras "['¡bienvenidos']"La frase "Ir a los contenidos»" se convierte en la lista de palabras "['ir', 'contenidos', '»']"La frase "= Contenidos =" se convierte en la lista de palabras "['contenidos']"La frase "" se convierte en la lista de palabras "[]"La frase "Portada" se convierte en la lista de palabras "['portada']"La frase "Tercera Lengua más hablada en el mundo." se convierte en la lista de palabras "['tercera', 'lengua', 'hablada', 'mundo']"La frase "La segunda en número de habitantes en el mundo occidental." se convierte en la lista de palabras "['segunda', 'número', 'habitantes', 'mundo', 'occidental']"La frase "La de mayor proyección y crecimiento día a día." se convierte en la lista de palabras "['mayor', 'proyección', 'crecimiento', 'día', 'día']"La frase "El español es, hoy en día, nombrado en cada vez más contextos, tomando realce internacional como lengua de cultura y civilización siempre de mayor envergadura." se convierte en la lista de palabras "['español', 'hoy', 'día', 'nombrado', 'cada', 'vez', 'contextos', 'tomando', 'realce', 'internacional', 'lengua', 'cultura', 'civilización', 'siempre', 'mayor', 'envergadura']"La frase "Ejemplo de ello es que la comunidad minoritaria más hablada en los Estados Unidos es precisamente la que habla idioma español." se convierte en la lista de palabras "['ejemplo', 'ello', 'comunidad', 'minoritaria', 'hablada', 'unidos', 'precisamente', 'habla', 'idioma', 'español']"
Training of the word2vec model
We are going to train an embeddings model that will convert words into vectors. For this we are going to use the gensim library and its Word2Vec model.
dataset = sentences_corpus['text']
dim_embedding = 100
window_size = 5   # 5 words to the left and 5 words to the right
min_count = 5     # Ignore words with a frequency lower than 5
workers = 4       # Number of execution threads
sg = 1            # 0 for CBOW, 1 for Skip-gram

model = Word2Vec(dataset, vector_size=dim_embedding, window=window_size, min_count=min_count, workers=workers, sg=sg)
This model has been trained on the CPU, since gensim does not have the option to perform the training on the GPU, and even so it has taken X minutes on my computer to train the model. Although the size of the embedding we have chosen is only 100 (as opposed to the 1536 of the OpenAI embeddings), it is not a very long time, since the dataset has 10 million sentences.
Large language models are trained with datasets of billions of sentences, so it is normal that training an embeddings model with a dataset of 10 million sentences takes a few minutes.
Once the model is trained we save it in a file for future use.
model.save('word2vec.model')
If we would like to load it in the future, we can do so with
model = Word2Vec.load('word2vec.model')
Evaluation of the word2vec model
Let's see the words most similar to some given words.
model.wv.most_similar('perro', topn=10)
[('gato', 0.7948548197746277),
 ('perros', 0.77247554063797),
 ('cachorro', 0.7638891339302063),
 ('hámster', 0.7540281414985657),
 ('caniche', 0.7514827251434326),
 ('bobtail', 0.7492328882217407),
 ('mastín', 0.7491254210472107),
 ('lobo', 0.7312178611755371),
 ('semental', 0.7292628288269043),
 ('sabueso', 0.7290207147598267)]
model.wv.most_similar('gato', topn=10)
[('conejo', 0.8148329854011536),
 ('zorro', 0.8109457492828369),
 ('perro', 0.7948548793792725),
 ('lobo', 0.7878773808479309),
 ('ardilla', 0.7860757112503052),
 ('mapache', 0.7817519307136536),
 ('huiña', 0.766639232635498),
 ('oso', 0.7656188011169434),
 ('mono', 0.7633568644523621),
 ('camaleón', 0.7623056769371033)]
Now let's repeat the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman.
embedding_hombre = model.wv['hombre']
embedding_mujer = model.wv['mujer']
embedding_rey = model.wv['rey']
embedding_reina = model.wv['reina']
embedding = embedding_rey - embedding_hombre + embedding_mujer
import torch
from torch.nn.functional import cosine_similarity

embedding = torch.tensor(embedding).unsqueeze(0)
embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)

similarity = cosine_similarity(embedding, embedding_reina, dim=1)
similarity
tensor([0.8156])
As we can see, there is a lot of similarity.
Visualization of embeddings
We are going to visualize the embeddings. To do this, we first obtain the vectors and the words from the model.
embeddings = model.wv.vectors
words = list(model.wv.index_to_key)
As the dimension of the embeddings is 100, in order to visualize them in 2 or 3 dimensions we have to reduce their dimensionality. For this we will use PCA (faster) or TSNE (more precise) from sklearn.
from sklearn.decomposition import PCA

dimmesions = 2
pca = PCA(n_components=dimmesions)
reduced_embeddings_PCA = pca.fit_transform(embeddings)
from sklearn.manifold import TSNE

dimmesions = 2
tsne = TSNE(n_components=dimmesions, verbose=1, perplexity=40, n_iter=300)
reduced_embeddings_tsne = tsne.fit_transform(embeddings)
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 493923 samples in 0.013s...
[t-SNE] Computed neighbors for 493923 samples in 377.143s...
[t-SNE] Computed conditional probabilities for sample 1000 / 493923
[t-SNE] Computed conditional probabilities for sample 2000 / 493923
...
[t-SNE] Computed conditional probabilities for sample 493923 / 493923
[t-SNE] Mean sigma: 0.275311
[t-SNE] KL divergence after 250 iterations with early exaggeration: 117.413788
[t-SNE] KL divergence after 300 iterations: 5.774648
Now we visualize them in 2 dimensions with matplotlib. Let's visualize the dimensionality reduction we have done with PCA and with TSNE.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.title('Embeddings (PCA)')
plt.show()
plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
Use of pre-trained models with huggingface
To use pre-trained embeddings models we will use the transformers library from huggingface. To install it with conda we use
conda install -c conda-forge transformers
And to install it with pip we use
pip install transformers
With the feature-extraction task of huggingface we can use pre-trained models to obtain the embeddings of words. To do this we first import the necessary library.
from transformers import pipeline
Let's get the embeddings from BERT.
checkpoint = "bert-base-uncased"
feature_extractor = pipeline("feature-extraction", framework="pt", model=checkpoint)
Let's see the embeddings of the word rey (king).
embedding = feature_extractor("rey", return_tensors="pt").squeeze(0)
embedding.shape
torch.Size([3, 768])
As we can see, we obtain vectors of 768 dimensions, that is to say, the embeddings of BERT have 768 dimensions. On the other hand, we see that there are 3 embedding vectors; this is because BERT adds a token at the beginning and another at the end of the sentence, so we are only interested in the middle vector.
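To check where those 3 vectors come from, we can look at how the tokenizer splits the input. This is a small sketch (assuming the same bert-base-uncased checkpoint as above); the exact tokens may vary with the word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_ids = tokenizer("rey")["input_ids"]
print(tokenizer.convert_ids_to_tokens(input_ids))
# Expected something like ['[CLS]', 'rey', '[SEP]'] -> the two special tokens explain the extra vectors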
Let us redo the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman.
embedding_hombre = feature_extractor("man", return_tensors="pt").squeeze(0)[1]
embedding_mujer = feature_extractor("woman", return_tensors="pt").squeeze(0)[1]
embedding_rey = feature_extractor("king", return_tensors="pt").squeeze(0)[1]
embedding_reina = feature_extractor("queen", return_tensors="pt").squeeze(0)[1]
embedding = embedding_rey - embedding_hombre + embedding_mujer
Let's see the similarity
import torch
from torch.nn.functional import cosine_similarity
embedding = torch.tensor(embedding).unsqueeze(0)
embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)
similarity = cosine_similarity(embedding, embedding_reina, dim=1)
similarity.item()
Using the embeddings of BERT we also get a result very close to 1.