Embeddings
In a previous post about tokens, we already saw the minimal representation of each word, which consists of assigning a number to the smallest division of each word.
However, transformers, and therefore LLMs, do not represent the information of words in this way; instead, they do so by means of embeddings.
This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.
We will first look at two ways of representing words, ordinal encoding and one hot encoding. By seeing the problems of these two types of representations, we will be able to get to word embeddings and sentence embeddings.
We will also see an example of how to train a word embeddings model with the gensim library.
And finally we will see how to use pre-trained embeddings models with the transformers library from HuggingFace.
Ordinal encoding
This is the most basic way to represent the words inside the transformers. It consists of giving a number to each word, or keeping the numbers already assigned to the tokens.
However, this type of representation has two problems:
- Let us imagine that table corresponds to token 3, cat to token 1 and dog to token 2. One could assume that table = cat + dog, but it is not so; there is no such relationship between these words. We might even think that by assigning the right tokens this type of relationship could occur, but this idea falls apart with words that have more than one meaning, such as the word bank.
- The second problem is that neural networks internally do a lot of numerical calculations, so it could be the case that, because table has token 3, it internally carries more weight than the word cat, which has token 1.
So this type of word representation can be discarded very quickly.
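To make the problem concrete, here is a minimal sketch (with a tiny, made-up vocabulary) of why ordinal encoding implies arithmetic relationships that do not really exist:

# Hypothetical ordinal encoding: each word gets an arbitrary integer
vocab = {"cat": 1, "dog": 2, "table": 3}

# The numbers suggest a relationship that has no linguistic meaning
print(vocab["cat"] + vocab["dog"] == vocab["table"])  # True, but "cat + dog = table" means nothing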
One hot encoding
Here what is done is to use vectors of N dimensions. For example, we saw that OpenAI has a vocabulary of 100277 distinct tokens, so if we use one hot encoding, each word would be represented by a vector of 100277 dimensions.
However, one hot encoding has two other major problems:
- It does not take into account the relationship between words. So if we have two words that are synonyms, such as cat and feline, we would have two different vectors to represent them. In language the relationship between words is very important, and not taking this relationship into account is a big problem.
- The second problem is that the vectors are very large. If we have a vocabulary of 100277 tokens, each word would be represented by a vector of 100277 dimensions. This makes the vectors very large and computationally very expensive. In addition, these vectors are all zeros except in the position corresponding to the word's token, so most of the calculations are multiplications by zero, which add nothing. We end up with a lot of memory allocated to vectors that only have a 1 at a single position, as the sketch below shows.
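As a minimal sketch (using a tiny hypothetical vocabulary instead of the 100277 OpenAI tokens), this is what one hot encoding looks like and why the vectors are mostly wasted zeros:

import torch

# Tiny hypothetical vocabulary; a real one would have ~100277 entries
vocab = ["cat", "feline", "table", "chair"]

def one_hot(word):
    # Vector of zeros with a single 1 at the position of the word's token
    vector = torch.zeros(len(vocab))
    vector[vocab.index(word)] = 1.0
    return vector

print(one_hot("cat"))     # tensor([1., 0., 0., 0.])
print(one_hot("feline"))  # tensor([0., 1., 0., 0.]) -> nothing relates it to "cat"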
Word embeddings
With word embeddings we try to solve the problems of the two previous types of representations. To do this, vectors of N dimensions are used, but not of 100277 dimensions: vectors with far fewer dimensions are used. For example, we will see that OpenAI uses 1536 dimensions.
Each of the dimensions of these vectors represents a characteristic of the word. For example one of the dimensions could represent whether the word is a verb or a noun. Another dimension might represent whether the word is an animal or not. Another dimension might represent whether the word is a proper noun or not. And so on.
However, these features are not defined by hand, but are learned automatically. During the training of the transformers, the values of each of the dimensions of the vectors are adjusted, so that the characteristics of each of the words are learned.
By making each dimension of the vector represent a characteristic of the word, words that have similar characteristics will have similar vectors. For example, the words cat and feline will have very similar vectors, since they are both animals, and the words table and chair will have similar vectors, since both are furniture.
In the following image we can see a 3-dimensional representation of words, and we can see that all words related to school are close together, all words related to food are close together, and all words related to ball are close together.
Having each dimension of the vectors represent a characteristic of the word allows us to perform operations with words. For example, if we subtract the word man from the word king and add the word woman, we get a word very similar to the word queen. We will check this later with an example.
Similarity between words
As each word is represented by a vector of N dimensions, we can calculate the similarity between two words. The cosine similarity function is used for this purpose.
If two words are close in vector space, it means that the angle between their vectors is small, so their cosine is close to 1. If there is an angle of 90 degrees between the vectors, the cosine is 0, meaning that there is no similarity between the words. And if there is an angle of 180 degrees between the vectors, the cosine is -1, that is, the words are opposites.
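As a minimal sketch (not the implementation we use later, which is PyTorch's cosine_similarity), the cosine similarity of two vectors is their dot product divided by the product of their norms:

import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # cos(theta) = (a · b) / (||a|| * ||b||)
    return (a @ b / (a.norm() * b.norm())).item()

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # same direction
c = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction

print(cosine_sim(a, b))  # 1.0
print(cosine_sim(a, c))  # -1.0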
Example with OpenAI embeddings
Now that we know what embeddings are, let's see some examples with the embeddings provided by the OpenAI API. To do this we first need to have the openai package installed.
pip install openai
We import the necessary libraries
from openai import OpenAI
import torch
from torch.nn.functional import cosine_similarity
We need an OpenAI API key. To get one, go to the OpenAI page and register. Once registered, go to the API Keys section and create a new API Key.
api_key = "Put your API key here"
We select which embeddings model we want to use. In this case we are going to use text-embedding-ada-002, which is the one recommended by OpenAI in its embeddings documentation.
model_openai = "text-embedding-ada-002"
We create an API client.
client_openai = OpenAI(api_key=api_key, organization=None)
Let's see what the embeddings of the word Rey (king) look like.
word = "Rey"
embedding_openai = torch.Tensor(client_openai.embeddings.create(input=word, model=model_openai).data[0].embedding)
embedding_openai.shape, embedding_openai
(torch.Size([1536]),tensor([-0.0103, -0.0005, -0.0189, ..., -0.0009, -0.0226, 0.0045]))
As we can see, we obtain a vector of 1536 dimensions.
Operations with words
Let's get the embeddings of the words king (rey), man (hombre), woman (mujer) and queen (reina).
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="rey", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="hombre", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="mujer", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="reina", model=model_openai).data[0].embedding)
embedding_openai_reina.shape, embedding_openai_reina
(torch.Size([1536]),tensor([-0.0110, -0.0084, -0.0115, ..., 0.0082, -0.0096, -0.0024]))
Let's obtain the embedding resulting from subtracting the embedding of man from that of king and adding the embedding of woman.
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer
embedding_openai.shape, embedding_openai
(torch.Size([1536]),tensor([-0.0226, -0.0323, 0.0017, ..., 0.0014, -0.0290, -0.0188]))
Finally we compare the result obtained with the embedding of reina. For this we use the cosine_similarity function provided by the pytorch library.
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0)).item()
print(f"similarity_openai: {similarity_openai}")
similarity_openai: 0.7564167976379395
As we can see, it is a value very close to 1, so we can say that the result obtained is very similar to the embedding of reina.
If we use English words, we get a result closer to 1.
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="king", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="man", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="woman", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="queen", model=model_openai).data[0].embedding)
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0))
print(f"similarity_openai: {similarity_openai}")
similarity_openai: tensor([0.8849])
This is normal, since the OpenAI model has been trained with more texts in English than in Spanish.
Types of word embeddings
There are several types of word embeddings, and each of them has its advantages and disadvantages. Let's take a look at the most important ones:
- Word2Vec
- GloVe
- FastText
- BERT
- GPT-2
Word2Vec
Word2Vec is an algorithm used to create word embeddings. This algorithm was created by Google in 2013, and it is one of the most used algorithms to create word embeddings.
It has two variants, CBOW and Skip-gram. CBOW is faster to train, while Skip-gram is more accurate. Let's see how each of them works.
CBOW
CBOW or Continuous Bag of Words is an algorithm used to predict a word from the surrounding words. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the word cat from the surrounding words, in this case The, is, an and animal.
In this architecture, the model predicts which is the most likely word in the given context. Therefore, words that have the same probability of appearing are considered similar and are therefore closer in dimensional space.
Suppose that in a sentence we replace a word with a synonym (for example boat with ship); the model then predicts the probability for both, and if it turns out to be similar we can consider that the words are similar.
Skip-gram
Skip-gram or Skip-gram with Negative Sampling is an algorithm used to predict the words surrounding a word. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the words The, is, an and animal from the word cat.
This architecture is similar to that of CBOW, but instead the model works backwards. The model predicts the context using the given word. Therefore, words that have the same context are considered similar and are therefore closer in dimensional space.
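As a minimal sketch (an illustration of the training pairs, not the actual Word2Vec implementation), this is the difference between what CBOW and Skip-gram try to predict, using a context window of 1 word on each side:

sentence = ["the", "cat", "is", "an", "animal"]
window = 1

for i, target in enumerate(sentence):
    # Words within `window` positions on each side of the target word
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"CBOW:      predict '{target}' from the context {context}")
    print(f"Skip-gram: predict the context {context} from '{target}'")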
GloVe
GloVe or Global Vectors for Word Representation is an algorithm used to create word embeddings. This algorithm was created by Stanford University in 2014.
Word2Vec ignores the fact that some context words occur more frequently than others, and it also only takes into account the local context, so it does not capture the global context.
This algorithm uses a co-occurrence matrix to create the word embeddings. This co-occurrence matrix is a matrix that contains the number of times each word appears next to each of the other words in the vocabulary.
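As a minimal sketch (with a tiny hypothetical corpus, not the actual GloVe code), a co-occurrence matrix simply counts how many times each word appears within a window of another word:

from collections import defaultdict

corpus = [["the", "cat", "eats"], ["the", "dog", "eats"]]
window = 1

cooccurrence = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count the words within `window` positions of `word`
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccurrence[(word, sentence[j])] += 1

print(dict(cooccurrence))
# e.g. {('the', 'cat'): 1, ('cat', 'the'): 1, ('cat', 'eats'): 1, ...}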
FastText
FastText is an algorithm used to create word embeddings. This algorithm was created by Facebook in 2016.
One of the main disadvantages of Word2Vec and GloVe is that they cannot encode unknown or out-of-vocabulary words.
So, to deal with this problem, Facebook proposed the FastText model. It is an extension of Word2Vec and follows the same Skip-gram and CBOW models. But unlike Word2Vec, which feeds whole words into the neural network, FastText first splits words into several subwords (or n-grams) and then feeds them to the neural network.
For example, if the value of n is 3 and the word is manzana (apple), its tri-grams will be <ma, man, anz, nza, zan, ana and na>, and its word embedding will be the sum of the vector representations of these tri-grams. Here, the hyperparameters min_n and max_n are set to 3, and the characters < and > represent the beginning and end of the word.
Therefore, using this methodology, unknown words can be represented in vector form, since there is a high probability that their n-grams are also present in other words.
This algorithm is an improvement over Word2Vec, since in addition to taking into account the words surrounding a word, it also takes into account its n-grams. For example, for the word gato (cat) and n = 2, it also takes into account the n-grams of the word, in this case ga, at and to. A small sketch of how these n-grams can be obtained is shown below.
Limitations of word embeddings
Word embedding techniques have given a decent result, but the problem is that the approach is not precise enough. They do not take into account the order of the words in which they appear, which leads to loss of syntactic and semantic understanding of the sentence.
For example, You go there to teach, not to play and You go there to play, not to teach would have the same representation in vector space, but they do not mean the same thing.
In addition, the word embedding model cannot give satisfactory results on a large amount of text data, since the same word may have a different meaning in a different sentence depending on the context of the sentence.
For example, in I am going to sit on the bank and I am going to do business in the bank, the word bank has different meanings.
Therefore, we require a type of representation that can retain the contextual meaning of the word present in a sentence.
Sentence embeddings
Sentence embedding is similar to word embedding, but instead of words, it encodes the whole sentence into a vector representation.
A simple way to obtain a sentence embedding is to average the word embeddings of all the words present in the sentence, but this is not accurate enough. A minimal sketch of this averaging approach is shown below.
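As a minimal sketch of the averaging approach (the word vectors here are made up for the example):

import torch

# Hypothetical 3-dimensional word embeddings
word_embeddings = {
    "cat":  torch.tensor([0.9, 0.1, 0.0]),
    "eats": torch.tensor([0.1, 0.8, 0.2]),
    "fish": torch.tensor([0.7, 0.3, 0.1]),
}

sentence = ["cat", "eats", "fish"]

# The sentence embedding is the mean of the word embeddings
sentence_embedding = torch.stack([word_embeddings[word] for word in sentence]).mean(dim=0)
print(sentence_embedding)  # tensor([0.5667, 0.4000, 0.1000])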
Some of the most advanced models for sentence embedding are ELMo, InferSent and Sentence-BERT.
ELMo
ELMo or Embeddings from Language Models is a sentence embedding model that was created by the Allen Institute for AI in 2018. It uses a deep bidirectional LSTM network to produce the vector representation. ELMo can represent unknown or out-of-vocabulary words in vector form since it is character-based.
InferSent
InferSent is a sentence embedding model that was created by Facebook in 2017. It uses a deep bidirectional LSTM network to produce the vector representation. InferSent can represent unknown or out-of-vocabulary words in vector form as it is character-based. Sentences are encoded into a 4096-dimensional vector representation.

The training of the model is performed on the Stanford Natural Language Inference (SNLI) dataset. This dataset is labeled and written by humans for about 500K sentence pairs.
Sentence-BERT
Sentence-BERT is a sentence embedding model created in 2019 by researchers at TU Darmstadt. Instead of an LSTM, it uses siamese BERT networks to produce the vector representation, and it can handle unknown or out-of-vocabulary words because BERT uses subword (WordPiece) tokenization. Sentences are encoded into a 768-dimensional vector representation.
The state-of-the-art NLP model BERT
is excellent in Semantic Textual Similarity tasks, but the problem is that it would take a long time for a huge corpus (65 hours for 10,000 sentences), as it requires both sentences to be entered into the network and this increases the computation by a huge factor.
Therefore, Sentence-BERT is a modification of the BERT model, as sketched in the example below.
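As a hedged usage sketch, if the sentence-transformers package is installed (pip install sentence-transformers), a Sentence-BERT style model can be used roughly like this; the model name is just one common choice, not necessarily the one from the original paper:

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence-BERT style model (assumed model name)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = ["I am going to sit on the bank", "I am going to do business in the bank"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))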
Training of a word2vec model with gensim
To download the dataset we are going to use, the datasets library from huggingface must be installed:
pip install datasets
To train the embeddings model we are going to use the gensim library. To install it with conda we use
conda install -c conda-forge gensim
And to install it with pip we use
pip install gensim
To clean the dataset we have downloaded we are going to use regular expressions, which are usually already installed in Python, and nltk, which is a natural language processing library. To install it with conda we use
conda install -c anaconda nltk
And to install it with pip we use
pip install nltk
Now that we have everything installed we can import the libraries we are going to use:
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import strip_punctuation, strip_numeric, strip_short
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
Download dataset
We are going to download a dataset of texts from Wikipedia in Spanish. To do this, we run the following:
from datasets import load_dataset

dataset_corpus = load_dataset('large_spanish_corpus', name='all_wikis')
Let's see what it looks like
dataset_corpus
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 28109484
    })
})
As we can see, the dataset has more than 28 million texts. Let's take a look at some of them:
dataset_corpus['train']['text'][0:10]
['¡Bienvenidos!',
 'Ir a los contenidos»',
 '= Contenidos =',
 '',
 'Portada',
 'Tercera Lengua más hablada en el mundo.',
 'La segunda en número de habitantes en el mundo occidental.',
 'La de mayor proyección y crecimiento día a día.',
 'El español es, hoy en día, nombrado en cada vez más contextos, tomando realce internacional como lengua de cultura y civilización siempre de mayor envergadura.',
 'Ejemplo de ello es que la comunidad minoritaria más hablada en los Estados Unidos es precisamente la que habla idioma español.']
As there are many examples we will create a subset of 10 million examples to work faster:
subset = dataset_corpus['train'].select(range(10000000))
Dataset cleaning
Now we download the stopwords from nltk, which are words that do not provide information and that we are going to remove from the texts.
import nltk
nltk.download('stopwords')
Now we are going to download the punkt tokenizer from nltk, which will allow us to split the texts into sentences.
nltk.download('punkt')
We create a function to clean the data. This function is going to:
- Change the text to lower case
- Remove URLs
- Remove social network mentions such as @twitter and hashtags such as #hashtag
- Remove punctuation marks
- Remove numbers
- Remove short words
- Remove stopwords
As we are using a huggingface dataset, the texts are in dict format, so the function returns a dictionary.
def clean_text(sentence_batch):
    # Extract the text from the input
    text_list = sentence_batch['text']
    cleaned_text_list = []
    for text in text_list:
        # Convert the text to lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        # Remove @ mentions and '#' hashtags from social networks
        text = re.sub(r'@\w+|#\w+', '', text)
        # Remove punctuation characters
        text = strip_punctuation(text)
        # Remove numbers
        text = strip_numeric(text)
        # Remove short words
        text = strip_short(text, minsize=2)
        # Remove common words (stop words)
        stop_words = set(stopwords.words('spanish'))
        word_tokens = word_tokenize(text)
        filtered_text = [word for word in word_tokens if word not in stop_words]
        cleaned_text_list.append(filtered_text)
    # Return the clean text
    return {'text': cleaned_text_list}
We apply the function to the data
sentences_corpus = subset.map(clean_text, batched=True)
[nltk_data] Downloading package stopwords to /home/wallabot/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/wallabot/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Map:   0%|          | 0/10000000 [00:00<?, ? examples/s]
Let's save the filtered dataset in a file to avoid having to run the cleaning process again.
sentences_corpus.save_to_disk("sentences_corpus")
Saving the dataset (0/4 shards): 0%| | 0/15000000 [00:00<?, ? examples/s]
To load it we can do
from datasets import load_from_disk

sentences_corpus = load_from_disk('sentences_corpus')
Now what we are going to have is a list of lists, where each list is a tokenized phrase and without stopwords. That is, we have a list of phrases, and each phrase is a list of words. Let's see what it looks like:
for i in range(10):
    print(f'La frase "{subset["text"][i]}" se convierte en la lista de palabras "{sentences_corpus["text"][i]}"')
La frase "¡Bienvenidos!" se convierte en la lista de palabras "['¡bienvenidos']"La frase "Ir a los contenidos»" se convierte en la lista de palabras "['ir', 'contenidos', '»']"La frase "= Contenidos =" se convierte en la lista de palabras "['contenidos']"La frase "" se convierte en la lista de palabras "[]"La frase "Portada" se convierte en la lista de palabras "['portada']"La frase "Tercera Lengua más hablada en el mundo." se convierte en la lista de palabras "['tercera', 'lengua', 'hablada', 'mundo']"La frase "La segunda en número de habitantes en el mundo occidental." se convierte en la lista de palabras "['segunda', 'número', 'habitantes', 'mundo', 'occidental']"La frase "La de mayor proyección y crecimiento día a día." se convierte en la lista de palabras "['mayor', 'proyección', 'crecimiento', 'día', 'día']"La frase "El español es, hoy en día, nombrado en cada vez más contextos, tomando realce internacional como lengua de cultura y civilización siempre de mayor envergadura." se convierte en la lista de palabras "['español', 'hoy', 'día', 'nombrado', 'cada', 'vez', 'contextos', 'tomando', 'realce', 'internacional', 'lengua', 'cultura', 'civilización', 'siempre', 'mayor', 'envergadura']"La frase "Ejemplo de ello es que la comunidad minoritaria más hablada en los Estados Unidos es precisamente la que habla idioma español." se convierte en la lista de palabras "['ejemplo', 'ello', 'comunidad', 'minoritaria', 'hablada', 'unidos', 'precisamente', 'habla', 'idioma', 'español']"
Training of the word2vec model
We are going to train an embeddings model that will convert words into vectors. For this we are going to use the gensim library and its Word2Vec model.
dataset = sentences_corpus['text']
dim_embedding = 100
window_size = 5   # 5 words to the left and 5 words to the right
min_count = 5     # Ignore words with a frequency lower than 5
workers = 4       # Number of execution threads
sg = 1            # 0 for CBOW, 1 for Skip-gram

model = Word2Vec(dataset, vector_size=dim_embedding, window=window_size, min_count=min_count, workers=workers, sg=sg)
This model has been trained on the CPU, since gensim does not have the option to perform the training on the GPU, and even so it has taken X minutes on my computer to train the model. Although the size of the embedding we have chosen is only 100 (as opposed to the 1536 of the OpenAI embeddings), it is not a very long time, since the dataset has 10 million sentences.
Large language models are trained with datasets of billions of sentences, so it is normal that training an embeddings model with a dataset of 10 million sentences takes a few minutes.
Once the model is trained we save it in a file for future use.
model.save('word2vec.model')
If we would like to load it in the future, we can do so with
model = Word2Vec.load('word2vec.model')
Evaluation of the word2vec model
Let's see the words most similar to some given words.
model.wv.most_similar('perro', topn=10)
[('gato', 0.7948548197746277),
 ('perros', 0.77247554063797),
 ('cachorro', 0.7638891339302063),
 ('hámster', 0.7540281414985657),
 ('caniche', 0.7514827251434326),
 ('bobtail', 0.7492328882217407),
 ('mastín', 0.7491254210472107),
 ('lobo', 0.7312178611755371),
 ('semental', 0.7292628288269043),
 ('sabueso', 0.7290207147598267)]
model.wv.most_similar('gato', topn=10)
[('conejo', 0.8148329854011536),
 ('zorro', 0.8109457492828369),
 ('perro', 0.7948548793792725),
 ('lobo', 0.7878773808479309),
 ('ardilla', 0.7860757112503052),
 ('mapache', 0.7817519307136536),
 ('huiña', 0.766639232635498),
 ('oso', 0.7656188011169434),
 ('mono', 0.7633568644523621),
 ('camaleón', 0.7623056769371033)]
Now let's repeat the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman.
embedding_hombre = model.wv['hombre']
embedding_mujer = model.wv['mujer']
embedding_rey = model.wv['rey']
embedding_reina = model.wv['reina']
embedding = embedding_rey - embedding_hombre + embedding_mujer
import torch
from torch.nn.functional import cosine_similarity

embedding = torch.tensor(embedding).unsqueeze(0)
embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)

similarity = cosine_similarity(embedding, embedding_reina, dim=1)
similarity
tensor([0.8156])
As we can see, there is a lot of similarity.
Visualization of embeddings
We are going to visualize the embeddings. To do this, we first obtain the vectors and the words from the model.
embeddings = model.wv.vectors
words = list(model.wv.index_to_key)
As the dimension of the embeddings is 100, in order to visualize them in 2 or 3 dimensions we have to reduce their dimensionality. For this we will use PCA (faster) or TSNE (more precise) from sklearn.
from sklearn.decomposition import PCA

dimmesions = 2
pca = PCA(n_components=dimmesions)
reduced_embeddings_PCA = pca.fit_transform(embeddings)
from sklearn.manifold import TSNE

dimmesions = 2
tsne = TSNE(n_components=dimmesions, verbose=1, perplexity=40, n_iter=300)
reduced_embeddings_tsne = tsne.fit_transform(embeddings)
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 493923 samples in 0.013s...
[t-SNE] Computed neighbors for 493923 samples in 377.143s...
[t-SNE] Computed conditional probabilities for sample 1000 / 493923
[t-SNE] Computed conditional probabilities for sample 2000 / 493923
...
[t-SNE] Computed conditional probabilities for sample 493923 / 493923
[t-SNE] Mean sigma: 0.275311
[t-SNE] KL divergence after 250 iterations with early exaggeration: 117.413788
[t-SNE] KL divergence after 300 iterations: 5.774648
Now we visualize them in 2 dimensions with matplotlib. Let's visualize the dimensionality reduction we have done with PCA and with TSNE.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.title('Embeddings (PCA)')
plt.show()
plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
Use of pre-trained models with huggingface
To use pre-trained embeddings models we will use the transformers library from huggingface. To install it with conda we use
conda install -c conda-forge transformers
And to install it with pip we use
pip install transformers
With the feature-extraction task of huggingface we can use pre-trained models to obtain the embeddings of words. To do this we first import the necessary library.
from transformers import pipeline
Let's get the embeddings from BERT.
checkpoint = "bert-base-uncased"
feature_extractor = pipeline("feature-extraction", framework="pt", model=checkpoint)
Let's see the embeddings of the word rey (king).
embedding = feature_extractor("rey", return_tensors="pt").squeeze(0)
embedding.shape
torch.Size([3, 768])
As we can see, we obtain vectors of 768 dimensions, that is to say, the embeddings of BERT have 768 dimensions. On the other hand, we see that there are 3 embedding vectors; this is because BERT adds a token at the beginning and another at the end of the sentence, so we are only interested in the middle vector.
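To check where those 3 vectors come from, we can look at how the tokenizer splits the input. This is a small sketch (assuming the same bert-base-uncased checkpoint as above); the exact tokens may vary with the word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_ids = tokenizer("rey")["input_ids"]
print(tokenizer.convert_ids_to_tokens(input_ids))
# Expected something like ['[CLS]', 'rey', '[SEP]'] -> the two special tokens explain the extra vectors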
Let us redo the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman.
embedding_hombre = feature_extractor("man", return_tensors="pt").squeeze(0)[1]
embedding_mujer = feature_extractor("woman", return_tensors="pt").squeeze(0)[1]
embedding_rey = feature_extractor("king", return_tensors="pt").squeeze(0)[1]
embedding_reina = feature_extractor("queen", return_tensors="pt").squeeze(0)[1]
embedding = embedding_rey - embedding_hombre + embedding_mujer
Let's see the similarity
import torch
from torch.nn.functional import cosine_similarity
embedding = torch.tensor(embedding).unsqueeze(0)
embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)
similarity = cosine_similarity(embedding, embedding_reina, dim=1)
similarity.item()
Using the embeddings of BERT we also get a result very close to 1.