In a previous post about tokens, we already saw the most basic representation of each word, which consists of assigning a number to each minimal unit a word is split into. However, transformers, and therefore LLMs, do not represent the information of words in this way, but do so by means of embeddings.
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
We are first going to look at two ways of representing words: ordinal encoding and one hot encoding. By seeing the problems of these two types of representation, we will be able to get to word embeddings and sentence embeddings.
We will also see an example of how to train a word embeddings model with the gensim library.
And finally we will see how to use pre-trained embeddings models with the transformers library from HuggingFace.
This is the most basic way to represent words inside transformers. It consists of giving a number to each word, or keeping the numbers already assigned to the tokens.
However, this type of representation has two problems:
- Let us imagine that table corresponds to token 3, cat to token 1 and dog to token 2. One could assume that table = cat + dog, but it is not so: there is no such relationship between these words (see the sketch after this list). We might even think that by assigning the right tokens this kind of relationship could appear, but this thought falls apart with words that have more than one meaning, such as the word bank, for example.
- The second problem is that neural networks internally do a lot of numerical calculations, so it could happen that, because table has token 3, it internally carries more weight than the word cat, which has token 1.
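A minimal sketch of the first problem, with a made-up toy vocabulary:
vocab = {"cat": 1, "dog": 2, "table": 3}  # hypothetical ordinal encoding
# The arithmetic "works" numerically...
print(vocab["cat"] + vocab["dog"] == vocab["table"])  # True
# ...but the equality is an accident of the numbering, not a semantic relationship:
# nothing about cats and dogs "adds up" to a table.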
So this type of word representation can be discarded very quickly.
Here what is done is to use N-dimensional vectors. For example, we saw that OpenAI has a vocabulary of 100277 distinct tokens, so if we use one hot encoding, each word would be represented with a vector of 100277 dimensions.
However, one hot encoding has two other major problems:
- It does not take into account the relationship between words. So if we have two words that are synonyms, such as cat and feline, we would have two different vectors to represent them. In language the relationship between words is very important, and not taking this relationship into account is a big problem.
- The second problem is that the vectors are very large. If we have a vocabulary of 100277 tokens, each word would be represented by a vector of 100277 dimensions. This makes the vectors very large and computationally very expensive. In addition, these vectors are all zeros except at the position corresponding to the word’s token, so most of the calculations are multiplications by zero, which add nothing. We end up with a lot of memory allocated to vectors that only have a 1 at a given position, as the sketch below illustrates.
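A minimal sketch of both problems, using pytorch’s one_hot function (the token ids are made up):
import torch
import torch.nn.functional as F
vocab_size = 100277  # size of the vocabulary (e.g. OpenAI's tokenizer)
one_hot = F.one_hot(torch.tensor(3), num_classes=vocab_size).float()  # hypothetical token for "table"
print(one_hot.shape)         # torch.Size([100277]) -> a huge, almost-all-zeros vector
print(one_hot.sum().item())  # 1.0 -> a single 1 at position 3
# Synonyms get orthogonal vectors, so their dot product (and cosine similarity) is 0
cat = F.one_hot(torch.tensor(1), num_classes=vocab_size).float()
feline = F.one_hot(torch.tensor(2), num_classes=vocab_size).float()
print(torch.dot(cat, feline).item())  # 0.0 -> no notion of "closeness" between synonyms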
With word embeddings we try to solve the problems of the two previous types of representation. To do this, N-dimensional vectors are used, but in this case not vectors of 100277 dimensions, but vectors with far fewer dimensions. For example, we will see that OpenAI uses 1536 dimensions.
Each of the dimensions of these vectors represents a characteristic of the word. For example one of the dimensions could represent whether the word is a verb or a noun. Another dimension might represent whether the word is an animal or not. Another dimension might represent whether the word is a proper noun or not. And so on.
However, these features are not defined by hand, but are learned automatically. During the training of the transformers, the values of each of the dimensions of the vectors are adjusted, so that the characteristics of each of the words are learned.
By making each of the dimensions of the vectors represent a characteristic of the word, words that have similar characteristics will have similar vectors. For example the words cat and feline will have very similar vectors, since they are both animals, and the words table and chair will have similar vectors, since both are furniture.
In the following image we can see a 3-dimensional representation of words, where all the words related to school are close together, all the words related to food are close together and all the words related to ball are close together.
Having each of the dimensions of the vectors represent a characteristic of the word allows us to perform operations with words. For example, if we subtract the word man from the word king and add the word woman, we get a word very similar to the word queen. We will check this later with an example.
As each word is represented by a vector of N dimensions, we can calculate the similarity between two words. The cosine similarity function is used for this purpose.
If two words are close in vector space, it means that the angle between their vectors is small, so their cosine is close to 1. If there is an angle of 90 degrees between the vectors, the cosine is 0, meaning that there is no similarity between the words. And if there is an angle of 180 degrees between the vectors, the cosine is -1, that is, the words are opposites.
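As a quick sketch, the cosine similarity is just the dot product of the two vectors divided by the product of their norms, which is what pytorch’s cosine_similarity computes for us (the vectors here are made up):
import torch
from torch.nn.functional import cosine_similarity
a = torch.tensor([1.0, 2.0, 3.0])  # made-up embedding
b = torch.tensor([1.5, 1.8, 3.2])  # made-up embedding pointing in a similar direction
# Manual computation: cos(theta) = (a . b) / (|a| * |b|)
manual = torch.dot(a, b) / (a.norm() * b.norm())
builtin = cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
print(manual.item(), builtin)  # both very close to 1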
Now that we know what embeddings are, let’s see some examples with the embeddings provided by the OpenAI API.
To do this we first need to have the OpenAI package installed.
pip install openai
We import the necessary libraries
from openai import OpenAI
import torch
from torch.nn.functional import cosine_similarity
api_key = "Put your API key here"
We select which embeddings model we want to use. In this case we are going to use text-embedding-ada-002, which is the one recommended by OpenAI in its embeddings documentation.
model_openai = "text-embedding-ada-002"
We create an API client
client_openai = OpenAI(api_key=api_key, organization=None)
Let’s see what the embeddings of the word Rey (king) look like.
word = "Rey"
embedding_openai = torch.Tensor(client_openai.embeddings.create(input=word, model=model_openai).data[0].embedding)
embedding_openai.shape, embedding_openai
As we can see, we obtain a vector of 1536 dimensions.
Let’s get the embeddings of the words king, man, woman and queen.
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="rey", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="hombre", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="mujer", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="reina", model=model_openai).data[0].embedding)
embedding_openai_reina.shape, embedding_openai_reina
Let’s obtain the embedding resulting from subtracting the embedding of man from that of king and adding the embedding of woman.
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer
embedding_openai.shape, embedding_openai
Finally we compare the result obtained with the embedding of reina. For this we use the cosine_similarity function provided by the pytorch library.
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0)).item()
print(f"similarity_openai: {similarity_openai}")
As we can see, it is a value very close to 1, so we can say that the result obtained is very similar to the embedding of reina.
If we use English words, we get a result closer to 1.
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="king", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="man", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="woman", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="queen", model=model_openai).data[0].embedding)
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0)).item()
print(f"similarity_openai: {similarity_openai}")
This is normal, since the OpenAI model has been trained with more texts in English than in Spanish.
There are several types of word embeddings, and each of them has its advantages and disadvantages. Let’s take a look at the most important ones:
- Word2Vec
- GloVe
- FastText
- BERT
- GPT-2
Word2Vec is an algorithm used to create word embeddings. This algorithm was created by Google in 2013, and it is one of the most used algorithms to create word embeddings.
It has two variants, CBOW and Skip-gram. CBOW is faster to train, while Skip-gram is more accurate. Let’s see how each of them works.
CBOW or Continuous Bag of Words is an algorithm used to predict a word from its surrounding words. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the word cat from the surrounding words, in this case The, is, an and animal.
In this architecture, the model predicts which is the most likely word in the given context. Therefore, words that have the same probability of appearing are considered similar and are therefore closer in dimensional space.
Suppose that in a sentence we replace boat with ship: the model predicts the probability for both, and if it turns out to be similar then we can consider that the two words are similar.
Skip-gram or Skip-gram with Negative Sampling is an algorithm used to predict the words surrounding a given word. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the words The, is, an and animal from the word cat.
This architecture is similar to that of CBOW, but instead the model works backwards. The model predicts the context using the given word. Therefore, words that have the same context are considered similar and are therefore closer in dimensional space.
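To make the difference concrete, here is a small sketch (purely illustrative, not part of any library) that builds the training pairs each variant would see for a toy sentence and a window of 2 words on each side:
sentence = "the cat is an animal".split()
window = 2
cbow_pairs = []      # CBOW: predict the center word from its context words
skipgram_pairs = []  # Skip-gram: predict each context word from the center word
for i, target in enumerate(sentence):
    context = [sentence[j] for j in range(max(0, i - window), min(len(sentence), i + window + 1)) if j != i]
    cbow_pairs.append((context, target))
    skipgram_pairs.extend((target, c) for c in context)
print(cbow_pairs[1])       # (['the', 'is', 'an'], 'cat')
print(skipgram_pairs[:3])  # [('the', 'cat'), ('the', 'is'), ('cat', 'the')]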
GloVe or Global Vectors for Word Representation is an algorithm used to create word embeddings. This algorithm was created by Stanford University in 2014.
Word2Vec ignores the fact that some context words occur more frequently than others, and it also only takes into account the local context, so it does not capture the global context.
This algorithm uses a co-occurrence matrix to create the word embeddings. This co-occurrence matrix is a matrix that contains the number of times each word appears next to each of the other words in the vocabulary.
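As a rough sketch of what such a co-occurrence matrix contains, here is a toy example (made-up corpus, window of one word on each side):
from collections import defaultdict
corpus = ["the cat eats fish", "the dog eats meat"]
window = 1
cooccurrence = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooccurrence[(word, words[j])] += 1
# GloVe factorizes this matrix of counts to obtain the word vectors
print(dict(cooccurrence))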
FastText is an algorithm used to create word embeddings. This algorithm was created by Facebook in 2016.
One of the main disadvantages of Word2Vec and GloVe is that they cannot encode unknown or out-of-vocabulary words.
So, to deal with this problem, Facebook proposed the FastText model. It is an extension of Word2Vec and follows the same Skip-gram and CBOW models. But unlike Word2Vec, which feeds whole words into the neural network, FastText first splits words into several subwords (or n-grams) and then feeds them to the neural network.
For example, if the value of n is 3 and the word is manzana (apple), then its tri-grams will be [<ma, man, anz, nza, zan, ana, na>] and its word embedding will be the sum of the vector representations of these tri-grams. Here, the hyperparameters min_n and max_n are both set to 3, and the characters < and > represent the beginning and end of the word.
Therefore, using this methodology, unknown words can be represented in vector form, since there is a high probability that their n-grams are also present in other words.
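A minimal sketch of how those character n-grams can be generated (the < and > characters mark the beginning and end of the word):
def char_ngrams(word, n=3):
    # Add the boundary markers used by FastText
    word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]
print(char_ngrams("manzana"))  # ['<ma', 'man', 'anz', 'nza', 'zan', 'ana', 'na>']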
This algorithm is an improvement over Word2Vec, since in addition to taking into account the words surrounding a word, it also takes into account the n-grams of the word itself. For example, for the word gato (cat) and n = 2, it also takes into account the n-grams of the word, in this case ga, at and to.
Word embedding techniques have given a decent result, but the problem is that the approach is not precise enough. They do not take into account the order of the words in which they appear, which leads to loss of syntactic and semantic understanding of the sentence.
For example, You go there to teach, not to play and You go there to play, not to teach: both sentences will have the same representation in vector space, but they do not mean the same thing.
In addition, the word embedding model cannot give satisfactory results on a large amount of text data, since the same word may have a different meaning in a different sentence depending on the context of the sentence.
For example, I am going to sit on the bank and I am going to do business in the bank: in both sentences the word bank has a different meaning.
Therefore, we require a type of representation that can retain the contextual meaning of the word present in a sentence.
Sentence embeddings are similar to word embeddings, but instead of individual words they encode the whole sentence into a vector representation.
A simple way to obtain a sentence embedding is to average the word embeddings of all the words present in the sentence, but this is not accurate enough.
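As a sketch, averaging word vectors with a gensim Word2Vec model (like the one we train later in this post) would look something like this; the helper function and usage are hypothetical:
import numpy as np
def sentence_embedding_by_averaging(sentence, model):
    # Keep only the words the Word2Vec model knows
    vectors = [model.wv[word] for word in sentence.split() if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else None
# Hypothetical usage with the model trained further down:
# embedding = sentence_embedding_by_averaging("rey reina", model)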
Some of the most advanced models for sentence embeddings are ELMo, InferSent and Sentence-BERT.
ELMo or Embeddings from Language Models is a sentence embedding model created by the Allen Institute for AI in 2018. It uses a deep bidirectional LSTM network to produce the vector representation. ELMo can represent unknown or out-of-vocabulary words in vector form since it is character-based.
InferSent is a sentence embedding model created by Facebook in 2017. It uses a deep bidirectional LSTM network to produce the vector representation. InferSent can represent unknown or out-of-vocabulary words in vector form as it is character-based. Sentences are encoded into a 4096-dimensional vector representation.
The model is trained on the Stanford Natural Language Inference (SNLI) dataset, which contains about 500K sentence pairs labeled and written by humans.
Sentence-BERT is a sentence embedding model created in 2019 by researchers at the Technical University of Darmstadt. It uses siamese and triplet BERT network structures to produce the vector representation, and it can handle unknown or out-of-vocabulary words thanks to BERT’s subword tokenization. Sentences are encoded into a 768-dimensional vector representation.
The state-of-the-art NLP model BERT is excellent at Semantic Textual Similarity tasks, but the problem is that it would take a very long time for a huge corpus (65 hours for 10,000 sentences), since it requires both sentences to be fed into the network together, which increases the computation enormously.
Therefore, Sentence-BERT is a modification of the BERT model.
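In practice, Sentence-BERT models are usually used through the sentence-transformers library. A minimal usage sketch (the checkpoint name is just one of the publicly available multilingual models, chosen here for illustration):
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint
embeddings = model.encode(["I am going to sit on the bank", "I am going to do business in the bank"])
print(embeddings.shape)                                   # (2, 384) for this checkpoint
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # similarity between the two sentences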
To download the dataset we are going to use, the datasets library from huggingface must be installed:
pip install datasets
To train the embeddings model we are going to use the gensim library. To install it with conda we use
conda install -c conda-forge gensim
And to install it with pip we use
pip install gensim
To clean the dataset we have downloaded we are going to use regular expressions (the re module, which is part of the Python standard library) and nltk, which is a natural language processing library. To install it with conda we use
conda install -c anaconda nltk
And to install it with pip we use
pip install nltk
Now that we have everything installed we can import the libraries we are going to use:
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import strip_punctuation, strip_numeric, strip_short
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
We are going to download a dataset of texts from wikipedia in Spanish, for this we execute the following:
from datasets import load_dataset
dataset_corpus = load_dataset('large_spanish_corpus', name='all_wikis')
Let’s see what it looks like
dataset_corpus
As we can see, the dataset has more than 28 million texts. Let’s take a look at some of them:
dataset_corpus['train']['text'][0:10]
As there are many examples we will create a subset of 10 million examples to work faster:
subset = dataset_corpus['train'].select(range(10000000))
Now we download the stopwords from nltk, which are words that do not provide information and that we are going to remove from the texts.
import nltk
nltk.download('stopwords')
Now we are going to download the punkt tokenizer from nltk, which will allow us to split the texts into sentences.
nltk.download('punkt')
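As a quick illustration (not used in the cleaning function below, which works at word level), punkt lets nltk split a text into sentences and words:
from nltk.tokenize import sent_tokenize, word_tokenize
text = "Esto es una frase. Esto es otra frase."
print(sent_tokenize(text, language='spanish'))  # ['Esto es una frase.', 'Esto es otra frase.']
print(word_tokenize(text, language='spanish'))  # ['Esto', 'es', 'una', 'frase', '.', ...]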
We create a function to clean the data, this function is going to:
- Change the text to lower case
- Remove urls
- Remove mentions to social networks such as @twitter and hashtags such as #hashtag
- Eliminate punctuation marks
- Eliminate numbers
- Eliminate short words
- Eliminate stopwords
As we are using a huggingface dataset, the texts come in dict format, so we return a dictionary.
def clean_text(sentence_batch):
    # Extract the text from the input batch
    text_list = sentence_batch['text']
    cleaned_text_list = []
    for text in text_list:
        # Convert the text to lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        # Remove social media @ mentions and '#' hashtags
        text = re.sub(r'\@\w+|\#\w+', '', text)
        # Remove punctuation marks
        text = strip_punctuation(text)
        # Remove numbers
        text = strip_numeric(text)
        # Remove short words
        text = strip_short(text, minsize=2)
        # Remove common words (stop words)
        stop_words = set(stopwords.words('spanish'))
        word_tokens = word_tokenize(text)
        filtered_text = [word for word in word_tokens if word not in stop_words]
        cleaned_text_list.append(filtered_text)
    # Return the cleaned text
    return {'text': cleaned_text_list}
We apply the function to the data
sentences_corpus = subset.map(clean_text, batched=True)
Let’s save the filtered dataset in a file to avoid having to run the cleaning process again.
sentences_corpus.save_to_disk("sentences_corpus")
To load it we can do
from datasets import load_from_disk
sentences_corpus = load_from_disk('sentences_corpus')
Now what we have is a list of lists, where each list is a tokenized sentence without stopwords. That is, we have a list of sentences, and each sentence is a list of words. Let’s see what it looks like:
for i in range(10):
    print(f'The sentence "{subset["text"][i]}" becomes the list of words "{sentences_corpus["text"][i]}"')
We are going to train an embeddings model that will convert words into vectors. For this we are going to use the gensim library and its Word2Vec model.
dataset = sentences_corpus['text']
dim_embedding = 100
window_size = 5 # 5 words to the left and 5 words to the right
min_count = 5   # Ignore words with a frequency lower than 5
workers = 4     # Number of worker threads
sg = 1          # 0 for CBOW, 1 for Skip-gram
model = Word2Vec(dataset, vector_size=dim_embedding, window=window_size, min_count=min_count, workers=workers, sg=sg)
This model has been trained on the CPU, since gensim does not have the option to perform the training on the GPU, and even so it took X minutes to train on my computer. Although the embedding size we have chosen is only 100 (as opposed to the 1536 dimensions of the OpenAI embeddings), this is not a very long time, considering that the dataset has 10 million sentences.
Large language models are trained with datasets of billions of sentences, so it is normal that training an embeddings model with a dataset of 10 million sentences takes a few minutes.
Once the model is trained we save it in a file for future use.
model.save('word2vec.model')
If we would like to load it in the future, we can do so with
model = Word2Vec.load('word2vec.model')
Let’s look at the words most similar to a couple of words.
model.wv.most_similar('perro', topn=10)
model.wv.most_similar('gato', topn=10)
Now let’s see the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman.
embedding_hombre = model.wv['hombre']
embedding_mujer = model.wv['mujer']
embedding_rey = model.wv['rey']
embedding_reina = model.wv['reina']
embedding = embedding_rey - embedding_hombre + embedding_mujer
import torch
from torch.nn.functional import cosine_similarity
embedding = torch.tensor(embedding).unsqueeze(0)
embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)
similarity = cosine_similarity(embedding, embedding_reina, dim=1)
similarity
As we can see, there is a lot of similarity
We are going to visualize the embeddings, for this we first obtain the vectors and the words of the model
embeddings = model.wv.vectors
words = list(model.wv.index_to_key)
As the dimension of the embeddings is 100, to visualize them in 2 or 3 dimensions we have to reduce the dimensionality. For this we will use PCA (faster) or TSNE (more precise) from sklearn.
from sklearn.decomposition import PCA
dimensions = 2
pca = PCA(n_components=dimensions)
reduced_embeddings_PCA = pca.fit_transform(embeddings)
from sklearn.manifold import TSNE
dimensions = 2
tsne = TSNE(n_components=dimensions, verbose=1, perplexity=40, n_iter=300)
reduced_embeddings_tsne = tsne.fit_transform(embeddings)
Now we visualize them in 2 dimensions with matplotlib. Let’s visualize the dimensionality reduction we have done with PCA and with TSNE.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.title('Embeddings (PCA)')
plt.show()
plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
To use pre-trained embeddings models we will use the transformers library from huggingface. To install it with conda we use
conda install -c conda-forge transformers
And to install it with pip we use
pip install transformers
With the feature-extraction task of huggingface we can use pre-trained models to obtain the embeddings of the words. To do this we first import the necessary library
from transformers import pipeline
Let’s get the embeddings from BERT.
checkpoint = "bert-base-uncased"
feature_extractor = pipeline("feature-extraction",framework="pt",model=checkpoint)
Let’s see the embeddings of the word rey (king).
embedding = feature_extractor("rey", return_tensors="pt").squeeze(0)
embedding.shape
As we can see, we obtain a vector of 768 dimensions, that is to say, the BERT embeddings have 768 dimensions. On the other hand, we see that we get 3 embedding vectors; this is because BERT adds a token at the beginning and another at the end of the sentence, so we are only interested in the middle vector.
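We can verify those extra tokens by looking at what the tokenizer produces (a quick check using the same bert-base-uncased checkpoint):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ids = tokenizer("rey")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'rey', '[SEP]'] (or more tokens if the word is split into subwords)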
Let us redo the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman.
embedding_hombre = feature_extractor("man", return_tensors="pt").squeeze(0)[1]
embedding_mujer = feature_extractor("woman", return_tensors="pt").squeeze(0)[1]
embedding_rey = feature_extractor("king", return_tensors="pt").squeeze(0)[1]
embedding_reina = feature_extractor("queen", return_tensors="pt").squeeze(0)[1]
embedding = embedding_rey - embedding_hombre + embedding_mujer
Let’s see the similarity
import torch
from torch.nn.functional import cosine_similarity
embedding = torch.tensor(embedding).unsqueeze(0)
embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)
similarity = cosine_similarity(embedding, embedding_reina, dim=1)
similarity.item()
Using the BERT embeddings we also get a result very close to 1.