Embeddings

In this post we explore the world of embeddings in artificial intelligence, an essential technique that transforms complex data, such as text, into numerical formats understandable to machine learning models. Discover how embeddings capture meanings and relationships in data, facilitating language processing, and understand their impact on the advancement of LLM technology.


In a previous post about tokens, we already saw the minimal representation of each word, which consists of assigning a number to the smallest division of each word.

However, transformers, and therefore LLMs, do not represent word information this way; instead, they use embeddings.

This notebook has been automatically translated to make it accessible to more people; please let me know if you see any typos.

We will first look at two ways of representing words: ordinal encoding and one hot encoding. By seeing the problems of these two types of representation, we will be able to get to word embeddings and sentence embeddings.

We will also see an example of how to train a word embeddings model with the gensim library.

And finally we will see how to use pre-trained models of embeddings with the transformers library of HuggingFace.

Ordinal encoding is the most basic way to represent words inside transformers. It consists of giving a number to each word, or keeping the numbers already assigned to the tokens.

However, this type of representation has two problems:

  • Let us imagine that table corresponds to token 3, cat to token 1 and dog to token 2. One could assume that table = cat + dog, but it is not so. There is no such relationship between these words. We might even think that by assigning the correct tokens, this type of relationship could occur. However, this thought falls apart with words that have more than one meaning, such as the word bank, for example.

  • The second problem is that neural networks internally perform a lot of numerical calculations, so it could happen that, because table has token 3, it is internally given more importance than the word cat, which has token 1.

So this type of word representation can be discarded very quickly.
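As a minimal sketch (with a hypothetical three-word vocabulary), this is roughly what ordinal encoding looks like and why arithmetic on the token ids is meaningless:

Code:

vocab = {"cat": 1, "dog": 2, "table": 3}  # hypothetical toy vocabulary

# The ids suggest relationships that do not exist:
print(vocab["cat"] + vocab["dog"] == vocab["table"])  # True, but a cat plus a dog is not a table

# And the ids give an arbitrary "magnitude" to each word:
print(vocab["table"] > vocab["cat"])  # True, but table is not "more" than cat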

One hot encoding uses N-dimensional vectors instead. For example, we saw that OpenAI has a vocabulary of 100277 distinct tokens, so with one hot encoding each word would be represented by a vector of 100277 dimensions.

However, the one hot encoding has two other major problems

  • It does not take into account the relationship between words. So if we have two words that are synonyms, such as cat and feline, we would have two different vectors to represent them.

In language the relationship between words is very important, and not taking this relationship into account is a big problem.

  • The second problem is that vectors are very large. If we have a vocabulary of 100277 tokens, each word would be represented by a vector of 100277 dimensions. This makes the vectors very large and computationally very expensive. In addition these vectors are going to be all zeros, except in the position corresponding to the word token. So most of the calculations are going to be multiplications by zero, which are calculations that don’t add anything. So we are going to have a lot of memory allocated to vectors where you only have a 1 at a given position.
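As a rough sketch (using torch, and assuming the 100277-token vocabulary mentioned above), this is what a one hot vector looks like and how sparse it is:

Code:

import torch

vocab_size = 100277  # vocabulary size, e.g. OpenAI's tokenizer
token_id = 42        # hypothetical token id of some word

# One hot vector: all zeros except a 1 at the token position
one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0

print(one_hot.shape)         # torch.Size([100277])
print((one_hot == 0).sum())  # tensor(100276) -> almost everything is zero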

Word embeddings try to solve the problems of the two previous types of representation. N-dimensional vectors are still used, but instead of 100277 dimensions, vectors with far fewer dimensions are used. For example, we will see that OpenAI uses 1536 dimensions.

Each of the dimensions of these vectors represents a characteristic of the word. For example one of the dimensions could represent whether the word is a verb or a noun. Another dimension might represent whether the word is an animal or not. Another dimension might represent whether the word is a proper noun or not. And so on.

However, these features are not defined by hand, but are learned automatically. During the training of the transformers, the values of each of the dimensions of the vectors are adjusted, so that the characteristics of each of the words are learned.

By making each of the word dimensions represent a characteristic of the word, words that have similar characteristics will have similar vectors. For example the words cat and feline will have very similar vectors, since they are both animals. And the words table and chair will have similar vectors, since both are furniture.

In the following image we can see a 3-dimensional representation of words, and we can see that all words related to school are close, all words related to food are close and all words related to ball are close.

(Figure: words represented in a 3-dimensional embedding space)

Having each of the dimensions of the vectors represent a characteristic of the word allows us to perform operations with words. For example, if we subtract the word man from the word king and add the word woman, we get a word very similar to the word queen. We will check this later with an example.

As each word is represented by a vector of N dimensions, we can calculate the similarity between two words. The cosine similarity function is used for this purpose.

If two words are close in vector space, it means that the angle between their vectors is small, so their cosine is close to 1. If there is an angle of 90 degrees between the vectors, the cosine is 0, meaning that there is no similarity between the words. And if there is an angle of 180 degrees between the vectors, the cosine is -1, that is, the words are opposites.

(Figure: cosine similarity between vectors)
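As a small sketch of this formula (the cosine is the dot product of the two vectors divided by the product of their norms), here it is computed for two made-up 3-dimensional vectors:

Code:

import torch
from torch.nn.functional import cosine_similarity

# Two made-up 3-dimensional "word" vectors
v1 = torch.tensor([1.0, 2.0, 3.0])
v2 = torch.tensor([1.5, 1.8, 3.2])

# Manual computation: dot product divided by the product of the norms
manual = torch.dot(v1, v2) / (v1.norm() * v2.norm())

# Same result with pytorch's cosine_similarity (it expects a batch dimension)
builtin = cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0))

print(manual, builtin)  # both close to 1, since the vectors point in a similar direction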

Now that we know what embeddings are, let’s see some examples with the embeddings provided by the API of OpenAI.

To do this we first need to have the OpenAI package installed.

pip install openai

We import the necessary libraries

Code:

from openai import OpenAI

import torch

from torch.nn.functional import cosine_similarity

We use an OpenAI API key. To do this, go to the OpenAI page, and register. Once registered, go to the API Keys section, and create a new API Key.

(Image: OpenAI API Keys page)

Code:

api_key = "Put your API key here"

We select which embeddings model we want to use. In this case we are going to use text-embedding-ada-002 which is the one recommended by OpenAI in its embeddings documentation.

Code:

model_openai = "text-embedding-ada-002"

Create an API client

Code:

client_openai = OpenAI(api_key=api_key, organization=None)

Let’s see what the embeddings of the word Rey (king in Spanish) look like.

Code:

word = "Rey"

embedding_openai = torch.Tensor(client_openai.embeddings.create(input=word, model=model_openai).data[0].embedding)

embedding_openai.shape, embedding_openai

Output:

(torch.Size([1536]),

tensor([-0.0103, -0.0005, -0.0189, ..., -0.0009, -0.0226, 0.0045]))

As we can see, we obtain a vector of 1536 dimensions.

Let’s get the embeddings of the words rey (king), hombre (man), mujer (woman) and reina (queen).

Code:

embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="rey", model=model_openai).data[0].embedding)

embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="hombre", model=model_openai).data[0].embedding)

embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="mujer", model=model_openai).data[0].embedding)

embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="reina", model=model_openai).data[0].embedding)

Code:

embedding_openai_reina.shape, embedding_openai_reina

Output:

(torch.Size([1536]),

tensor([-0.0110, -0.0084, -0.0115, ..., 0.0082, -0.0096, -0.0024]))

Let’s obtain the embedding resulting from subtracting the embedding of man from king and adding the embedding of woman to king.

Code:

embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer

Code:

embedding_openai.shape, embedding_openai

Output:

(torch.Size([1536]),

tensor([-0.0226, -0.0323, 0.0017, ..., 0.0014, -0.0290, -0.0188]))

Finally we compare the result obtained with the embedding of reina. For this we use the cosine_similarity function provided by the pytorch library.

Code:

similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0)).item()

print(f"similarity_openai: {similarity_openai}")

Output:

similarity_openai: 0.7564167976379395

As we can see it is a value very close to 1, so we can say that the result obtained is very similar to the embedding of reina.

If we use English words, we get a result closer to 1.

Code:

embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="king", model=model_openai).data[0].embedding)

embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="man", model=model_openai).data[0].embedding)

embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="woman", model=model_openai).data[0].embedding)

embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="queen", model=model_openai).data[0].embedding)

Code:

embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer

Code:

similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0))

print(f"similarity_openai: {similarity_openai}")

Output:

similarity_openai: tensor([0.8849])

This is normal, since the OpenAI model has been trained with more texts in English than in Spanish.

There are several types of word embeddings, and each of them has its advantages and disadvantages. Let’s take a look at the most important ones:

  • Word2Vec
  • GloVe
  • FastText
  • BERT
  • GPT-2

Word2Vec is an algorithm used to create word embeddings. This algorithm was created by Google in 2013, and it is one of the most used algorithms to create word embeddings.

It has two variants, CBOW and Skip-gram. CBOW is faster to train, while Skip-gram is more accurate. Let’s see how each of them works

CBOW or Continuous Bag of Words is an algorithm used to predict a word from the surrounding words. For example, if we have the sentence The cat is an animal, the algorithm will try to predict the word cat from the surrounding words, in this case The, is, an and animal.

(Figure: CBOW architecture)

In this architecture, the model predicts which is the most likely word in the given context. Therefore, words that have the same probability of appearing are considered similar and are therefore closer in dimensional space.

Suppose that in a sentence we replace boat with ship: the model predicts a similar probability for both words in that context, so we can consider the two words similar.

Skip-gram or Skip-gram with Negative Sampling is an algorithm used to predict the words surrounding a word. For example if we have the sentence The cat is an animal, the algorithm will try to predict the words The, is, an and animal from the word cat.

(Figure: Skip-gram architecture)

This architecture is similar to that of CBOW, but instead the model works backwards. The model predicts the context using the given word. Therefore, words that have the same context are considered similar and are therefore closer in dimensional space.
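A minimal sketch (not the real Word2Vec training code, just an illustration) of the training examples each variant sees for the sentence The cat is an animal, with a window of 2 words:

Code:

sentence = ["the", "cat", "is", "an", "animal"]
window = 2

# Skip-gram: (center word -> each context word)
skipgram_pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            skipgram_pairs.append((center, sentence[j]))

# CBOW: (all context words -> center word)
cbow_pairs = []
for i, center in enumerate(sentence):
    context = [sentence[j] for j in range(max(0, i - window), min(len(sentence), i + window + 1)) if j != i]
    cbow_pairs.append((context, center))

print(skipgram_pairs[:4])  # [('the', 'cat'), ('the', 'is'), ('cat', 'the'), ('cat', 'is')]
print(cbow_pairs[1])       # (['the', 'is', 'an'], 'cat')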

GloVe or Global Vectors for Word Representation is an algorithm used to create word embeddings. This algorithm was created by Stanford University in 2014.

Word2Vec ignores the fact that some context words occur more frequently than others, and it only takes into account the local context of each word, so it does not capture the global context.

This algorithm uses a co-occurrence matrix to create the word embeddings. This co-occurrence matrix is a matrix that contains the number of times each word appears next to each of the other words in the vocabulary.
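A minimal sketch of how such a co-occurrence matrix could be built over a tiny made-up corpus (GloVe then factorizes this matrix with a weighted least-squares objective, which is not shown here):

Code:

from collections import defaultdict

corpus = [["the", "cat", "is", "an", "animal"],
          ["the", "dog", "is", "an", "animal"]]
window = 2

# cooc[w1][w2] = number of times w2 appears within `window` words of w1
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, w1 in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[w1][sentence[j]] += 1

print(dict(cooc["cat"]))  # {'the': 1, 'is': 1, 'an': 1}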

FastText is an algorithm that is used to create word embeddings. This algorithm was created by Facebook in 2016.

One of the main disadvantages of Word2Vec and GloVe is that they cannot encode unknown or out-of-vocabulary words.

So, to deal with this problem, Facebook proposed the FastText model. It is an extension of Word2Vec and follows the same Skip-gram and CBOW models. But unlike Word2Vec, which feeds whole words into the neural network, FastText first splits words into several subwords (or n-grams) and then feeds them to the neural network.

For example, if the value of n is 3 and the word is manzana (apple), its tri-grams will be [<ma, man, anz, nza, zan, ana, na>] and its word embedding will be the sum of the vector representations of these tri-grams. Here, the hyperparameters min_n and max_n are both set to 3, and the characters < and > mark the beginning and end of the word.

Therefore, using this methodology, unknown words can be represented in vector form, since there is a high probability that their n-grams are also present in other words.

This algorithm is an improvement over Word2Vec, since in addition to taking into account the words surrounding a word, it also takes into account the n-grams of the word itself. For example, for the word gato (cat) and n = 2, it also takes into account the n-grams ga, at and to, as shown in the sketch below.
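A minimal sketch of how these character n-grams can be generated, using < and > as the word boundary markers as FastText does:

Code:

def char_ngrams(word, n):
    # Add the boundary markers FastText uses
    word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("manzana", 3))  # ['<ma', 'man', 'anz', 'nza', 'zan', 'ana', 'na>']
print(char_ngrams("gato", 2))     # ['<g', 'ga', 'at', 'to', 'o>']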

Word embedding techniques give decent results, but the problem is that the approach is not precise enough. They do not take into account the order in which the words appear, which leads to a loss of syntactic and semantic understanding of the sentence.

For example, You go there to teach, not to play and You go there to play, not to teach will have the same representation in vector space, but they do not mean the same thing.

In addition, the word embedding model cannot give satisfactory results on a large amount of text data, since the same word may have a different meaning in a different sentence depending on the context of the sentence.

For example, in I am going to sit on the bank and I am going to do business in the bank, the word bank has a different meaning in each sentence.

Therefore, we require a type of representation that can retain the contextual meaning of the word present in a sentence.

Sentence embeddings are similar to word embeddings, but instead of words, they encode the whole sentence into a vector representation.

A simple way to obtain a sentence embedding is to average the word embeddings of all the words present in the sentence, but this is not accurate enough.
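A minimal sketch of that averaging approach, assuming a gensim Word2Vec model like the one we train later in this post:

Code:

import numpy as np

def sentence_embedding(words, w2v_model):
    # Average the word vectors of the words that are in the vocabulary
    vectors = [w2v_model.wv[w] for w in words if w in w2v_model.wv]
    if not vectors:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vectors, axis=0)

# Hypothetical usage with the Word2Vec model trained later in this post:
# emb = sentence_embedding(["gato", "animal"], model)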

Some of the most advanced models for sentence embedding are ELMo, InferSent and Sentence-BERT.

ELMo or Embeddings from Language Models is a sentence embedding model that was created by the Allen Institute for AI in 2018. It uses a deep bidirectional LSTM network to produce the vector representation. ELMo can represent unknown or out-of-vocabulary words in vector form since it is character-based.

InferSent is a sentence embedding model that was created by Facebook in 2017. It uses a bidirectional deep LSTM network to produce vector representation. InferSent can represent unknown or out-of-vocabulary words in vector form as it is character-based. Sentences are encoded in a 4096-dimensional vector representation.

The model is trained on the Stanford Natural Language Inference (SNLI) dataset, which contains about 500K sentence pairs written and labeled by humans.

Sentence-BERT is a sentence embedding model created at the Technical University of Darmstadt in 2019. It uses siamese and triplet BERT networks, rather than LSTMs, to produce the vector representation, and since it tokenizes text into subwords it can also represent unknown or out-of-vocabulary words. Sentences are encoded in a 768-dimensional vector representation.

The state-of-the-art NLP model BERT is excellent at Semantic Textual Similarity tasks, but the problem is that it would take a very long time on a large corpus (around 65 hours to compare 10,000 sentences), since both sentences have to be fed into the network together, which increases the computation enormously.

Therefore, Sentence-BERT is a modification of the BERT model.
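As a hedged example, this is how Sentence-BERT models are typically used through the sentence-transformers library (the checkpoint name below is just one commonly available example, not something used elsewhere in this post):

Code:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model_sbert = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

sentences = ["I am going to sit on the bank",
             "I am going to do business in the bank"]
embeddings = model_sbert.encode(sentences)

print(embeddings.shape)  # (2, 384) for this particular checkpoint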

Now let’s train our own embeddings model. To download the dataset we are going to use, the datasets library from Hugging Face must be installed:

pip install datasets

To train the embeddings model we are going to use the gensim library. To install it with conda we use

conda install -c conda-forge gensim

And to install it with pip we use

pip install gensim

To clean the dataset we have downloaded, we are going to use regular expressions (the re module, which ships with Python) and nltk, a natural language processing library. To install nltk with conda we use

conda install -c anaconda nltk

And to install it with pip we use

pip install nltk

Now that we have everything installed we can import the libraries we are going to use:

Code:

from gensim.models import Word2Vec

from gensim.parsing.preprocessing import strip_punctuation, strip_numeric, strip_short

import re

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

We are going to download a dataset of texts from wikipedia in Spanish, for this we execute the following:

Code:

from datasets import load_dataset

dataset_corpus = load_dataset('large_spanish_corpus', name='all_wikis')

Let’s see what it looks like

Code:

dataset_corpus

Output:

DatasetDict({

train: Dataset({

features: ['text'],

num_rows: 28109484

})

})

As we can see, the dataset has more than 28 million texts. Let’s take a look at some of them:

Code:

dataset_corpus['train']['text'][0:10]

Output:

['¡Bienvenidos!',

'Ir a los contenidos»',

'= Contenidos =',

'',

'Portada',

'Tercera Lengua más hablada en el mundo.',

'La segunda en número de habitantes en el mundo occidental.',

'La de mayor proyección y crecimiento día a día.',

'El español es, hoy en día, nombrado en cada vez más contextos, tomando realce internacional como lengua de cultura y civilización siempre de mayor envergadura.',

'Ejemplo de ello es que la comunidad minoritaria más hablada en los Estados Unidos es precisamente la que habla idioma español.']

As there are many examples we will create a subset of 10 million examples to work faster:

Code:

subset = dataset_corpus['train'].select(range(10000000))

Now we download the stopwords from nltk, which are words that do not provide information and that we are going to eliminate from the texts.

Code:

import nltk

nltk.download('stopwords')

Output:

True

Now we download nltk’s punkt tokenizer, which word_tokenize needs to split the texts into tokens.

Code:

nltk.download('punkt')

Output:

True

We create a function to clean the data, this function is going to:

  • Change the text to lower case
  • Remove urls
  • Remove mentions of social networks such as @twitter or #hashtag.
  • Eliminate punctuation marks
  • Eliminate numbers
  • Eliminate short words
  • Eliminate stopwords

As we are using a Hugging Face dataset, the texts come in dict format, so we return a dictionary.

Code:

def clean_text(sentence_batch):
    # Extract the text from the input batch
    text_list = sentence_batch['text']
    cleaned_text_list = []
    for text in text_list:
        # Convert the text to lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        # Remove @ mentions and '#' hashtags from social networks
        text = re.sub(r'\@\w+|\#\w+', '', text)
        # Remove punctuation marks
        text = strip_punctuation(text)
        # Remove numbers
        text = strip_numeric(text)
        # Remove short words
        text = strip_short(text, minsize=2)
        # Remove common words (stop words)
        stop_words = set(stopwords.words('spanish'))
        word_tokens = word_tokenize(text)
        filtered_text = [word for word in word_tokens if word not in stop_words]
        cleaned_text_list.append(filtered_text)
    # Return the cleaned text
    return {'text': cleaned_text_list}

We apply the function to the data

Code:

sentences_corpus = subset.map(clean_text, batched=True)

Output:

Map: 0%| | 0/10000000 [00:00<?, ? examples/s]

Let’s save the filtered dataset in a file to avoid having to run the cleaning process again.

Code:

sentences_corpus.save_to_disk("sentences_corpus")

Output:

Saving the dataset (0/4 shards): 0%| | 0/15000000 [00:00<?, ? examples/s]

To load it we can do

Code:

from datasets import load_from_disk

sentences_corpus = load_from_disk('sentences_corpus')

What we have now is a list of lists, where each inner list is a tokenized sentence with its stopwords removed. That is, we have a list of sentences, and each sentence is a list of words. Let’s see what it looks like:

Code:

for i in range(10):
    print(f'La frase "{subset["text"][i]}" se convierte en la lista de palabras "{sentences_corpus["text"][i]}"')

Output:

La frase "¡Bienvenidos!" se convierte en la lista de palabras "['¡bienvenidos']"

La frase "Ir a los contenidos»" se convierte en la lista de palabras "['ir', 'contenidos', '»']"

La frase "= Contenidos =" se convierte en la lista de palabras "['contenidos']"

La frase "" se convierte en la lista de palabras "[]"

La frase "Portada" se convierte en la lista de palabras "['portada']"

La frase "Tercera Lengua más hablada en el mundo." se convierte en la lista de palabras "['tercera', 'lengua', 'hablada', 'mundo']"

La frase "La segunda en número de habitantes en el mundo occidental." se convierte en la lista de palabras "['segunda', 'número', 'habitantes', 'mundo', 'occidental']"

La frase "La de mayor proyección y crecimiento día a día." se convierte en la lista de palabras "['mayor', 'proyección', 'crecimiento', 'día', 'día']"

La frase "El español es, hoy en día, nombrado en cada vez más contextos, tomando realce internacional como lengua de cultura y civilización siempre de mayor envergadura." se convierte en la lista de palabras "['español', 'hoy', 'día', 'nombrado', 'cada', 'vez', 'contextos', 'tomando', 'realce', 'internacional', 'lengua', 'cultura', 'civilización', 'siempre', 'mayor', 'envergadura']"

La frase "Ejemplo de ello es que la comunidad minoritaria más hablada en los Estados Unidos es precisamente la que habla idioma español." se convierte en la lista de palabras "['ejemplo', 'ello', 'comunidad', 'minoritaria', 'hablada', 'unidos', 'precisamente', 'habla', 'idioma', 'español']"

We are going to train an embeddings model that will convert words into vectors. For this we are going to use the gensim library and its Word2Vec model.

Code:

dataset = sentences_corpus['text']

dim_embedding = 100
window_size = 5    # 5 words to the left and 5 words to the right
min_count = 5      # Ignore words with a frequency lower than 5
workers = 4        # Number of worker threads
sg = 1             # 0 for CBOW, 1 for Skip-gram

model = Word2Vec(dataset, vector_size=dim_embedding, window=window_size, min_count=min_count, workers=workers, sg=sg)

This model was trained on the CPU, since gensim does not offer GPU training, and even so it took X minutes to train on my computer. Although the embedding size we chose is only 100 (as opposed to the 1536 dimensions of the OpenAI embeddings), that is not a very long time, considering that the dataset has 10 million sentences.

Large language models are trained with datasets of billions of sentences, so it is normal that training an embeddings model with a dataset of 10 million sentences takes a few minutes.

Once the model is trained we save it in a file for future use.

Code:

model.save('word2vec.model')

If we want to load it in the future, we can do so with

Code:

model = Word2Vec.load('word2vec.model')

Let’s see the most similar words of some words

Code:

model.wv.most_similar('perro', topn=10)

Output:

[('gato', 0.7948548197746277),

('perros', 0.77247554063797),

('cachorro', 0.7638891339302063),

('hámster', 0.7540281414985657),

('caniche', 0.7514827251434326),

('bobtail', 0.7492328882217407),

('mastín', 0.7491254210472107),

('lobo', 0.7312178611755371),

('semental', 0.7292628288269043),

('sabueso', 0.7290207147598267)]

Code:

model.wv.most_similar('gato', topn=10)

Output:

[('conejo', 0.8148329854011536),

('zorro', 0.8109457492828369),

('perro', 0.7948548793792725),

('lobo', 0.7878773808479309),

('ardilla', 0.7860757112503052),

('mapache', 0.7817519307136536),

('huiña', 0.766639232635498),

('oso', 0.7656188011169434),

('mono', 0.7633568644523621),

('camaleón', 0.7623056769371033)]

Now let’s see the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman to the word king.

Code:

embedding_hombre = model.wv['hombre']

embedding_mujer = model.wv['mujer']

embedding_rey = model.wv['rey']

embedding_reina = model.wv['reina']

Code:

embedding = embedding_rey - embedding_hombre + embedding_mujer

Code:

from torch.nn.functional import cosine_similarity

embedding = torch.tensor(embedding).unsqueeze(0)

embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)

similarity = cosine_similarity(embedding, embedding_reina, dim=1)

similarity

Output:

tensor([0.8156])

As we can see, there is a lot of similarity

We are going to visualize the embeddings. To do this, we first obtain the vectors and the words from the model.

Code:

embeddings = model.wv.vectors

words = list(model.wv.index_to_key)

As the dimension of the embeddings is 100, to visualize them in 2 or 3 dimensions we have to reduce their dimensionality. For this we will use PCA (faster) or t-SNE (more precise) from sklearn.

Code:

from sklearn.decomposition import PCA

dimensions = 2
pca = PCA(n_components=dimensions)
reduced_embeddings_PCA = pca.fit_transform(embeddings)

Code:

from sklearn.manifold import TSNE

dimensions = 2
tsne = TSNE(n_components=dimensions, verbose=1, perplexity=40, n_iter=300)
reduced_embeddings_tsne = tsne.fit_transform(embeddings)

Output:

[t-SNE] Computing 121 nearest neighbors...

[t-SNE] Indexed 493923 samples in 0.013s...

[t-SNE] Computed neighbors for 493923 samples in 377.143s...

[t-SNE] Computed conditional probabilities for sample 1000 / 493923

[t-SNE] Computed conditional probabilities for sample 2000 / 493923

[t-SNE] Computed conditional probabilities for sample 3000 / 493923

[t-SNE] Computed conditional probabilities for sample 4000 / 493923

[t-SNE] Computed conditional probabilities for sample 5000 / 493923

[t-SNE] Computed conditional probabilities for sample 6000 / 493923

[t-SNE] Computed conditional probabilities for sample 7000 / 493923

[t-SNE] Computed conditional probabilities for sample 8000 / 493923

[t-SNE] Computed conditional probabilities for sample 9000 / 493923

[t-SNE] Computed conditional probabilities for sample 10000 / 493923

[t-SNE] Computed conditional probabilities for sample 11000 / 493923

[t-SNE] Computed conditional probabilities for sample 12000 / 493923

[t-SNE] Computed conditional probabilities for sample 13000 / 493923

[t-SNE] Computed conditional probabilities for sample 14000 / 493923

[t-SNE] Computed conditional probabilities for sample 15000 / 493923

[t-SNE] Computed conditional probabilities for sample 16000 / 493923

[t-SNE] Computed conditional probabilities for sample 17000 / 493923

[t-SNE] Computed conditional probabilities for sample 18000 / 493923

[t-SNE] Computed conditional probabilities for sample 19000 / 493923

[t-SNE] Computed conditional probabilities for sample 20000 / 493923

[t-SNE] Computed conditional probabilities for sample 21000 / 493923

[t-SNE] Computed conditional probabilities for sample 22000 / 493923

...

[t-SNE] Computed conditional probabilities for sample 493923 / 493923

[t-SNE] Mean sigma: 0.275311

[t-SNE] KL divergence after 250 iterations with early exaggeration: 117.413788

[t-SNE] KL divergence after 300 iterations: 5.774648

Now we visualize them in 2 dimensions with matplotlib. Let’s visualize the dimensionality reduction we have done with PCA and with TSNE.

Code:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_PCA[i, 0], reduced_embeddings_PCA[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.title('Embeddings (PCA)')
plt.show()

Output:
Code:

plt.figure(figsize=(10, 10))
for i, word in enumerate(words[:200]):  # Limit to the first 200 words
    plt.scatter(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1])
    plt.annotate(word, xy=(reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()

Output:

To use pre-trained embeddings models we will use the transformers library from huggingface. To install it with conda we use

conda install -c conda-forge transformers

And to install it with pip we use

pip install transformers

With the feature-extraction task of huggingface we can use pre-trained models to obtain the embeddings of the words. To do this we first import the necessary library

Code:

from transformers import pipeline

Let’s get the embeddings from BERT.

Code:

checkpoint = "bert-base-uncased"

feature_extractor = pipeline("feature-extraction",framework="pt",model=checkpoint)

Let’s see the embeddings of the word rey (king).

Code:

embedding = feature_extractor("rey", return_tensors="pt").squeeze(0)

embedding.shape

Output:

torch.Size([3, 768])

As we can see, we obtain vectors of 768 dimensions, that is, BERT’s embeddings have 768 dimensions. On the other hand, we get 3 embedding vectors instead of 1: this is because BERT adds a special token at the beginning ([CLS]) and another at the end ([SEP]) of the sequence, so we are only interested in the middle vector.
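As a quick check (a small sketch using the tokenizer of the same checkpoint), we can see that those two extra vectors correspond to the special [CLS] and [SEP] tokens:

Code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ids = tokenizer("rey")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # expected: ['[CLS]', 'rey', '[SEP]']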

Let us redo the example in which we check the similarity of the word queen with the result of subtracting the word man from the word king and adding the word woman to the word king.

Code:

embedding_hombre = feature_extractor("man", return_tensors="pt").squeeze(0)[1]

embedding_mujer = feature_extractor("woman", return_tensors="pt").squeeze(0)[1]

embedding_rey = feature_extractor("king", return_tensors="pt").squeeze(0)[1]

embedding_reina = feature_extractor("queen", return_tensors="pt").squeeze(0)[1]

Code:

embedding = embedding_rey - embedding_hombre + embedding_mujer

Let’s see the similarity

Code:

import torch

from torch.nn.functional import cosine_similarity

embedding = torch.tensor(embedding).unsqueeze(0)

embedding_reina = torch.tensor(embedding_reina).unsqueeze(0)

similarity = cosine_similarity(embedding, embedding_reina, dim=1)

similarity.item()

Output:

0.742547333240509

Using the embeddings of BERT we also get a result very close to 1
