RAG: Foundations and Naive RAG

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

In this post, we will explore what the RAG (Retrieval Augmented Generation) technique entails and how it can be implemented in a language model. Additionally, we will do this using the most basic RAG architecture, called naive RAG.

To make it free, instead of using an OpenAI account (as you'll see in most tutorials), we're going to use the Hugging Face Inference API, which has a free tier of 1,000 requests per day, more than enough for this post.

Setting up the Hugging Face Inference API

To use the Hugging Face Inference API, the first thing you need is a Hugging Face account. Once you have one, go to Access tokens in your profile settings and generate a new token.

We need to give it a name. In my case, I'm going to call it rag-fundamentals and enable the Make calls to serverless Inference API permission. This will create a token that we need to copy.

To manage the token, we will create a file called .env in the directory we are working in, and paste the copied token into it like this:

RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN="hf_...."
      

To read the token from that file, we need to have dotenv installed, which we install with

pip install python-dotenv
      

And we run the following

import os
import dotenv
dotenv.load_dotenv()
RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN = os.getenv("RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN")

Now that we have a token, we create a client. For this, we need to have the huggingface_hub library installed. We install it using conda or pip.

conda install -c conda-forge huggingface_hub
      

or

pip install --upgrade huggingface_hub
      

Now we have to choose which model to use. You can see the available models on the Supported models page of the Hugging Face API Inference documentation.

At the time of writing this post, the best available model is Qwen2.5-72B-Instruct, so that is the one we will use.

MODEL = "Qwen/Qwen2.5-72B-Instruct"

Now we can create the client

from huggingface_hub import InferenceClient
client = InferenceClient(api_key=RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN, model=MODEL)
client

<InferenceClient(model='Qwen/Qwen2.5-72B-Instruct', timeout=None)>

We run a test to see if it works

message = [
    { "role": "user", "content": "Hola, qué tal?" }
]

stream = client.chat.completions.create(
    messages=message,
    temperature=0.5,
    max_tokens=1024,
    top_p=0.7,
    stream=False
)

response = stream.choices[0].message.content
print(response)

¡Hola! Estoy bien, gracias por preguntar. ¿Cómo estás tú? ¿En qué puedo ayudarte hoy?

What is RAG?

RAG stands for Retrieval Augmented Generation. It is a technique created to extract information from documents. Although LLMs can be very powerful and have a lot of knowledge, they will never be able to answer questions about private documents, such as reports from your company, internal documentation, etc. That is why RAG was created: to be able to use these LLMs on that private documentation.

What is RAG?

The idea is that a user asks a question about that private documentation, the system retrieves the part of the documentation where the answer is found, the question and the relevant part of the documentation are passed to an LLM, and the LLM generates the answer for the user.
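To make this flow concrete before building the real thing, here is a minimal toy sketch of the three steps. The chunks, the question, and the word-overlap "retrieval" are made up purely for illustration; the rest of the post replaces them with real documents, embeddings, and a vector database.

# Toy sketch of the RAG flow: retrieve -> augment -> generate
toy_chunks = [
    "The company was founded in 2010 and has 50 employees.",
    "The 2023 internal report shows a 12% increase in revenue.",
    "The cafeteria menu changes every Monday.",
]

question = "What does the 2023 internal report say about revenue?"

def toy_retrieve(question, chunks):
    # Deliberately naive retrieval: score each chunk by how many words it shares with the question
    question_words = set(question.lower().split())
    scores = [len(question_words & set(chunk.lower().split())) for chunk in chunks]
    return chunks[scores.index(max(scores))]

# Retrieve: find the most relevant chunk
relevant_chunk = toy_retrieve(question, toy_chunks)

# Augment: combine the question and the retrieved chunk into a single prompt
prompt = f"Answer the question using this context:\n\n{relevant_chunk}\n\nQuestion: {question}"
print(prompt)

# Generate: this prompt would then be sent to the LLM (like the client created above),
# which is exactly what we will do at the end of the post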

How is information stored?

It is well known (and if you didn't know, I'll tell you now) that LLMs have a limit on the amount of information that can be passed to them; this is called the context window. It is a consequence of the internal architecture of LLMs, which is not relevant right now. The important point is that you cannot simply pass a document and a question to an LLM, because it is likely that the LLM will not be able to process all that information.

When more information is passed than the context window allows, what typically happens is that the LLM does not pay attention to the end of the input. Imagine asking the LLM about something in your document, but that information is at the end of the document and the LLM never reads it.

Therefore, what is done is to divide the documentation into blocks called chunks. The documentation is stored as a collection of chunks, which are pieces of that documentation. This way, when the user asks a question, the chunk containing the answer to that question is passed to the LLM.

In addition to dividing the documentation into chunks, these are converted into embeddings, which are numerical representations of the chunks. This is done because LLMs actually do not understand text, but rather numbers, and the chunks are converted into numbers so that the LLM can understand them. If you want to learn more about embeddings, you can read my post on transformers where I explain how transformers work, which is the architecture underlying LLMs. You can also read my post on ChromaDB where I explain how embeddings are stored in a vector database. Additionally, it would be interesting for you to read my post about the HuggingFace Tokenizers library, which explains how text is tokenized, which is the step prior to generating embeddings.

RAG - embeddings

How do you get the correct chunk?

We have said that the documentation is divided into chunks and that the chunk containing the answer to the user's question is passed to the LLM. But how do we know which chunk contains the answer? To determine this, the user's question is converted into an embedding, and the similarity between the question's embedding and the embeddings of the chunks is calculated. The chunk with the highest similarity is the one passed to the LLM.
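This similarity is typically computed as the cosine similarity between the vectors. Here is a minimal sketch with made-up toy vectors, just to show the idea (the real embeddings we will use later have 384 dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings", values chosen only for illustration
question_embedding = np.array([0.9, 0.1, 0.0, 0.2])
chunk_embeddings = [
    np.array([0.8, 0.2, 0.1, 0.3]),  # chunk 0, semantically close to the question
    np.array([0.0, 0.9, 0.7, 0.1]),  # chunk 1, unrelated
]

similarities = [cosine_similarity(question_embedding, chunk) for chunk in chunk_embeddings]
most_relevant = int(np.argmax(similarities))
print(similarities, "-> most relevant chunk:", most_relevant)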

Let's revisit what RAG is

On one hand, we have the retrieval, which is obtaining the correct chunk from the documentation. On the other hand, we have the augmented, which involves passing the user's question and the chunk to the LLM. Finally, we have the generation, which is obtaining the response generated by the LLM.

Vector database

We have seen that the documentation is divided into chunks and stored in a vector database, so we need to use one. For this post, I will use ChromaDB, which is a widely used vector database, and I also have a post where I explain how it works.

So first we need to install the ChromaDB library, which we can do with conda or with pip

conda install conda-forge::chromadb
      

or

pip install chromadb
      

Embedding Function

As we have said, everything will be based on embeddings. So, the first thing we do is create a function to get embeddings from a text. We are going to use the model sentence-transformers/all-MiniLM-L6-v2.

import chromadb.utils.embedding_functions as embedding_functions

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN,
    model_name=EMBEDDING_MODEL
)

We test the embedding function

embedding = huggingface_ef(["Hello, how are you?",])
embedding[0].shape

(384,)

We obtain an embedding of dimension 384. Although the purpose of this post is not to explain embeddings, in short, our embedding function has mapped the phrase Hello, how are you? to a point in a 384-dimensional space.

ChromaDB client

Now that we have our embedding function, we can create a ChromaDB client.

First we create a folder where the vector database will be stored

from pathlib import Path
chroma_path = Path("chromadb_persisten_storage")
chroma_path.mkdir(exist_ok=True)

Now we create the client

from chromadb import PersistentClient
chroma_client = PersistentClient(path=str(chroma_path))

Collection

When we have the ChromaDB client, the next thing we need to do is create a collection. A collection is a set of vectors, in our case the chunks of the documentation.

We create it by indicating the embedding function we are going to use

collection_name = "document_qa_collection"
collection = chroma_client.get_or_create_collection(name=collection_name, embedding_function=huggingface_ef)

Document Loading

Now that we have created the vector database, we need to split the documentation into chunks and save them in the vector database.

Document Loading Function

First we create a function to load all the .txt documents from a directory

def load_one_document_from_directory(directory, file):
    with open(os.path.join(directory, file), "r") as f:
        return {"id": file, "text": f.read()}

def load_documents_from_directory(directory):
    documents = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            documents.append(load_one_document_from_directory(directory, file))
    return documents

Function to split the documentation into chunks

Once we have the documents, we divide them into chunks

def split_text(text, chunk_size=1000, chunk_overlap=20):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - chunk_overlap
    return chunks
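As a quick sanity check, this is what the function does on a toy string, with chunk_size and chunk_overlap reduced so the overlap between consecutive chunks is easy to see (these values are chosen just for illustration; we will keep the defaults for the real documents):

sample_text = "abcdefghijklmnopqrstuvwxyz"

for chunk in split_text(sample_text, chunk_size=10, chunk_overlap=3):
    print(chunk)

# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz

Each chunk starts with the last 3 characters of the previous one, which helps avoid cutting a sentence (or an idea) exactly at a chunk boundary.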

Function to generate embeddings for a chunk

Now that we have the chunks, we generate the embeddings for each of them.

We will see why later, but we are going to generate the embeddings locally rather than through the Hugging Face API. For this, we need to have PyTorch and sentence-transformers installed, which we do with

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = SentenceTransformer(EMBEDDING_MODEL).to(device)

def get_embeddings(text):
    try:
        embedding = embedding_model.encode(text, device=device)
        return embedding
    except Exception as e:
        print(f"Error: {e}")
        exit(1)

Let's test this embedding function locally now

text = "Hello, how are you?"
embedding = get_embeddings(text)
embedding.shape

(384,)

We see that we obtain an embedding of the same dimension as when we did it with the Hugging Face API.
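As an optional check, we can also verify that the API embedding function and the local model produce essentially the same vector. This is just a sketch, assuming the huggingface_ef and get_embeddings functions defined above:

import numpy as np

text = "Hello, how are you?"

api_embedding = np.array(huggingface_ef([text])[0])
local_embedding = get_embeddings(text)

# A cosine similarity close to 1.0 means both functions return (nearly) the same embedding
cosine = np.dot(api_embedding, local_embedding) / (np.linalg.norm(api_embedding) * np.linalg.norm(local_embedding))
print(f"Cosine similarity between API and local embeddings: {cosine:.4f}")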

The model sentence-transformers/all-MiniLM-L6-v2 has only 22M parameters, so you will be able to run it on any GPU. Even if you don't have a GPU, you will be able to run it on a CPU.

The LLM we will use to generate the responses, Qwen2.5-72B-Instruct, is, as its name suggests, a 72B-parameter model, so it cannot run on just any GPU, and running it on a CPU is out of the question because of how slow it would be. Therefore, we will use that LLM through the API, but the embeddings we can generate locally without any problem.

Documents We Will Be Testing With

To run all these tests, I downloaded the aws-case-studies-and-blogs dataset and placed it in the rag_txt_dataset folder. The following commands show how to download and extract it.

We create the folder where we are going to download the documents

!mkdir rag_txt_dataset

We download the .zip with the documents

!curl -L -o ./rag_txt_dataset/archive.zip https://www.kaggle.com/api/v1/datasets/download/harshsinghal/aws-case-studies-and-blogs

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 1430k 100 1430k 0 0 1082k 0 0:00:01 0:00:01 --:--:-- 2440k

We decompress the .zip

!unzip rag_txt_dataset/archive.zip -d rag_txt_dataset

Archive: rag_txt_dataset/archive.zip
inflating: rag_txt_dataset/23andMe Case Study _ Life Sciences _ AWS.txt
inflating: rag_txt_dataset/36 new or updated datasets on the Registry of Open Data_ AI analysis-ready datasets and more _ AWS Public Sector Blog.txt
inflating: rag_txt_dataset/54gene _ Case Study _ AWS.txt
inflating: rag_txt_dataset/6sense Case Study.txt
inflating: rag_txt_dataset/ADP Developed an Innovative and Secure Digital Wallet in a Few Months Using AWS Services _ Case Study _ AWS.txt
inflating: rag_txt_dataset/AEON Case Study.txt
inflating: rag_txt_dataset/ALTBalaji _ Amazon Web Services.txt
inflating: rag_txt_dataset/AWS Case Study - Ineos Team UK.txt
inflating: rag_txt_dataset/AWS Case Study - StreamAMG.txt
inflating: rag_txt_dataset/AWS Case Study_ Creditsafe.txt
inflating: rag_txt_dataset/AWS Case Study_ Immowelt.txt
inflating: rag_txt_dataset/AWS Customer Case Study _ Kepler Provides Effective Monitoring of Elderly Care Home Residents Using AWS _ AWS.txt
inflating: rag_txt_dataset/AWS announces 21 startups selected for the AWS generative AI accelerator _ AWS Startups Blog.txt
inflating: rag_txt_dataset/AWS releases smart meter data analytics _ AWS for Industries.txt
inflating: rag_txt_dataset/Accelerate Time to Business Value Using Amazon SageMaker at Scale with NatWest Group _ Case Study _ AWS.txt
inflating: rag_txt_dataset/Accelerate Your Analytics Journey on AWS with DXC Analytics and AI Platform _ AWS Partner Network (APN) Blog.txt
...
inflating: rag_txt_dataset/Zomato Saves Big by Using AWS Graviton2 to Power Data-Driven Business Insights.txt
inflating: rag_txt_dataset/Zoox Case Study _ Automotive _ AWS.txt
inflating: rag_txt_dataset/e-banner Streamlines Its Contact Center Operations and Facilitates a Fully Remote Workforce with Amazon Connect _ e-banner Case Study _ AWS.txt
inflating: rag_txt_dataset/iptiQ Case Study.txt
inflating: rag_txt_dataset/mod.io Provides Low Latency Gamer Experience Globally on AWS _ Case Study _ AWS.txt
inflating: rag_txt_dataset/myposter Case Study.txt

We delete the .zip

!rm rag_txt_dataset/archive.zip

Let's see what we have got

!ls rag_txt_dataset

'23andMe Case Study _ Life Sciences _ AWS.txt'
'36 new or updated datasets on the Registry of Open Data_ AI analysis-ready datasets and more _ AWS Public Sector Blog.txt'
'54gene _ Case Study _ AWS.txt'
'6sense Case Study.txt'
'Accelerate Time to Business Value Using Amazon SageMaker at Scale with NatWest Group _ Case Study _ AWS.txt'
'Accelerate Your Analytics Journey on AWS with DXC Analytics and AI Platform _ AWS Partner Network (APN) Blog.txt'
'Accelerating customer onboarding using Amazon Connect _ NCS Case Study _ AWS.txt'
'Accelerating Migration at Scale Using AWS Application Migration Service with 3M Company _ Case Study _ AWS.txt'
'Accelerating Time to Market Using AWS and AWS Partner AccelByte _ Omeda Studios Case Study _ AWS.txt'
'Achieving Burstable Scalability and Consistent Uptime Using AWS Lambda with TiVo _ Case Study _ AWS.txt'
'Acrobits Uses Amazon Chime SDK to Easily Create Video Conferencing Application Boosting Collaboration for Global Users _ Acrobits Case Study _ AWS.txt'
'Actuate AI Case study.txt'
'ADP Developed an Innovative and Secure Digital Wallet in a Few Months Using AWS Services _ Case Study _ AWS.txt'
'Adzuna doubles its email open rates using Amazon SES _ Adzuna Case Study _ AWS.txt'
'AEON Case Study.txt'
'ALTBalaji _ Amazon Web Services.txt'
'Amanotes Stays on Beat by Delivering Simple Music Games to Millions Worldwide on AWS.txt'
'Amazon OpenSearch Services vector database capabilities explained _ AWS Big Data Blog.txt'
'Anghami Case Study.txt'
'Announcing enhanced table extractions with Amazon Textract _ AWS Machine Learning Blog.txt'
...
'What Will Generative AI Mean for Your Business_ _ AWS Cloud Enterprise Strategy Blog.txt'
'Which Recurring Business Processes Can Small and Medium Businesses Automate_ _ AWS Smart Business Blog.txt'
Windsor.txt
'Wireless Car Case Study _ AWS IoT Core _ AWS.txt'
'Yamato Logistics (HK) case study.txt'
'Zomato Saves Big by Using AWS Graviton2 to Power Data-Driven Business Insights.txt'
'Zoox Case Study _ Automotive _ AWS.txt'

Time to create the chunks!

We load the documents with the function we created earlier

dataset_path = "rag_txt_dataset"
documents = load_documents_from_directory(dataset_path)

We check that it worked

for document in documents[0:10]:
    print(document["id"])

Run Jobs at Scale While Optimizing for Cost Using Amazon EC2 Spot Instances with ActionIQ _ ActionIQ Case Study _ AWS.txt
Recommend and dynamically filter items based on user context in Amazon Personalize _ AWS Machine Learning Blog.txt
Windsor.txt
Bank of Montreal Case Study _ AWS.txt
The Mill Adventure Case Study.txt
Optimize software development with Amazon CodeWhisperer _ AWS DevOps Blog.txt
Announcing enhanced table extractions with Amazon Textract _ AWS Machine Learning Blog.txt
THREAD _ Life Sciences _ AWS.txt
Deep Pool Optimizes Software Quality Control Using Amazon QuickSight _ Deep Pool Case Study _ AWS.txt
Upstox Saves 1 Million Annually Using Amazon S3 Storage Lens _ Upstox Case Study _ AWS.txt

Now we create the chunks.

chunked_documents = []
for document in documents:
    chunks = split_text(document["text"])
    for i, chunk in enumerate(chunks):
        chunked_documents.append({"id": f"{document['id']}_{i}", "text": chunk})

len(chunked_documents)

3611

As we can see, there are 3611 chunks. Since the daily limit of the Hugging Face API on the free account is 1000 calls, if we tried to create the embeddings for all the chunks through the API we would run out of calls before finishing.

Remember that this embedding model is very small, only 22M parameters, so it can run on almost any computer, faster or slower, but it will run.

Since we are only going to create the embeddings of the chunks once, even if we don't have a very powerful computer and it takes a long time, it will only run once. Later, when we want to ask questions about the documentation, that's when we will generate the embeddings of the prompt using the Hugging Face API and use the LLM with the API. So we will only have to go through the process of generating the embeddings of the chunks once.
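Moreover, since the embeddings will end up in a persistent ChromaDB collection, on later runs we could skip this step entirely when the collection is already populated. A minimal sketch of that guard, assuming the collection and chunked_documents defined above:

# If the persistent collection already holds all the chunks, there is no need to
# generate the embeddings (or upsert them) again on a later run
if collection.count() >= len(chunked_documents):
    print("Collection already populated, skipping embedding generation")
else:
    print(f"{collection.count()} of {len(chunked_documents)} chunks stored, generating embeddings")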

We generate the embeddings of the chunks

One last library to install: since generating the embeddings for the chunks will be slow, we will install tqdm to show a progress bar. We can install it with conda or pip, whichever you prefer.

conda install conda-forge::tqdm
      

or

pip install tqdm
      

We generate the embeddings of the chunks

import tqdm

progress_bar = tqdm.tqdm(chunked_documents)

for chunk in progress_bar:
    embedding = get_embeddings(chunk["text"])
    if embedding is not None:
        chunk["embedding"] = embedding
    else:
        print(f"Error with document {chunk['id']}")

100%|██████████| 3611/3611 [00:16<00:00, 220.75it/s]

We see an example

from random import randint

# randint is inclusive on both ends, so we subtract 1 to avoid going out of range
idx = randint(0, len(chunked_documents) - 1)

print(f"Chunk id: {chunked_documents[idx]['id']},\n\ntext: {chunked_documents[idx]['text']},\n\nembedding shape: {chunked_documents[idx]['embedding'].shape}")

Chunk id: BNS Group Case Study _ Amazon Web Services.txt_0,
text: Reducing Virtual Machines from 40 to 12
The founders of BNS had been contemplating a migration from the company’s on-premises data center to the public cloud and observed a growing demand for cloud-based operations among current and potential BNS customers.
Français
Configures security according to cloud best practices
Clive Pereira, R&D director at BNS Group, explains, “The database that records Praisal’s SMS traffic resides in Praisal’s AWS environment. Praisal can now run complete analytics across its data and gain insights into what’s happening with its SMS traffic, which is a real game-changer for the organization.” 
Español
AWS ISV Accelerate Program
Receiving Strategic, Foundational Support from ISV Specialists
Learn More
The value that AWS places on the ISV stream sealed the deal in our choice of cloud provider.”
日本語
Contact Sales
BNS is an Australian software provider focused on secure enterprise SMS and fax messaging. Its software runs on the Windows platform and is l,
embedding shape: (384,)

Load the chunks into the vector database

Once we have generated the embeddings for all the chunks, we load them into the vector database. We use tqdm again to show a progress bar, because this is also going to be slow.

import tqdm

progress_bar = tqdm.tqdm(chunked_documents)

for chunk in progress_bar:
    collection.upsert(
        ids=[chunk["id"]],
        documents=chunk["text"],
        embeddings=chunk["embedding"],
    )

100%|██████████| 3611/3611 [00:59<00:00, 60.77it/s]

Questions

Now that we have the vector database, we can ask questions about the documentation. For this, we need a function that returns the correct chunks.

Get the correct chunk

Now we need a function that returns the most relevant chunks, so let's create it

def get_top_k_documents(query, k=5):
    results = collection.query(query_texts=query, n_results=k)
    return results

Finally, we create a query.

To generate the query, I randomly selected the document Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt, passed it to an LLM, and asked it to generate a question about the document. The question it generated is

How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?

So we get the most relevant chunks in response to that question

query = "How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?"
top_chunks = get_top_k_documents(query=query, k=5)

Let's see which chunks it has returned

for i in range(len(top_chunks["ids"][0])):
    print(f"Rank {i+1}: {top_chunks['ids'][0][i]}, distance: {top_chunks['distances'][0][i]}")

Rank 1: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_0, distance: 0.29233667254447937
Rank 2: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_5, distance: 0.4007825255393982
Rank 3: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_1, distance: 0.4317566752433777
Rank 4: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_6, distance: 0.43832334876060486
Rank 5: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_4, distance: 0.44625571370124817

As I had mentioned, the document I randomly selected was Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt, and as you can see, the chunks it returned are from that document. That is, out of more than 3000 chunks in the database, it was able to return the most relevant chunks in response to that question. It seems this works!

Generate the response

Since we already have the most relevant chunks, we pass them to the LLM, along with the question, so that it can generate an answer.

def generate_response(query, relevant_chunks, temperature=0.5, max_tokens=1024, top_p=0.7, stream=False):
    context = "\n\n".join([chunk for chunk in relevant_chunks])
    prompt = f"You are an assistant for question-answering. You have to answer the following question:\n\n{query}\n\nAnswer the question with the following information:\n\n{context}"
    message = [
        { "role": "user", "content": prompt }
    ]
    stream = client.chat.completions.create(
        messages=message,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        stream=stream,
    )
    response = stream.choices[0].message.content
    return response

We test the function

response = generate_response(query, top_chunks["documents"][0])
print(response)

Neeva, a cloud-native, ad-free search engine founded in 2019, has leveraged Karpenter and Amazon EC2 Spot Instances to significantly improve its infrastructure management and cost optimization. Here’s how:
### Early Collaboration with Karpenter
In late 2021, Neeva began working closely with the Karpenter team, experimenting with and contributing fixes to an early version of Karpenter. This collaboration allowed Neeva to integrate Karpenter with its Kubernetes dashboard, enabling the company to gather valuable metrics on usage and performance.
### Combining Spot Instances and On-Demand Instances
Neeva runs its jobs on a large scale, which can lead to significant costs. To manage these costs effectively, the company adopted a combination of Amazon EC2 Spot Instances and On-Demand Instances. Spot Instances allow Neeva to bid on unused EC2 capacity, often at a fraction of the On-Demand price, while On-Demand Instances provide the necessary reliability for critical pipelines.
### Flexibility and Instance Diversification
According to Mohit Agarwal, infrastructure engineering lead at Neeva, Karpenter's adoption of best practices for Spot Instances, including flexibility and instance diversification, has been crucial. This approach ensures that Neeva can dynamically adjust its compute resources to meet varying workloads while minimizing costs.
### Improved Scalability and Agility
By using Karpenter to provision infrastructure resources for its Amazon EKS clusters, Neeva has achieved several key benefits:
- **Scalability**: Neeva can scale its compute resources up or down as needed, ensuring that it always has the necessary capacity to handle its workloads.
- **Agility**: The company can iterate quickly and democratize infrastructure changes, reducing the time spent on systems administration by up to 100 hours per week.
### Enhanced Development Cycles
The integration of Karpenter and Spot Instances has also accelerated Neeva's development cycles. The company can now launch new features and improvements more rapidly, which is essential for maintaining a competitive edge in the search engine market.
### Cost Savings and Budget Control
Using Spot Instances, Neeva has been able to stay within its budget while meeting its performance requirements. This cost optimization is critical for a company that prioritizes user-first experiences and has no competing incentives from advertising.
### Future Plans
Neeva is committed to continuing its innovation and expansion. The company plans to launch in new regions and further improve its search engine, all while maintaining cost efficiency. As Mohit Agarwal notes, "The bulk of our compute is or will be managed using Karpenter going forward."
### Conclusion
By leveraging Karpenter and Amazon EC2 Spot Instances, Neeva has not only optimized its infrastructure costs but also enhanced its scalability, agility, and development speed. This strategic approach has positioned Neeva to deliver high-quality, ad-free search experiences to its users while maintaining a strong focus on cost control and innovation.

When I asked the LLM to generate a question about the document, I also asked it to generate the correct answer. This is the answer that the LLM gave me.

Neeva used Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization in several ways:
      
      Simplified Instance Management:
      
      Karpenter: By adopting Karpenter, Neeva simplified the process of provisioning and managing compute resources for its Amazon EKS clusters. Karpenter automatically provisions and de-provisions instances based on the workload, eliminating the need for manual configurations and reducing the complexity of understanding different compute instances.
      Spot Instances: Neeva leveraged Amazon EC2 Spot Instances, which are unused EC2 capacity available at a significant discount (up to 90% cost savings). This allowed the company to control costs while meeting its performance requirements.
      Enhanced Scalability:
      
      Karpenter: Karpenter's ability to dynamically scale resources enabled Neeva to spin up new instances quickly, allowing the company to iterate at a higher velocity and run more experiments in less time.
      Spot Instances: The use of Spot Instances provided flexibility and instance diversification, making it easier for Neeva to scale its compute resources efficiently.
      Improved Productivity:
      
      Karpenter: By democratizing infrastructure changes, Karpenter allowed any engineer to modify Kubernetes configurations, reducing the dependency on specialized expertise. This saved the Neeva team up to 100 hours per week of wait time on systems administration.
      Spot Instances: The ability to quickly provision and de-provision Spot Instances reduced delays in the development pipeline, ensuring that jobs did not get stuck due to a lack of available resources.
      Cost Efficiency:
      
      Karpenter: Karpenter's best practices for Spot Instances, including flexibility and instance diversification, helped Neeva use these instances more effectively, staying within budget.
      Spot Instances: The cost savings from using Spot Instances allowed Neeva to run large-scale jobs, such as indexing, for nearly the same cost but in a fraction of the time. For example, Neeva reduced its indexing jobs from 18 hours to just 3 hours.
      Better Resource Utilization:
      
      Karpenter: Karpenter provided better visibility into compute resource usage, allowing Neeva to track and optimize its resource consumption more closely.
      Spot Instances: The combination of Karpenter and Spot Instances enabled Neeva to run large language models more efficiently, enhancing the search experience for its users.
      In summary, Neeva's adoption of Karpenter and Amazon EC2 Spot Instances significantly improved its infrastructure management, cost optimization, and overall development efficiency, which allowed the company to offer better ad-free search experiences to its users.
      

And this has been the response generated by our RAG

Neeva, a cloud-native, ad-free search engine founded in 2019, has leveraged Karpenter and Amazon EC2 Spot Instances to significantly improve its infrastructure management and cost optimization. Here’s how:
      
      ### Early Collaboration with Karpenter
      In 2021, Neeva began working closely with the Karpenter team, experimenting with and contributing fixes to an early version of Karpenter. This collaboration allowed Neeva to integrate Karpenter with its Kubernetes dashboard, enabling the company to gather valuable metrics on usage and performance.
      
      ### Combining Spot Instances and On-Demand Instances
      Neeva runs its jobs on a large scale, which can lead to significant costs. To manage these costs effectively, the company adopted a combination of Amazon EC2 Spot Instances and On-Demand Instances. Spot Instances allow Neeva to bid on unused EC2 capacity, often at a fraction of the On-Demand price, while On-Demand Instances provide the necessary reliability for critical pipelines.
      
      ### Flexibility and Instance Diversification
      According to Mohit Agarwal, infrastructure engineering lead at Neeva, Karpenter's adoption of best practices for Spot Instances, including flexibility and instance diversification, has been crucial. This approach ensures that Neeva can dynamically adjust its compute resources to meet varying workloads while minimizing costs.
      
      ### Improved Scalability and Agility
      By using Karpenter to provision infrastructure resources for its Amazon EKS clusters, Neeva has achieved several key benefits:
      - **Scalability**: Neeva can scale its compute resources up or down as needed, ensuring that it always has the necessary capacity to handle its workloads.
      - **Agility**: The company can iterate quickly and democratize infrastructure changes, reducing the time spent on systems administration by up to 100 hours per week.
      
      ### Enhanced Development Cycles
      The integration of Karpenter and Spot Instances has also accelerated Neeva's development cycles. The company can now launch new features and improvements more rapidly, which is essential for maintaining a competitive edge in the search engine market.
      
      ### Cost Savings and Budget Control
      Using Spot Instances, Neeva has been able to stay within its budget while meeting its performance requirements. This cost optimization is critical for a company that prioritizes user-first experiences and has no competing incentives from advertising.
      
      ### Future Plans
      Neeva is committed to continuing its innovation and expansion. The company plans to launch in new regions and further improve its search engine, all while maintaining cost efficiency. As Mohit Agarwal notes, "The bulk of our compute is or will be managed using Karpenter going forward."
      
      ### Conclusion
      By leveraging Karpenter and Amazon EC2 Spot Instances, Neeva has not only optimized its infrastructure costs but also enhanced its scalability, agility, and development speed. This strategic approach has positioned Neeva to deliver high-quality, ad-free search experiences to its users while maintaining a strong focus on cost control and innovation.
      

So we can conclude that the RAG has worked correctly!!!
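Putting the pieces together, the whole naive RAG loop can be wrapped in a single convenience function. This is just a sketch that reuses the get_top_k_documents and generate_response functions defined above; the name ask_rag is purely illustrative:

def ask_rag(question, k=5):
    # Retrieval: get the k most relevant chunks from the vector database
    top_chunks = get_top_k_documents(query=question, k=k)
    # Augmentation and generation: pass the question and the chunks to the LLM
    return generate_response(question, top_chunks["documents"][0])

answer = ask_rag("How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?")
print(answer)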

Limits of Naive RAG

As we have said, today we have explained naive RAG, which is the simplest RAG architecture, but it has its limitations.

Naive RAG architecture

The limitations of this architecture are:

Limits in Information Retrieval (retriever)

  • Limited knowledge of context and documentation: When the naive RAG system searches for chunks, it looks for those that have a semantically similar meaning to the prompt, but it is not capable of knowing which ones are the most relevant to the user's question, or which ones have the most up-to-date information, or whose information is more accurate than that of other chunks. For example, if a user asks about the problems of sweeteners in the digestive system, the naive RAG may return documents about sweeteners or about the digestive system, but it is not capable of knowing that the documents about the digestive system are the most relevant to the user's question. Another example is if the user asks about the latest advances in AI, but the naive RAG is not able to know which are the most recent papers in the database.

  • There is no synchronization between the retriever and the generator. As we have seen, they are two independent systems: the retriever searches for the documents most similar to the user's question, and those documents are passed to the generator, which generates a response.

  • Inefficient scaling with large databases. Since retrieval searches for documents with the highest semantic similarity across the entire database, when it becomes very large, we can experience very long search times.

  • Limited adaptation to the user's question. If the user asks a question that involves several documents, meaning there is no single document containing all the information from the user's question, the system will retrieve all those documents and pass them to the generator, which may or may not use them. In a worst-case scenario, it might overlook some relevant documents when generating the response.

Limits in Response Generation (Generator)

  • The model could hallucinate answers even when provided with relevant information.

  • The model could be limited by issues related to hate, discrimination, etc.

To overcome these limitations, more advanced RAG techniques are used, such as:

  • Pre-retrieval: techniques to improve indexing so that information retrieval is more efficient, or techniques such as enhancing the user's query to help the retriever find the most relevant documents.

  • Post-retrieval: here techniques such as document re-ranking are used to reorder the retrieved chunks by relevance before passing them to the generator, as shown in the sketch below.
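As a taste of what post-retrieval looks like, here is a minimal re-ranking sketch using a cross-encoder from sentence-transformers. The model cross-encoder/ms-marco-MiniLM-L-6-v2 is a common choice for this kind of re-ranking, but it is not something used elsewhere in this post:

from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, document) pairs jointly, which is usually more accurate
# than the embedding similarity used by the retriever, but slower, so it is applied
# only to the handful of chunks that were already retrieved
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

retrieved_chunks = top_chunks["documents"][0]  # the chunks returned by get_top_k_documents

scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked_chunks = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), key=lambda pair: pair[0], reverse=True)]

# The generator then receives the re-ordered (and possibly truncated) list of chunks
response = generate_response(query, reranked_chunks[:3])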
