RAG: Foundations and Naive RAG

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

In this post, we will explore what the RAG (Retrieval Augmented Generation) technique entails and how it can be implemented in a language model. Additionally, we will do this using the most basic RAG architecture, called naive RAG.

To make it free, instead of using an OpenAI account (as you'll see in most tutorials), we're going to use the Hugging Face Inference API, which has a free tier of 1,000 requests per day, more than enough for this post.

Setting up the Hugging Face Inference API

To use the Hugging Face Inference API, the first thing you need is a Hugging Face account. Once you have one, go to Access tokens in your profile settings and generate a new token.

We need to give it a name. In my case, I'm going to call it rag-fundamentals and enable the Make calls to serverless Inference API permission. This will create a token that we need to copy.

To manage the token, we will create a file called .env in the directory we are working in, and paste the copied token into it like this:

RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN="hf_...."
      

To read the token from that file, we need to have dotenv installed, which we install with

pip install python-dotenv
      

And we run the following

import os
import dotenv
dotenv.load_dotenv()
RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN = os.getenv("RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN")

Now that we have a token, we create a client. For this, we need to have the huggingface_hub library installed. We install it using conda or pip.

conda install -c conda-forge huggingface_hub
      

or

pip install --upgrade huggingface_hub
      

Now we have to choose which model to use. You can see the available models on the Supported models page of the Hugging Face API Inference documentation.

At the time of writing this post, the best available model is Qwen2.5-72B-Instruct, so that is the one we will use.

MODEL = "Qwen/Qwen2.5-72B-Instruct"

Now we can create the client

from huggingface_hub import InferenceClient
client = InferenceClient(api_key=RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN, model=MODEL)
client

<InferenceClient(model='Qwen/Qwen2.5-72B-Instruct', timeout=None)>

We run a test to see if it works

message = [
    { "role": "user", "content": "Hola, qué tal?" }
]

stream = client.chat.completions.create(
    messages=message,
    temperature=0.5,
    max_tokens=1024,
    top_p=0.7,
    stream=False
)

response = stream.choices[0].message.content
print(response)

¡Hola! Estoy bien, gracias por preguntar. ¿Cómo estás tú? ¿En qué puedo ayudarte hoy?

What is RAG?

RAG stands for Retrieval Augmented Generation. It is a technique created to extract information from documents. Although LLMs can be very powerful and have a lot of knowledge, they will never be able to answer questions about private documents, such as reports from your company, internal documentation, etc. That is why RAG was created: to be able to use these LLMs on that private documentation.

What is RAG?

The idea is that a user asks a question about that private documentation, the system retrieves the part of the documentation where the answer is found, the question and the relevant part of the documentation are passed to an LLM, and the LLM generates the answer for the user.
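To make this flow concrete before building the real thing, here is a minimal toy sketch of the three steps. The chunks, the question, and the word-overlap "retrieval" are made up purely for illustration; the rest of the post replaces them with real documents, embeddings, and a vector database.

# Toy sketch of the RAG flow: retrieve -> augment -> generate
toy_chunks = [
    "The company was founded in 2010 and has 50 employees.",
    "The 2023 internal report shows a 12% increase in revenue.",
    "The cafeteria menu changes every Monday.",
]

question = "What does the 2023 internal report say about revenue?"

def toy_retrieve(question, chunks):
    # Deliberately naive retrieval: score each chunk by how many words it shares with the question
    question_words = set(question.lower().split())
    scores = [len(question_words & set(chunk.lower().split())) for chunk in chunks]
    return chunks[scores.index(max(scores))]

# Retrieve: find the most relevant chunk
relevant_chunk = toy_retrieve(question, toy_chunks)

# Augment: combine the question and the retrieved chunk into a single prompt
prompt = f"Answer the question using this context:\n\n{relevant_chunk}\n\nQuestion: {question}"
print(prompt)

# Generate: this prompt would then be sent to the LLM (like the client created above),
# which is exactly what we will do at the end of the post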

How is information stored?

It is well known (and if you didn't know, I'll tell you now) that LLMs have a limit on the amount of information that can be passed to them; this is called the context window. It is a consequence of the internal architecture of LLMs, which is not relevant right now. The important point is that you cannot simply pass a document and a question to an LLM, because it is likely that the LLM will not be able to process all that information.

When more information is passed than the context window allows, what typically happens is that the LLM does not pay attention to the end of the input. Imagine asking the LLM about something in your document, but that information is at the end of the document and the LLM never reads it.

Therefore, what is done is to divide the documentation into blocks called chunks. The documentation is stored as a collection of chunks, which are pieces of that documentation. This way, when the user asks a question, the chunk containing the answer to that question is passed to the LLM.

In addition to dividing the documentation into chunks, these are converted into embeddings, which are numerical representations of the chunks. This is done because LLMs actually do not understand text, but rather numbers, and the chunks are converted into numbers so that the LLM can understand them. If you want to learn more about embeddings, you can read my post on transformers where I explain how transformers work, which is the architecture underlying LLMs. You can also read my post on ChromaDB where I explain how embeddings are stored in a vector database. Additionally, it would be interesting for you to read my post about the HuggingFace Tokenizers library, which explains how text is tokenized, which is the step prior to generating embeddings.

RAG - embeddings

How do you get the correct chunk?

We have said that the documentation is divided into chunks and that the chunk containing the answer to the user's question is passed to the LLM. But how do we know which chunk contains the answer? To determine this, the user's question is converted into an embedding, and the similarity between the question's embedding and the embeddings of the chunks is calculated. The chunk with the highest similarity is the one passed to the LLM.
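This similarity is typically computed as the cosine similarity between the vectors. Here is a minimal sketch with made-up toy vectors, just to show the idea (the real embeddings we will use later have 384 dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings", values chosen only for illustration
question_embedding = np.array([0.9, 0.1, 0.0, 0.2])
chunk_embeddings = [
    np.array([0.8, 0.2, 0.1, 0.3]),  # chunk 0, semantically close to the question
    np.array([0.0, 0.9, 0.7, 0.1]),  # chunk 1, unrelated
]

similarities = [cosine_similarity(question_embedding, chunk) for chunk in chunk_embeddings]
most_relevant = int(np.argmax(similarities))
print(similarities, "-> most relevant chunk:", most_relevant)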

Let's revisit what RAG is

On one hand, we have the retrieval, which is obtaining the correct chunk from the documentation. On the other hand, we have the augmented, which involves passing the user's question and the chunk to the LLM. Finally, we have the generation, which is obtaining the response generated by the LLM.

Vector database

We have seen that the documentation is divided into chunks and stored in a vector database, so we need to use one. For this post, I will use ChromaDB, which is a widely used vector database, and I also have a post where I explain how it works.

So first we need to install the ChromaDB library, which we can do with conda or with pip

conda install conda-forge::chromadb
      

or

pip install chromadb
      

Embedding Function

As we have said, everything will be based on embeddings. So, the first thing we do is create a function to get embeddings from a text. We are going to use the model sentence-transformers/all-MiniLM-L6-v2.

import chromadb.utils.embedding_functions as embedding_functions

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN,
    model_name=EMBEDDING_MODEL
)

We test the embedding function

embedding = huggingface_ef(["Hello, how are you?",])
embedding[0].shape

(384,)

We obtain an embedding of dimension 384. Although the purpose of this post is not to explain embeddings, in short, our embedding function has mapped the phrase Hello, how are you? to a point in a 384-dimensional space.

ChromaDB client

Now that we have our embedding function, we can create a ChromaDB client.

First we create a folder where the vector database will be stored

from pathlib import Path
chroma_path = Path("chromadb_persisten_storage")
chroma_path.mkdir(exist_ok=True)

Now we create the client

from chromadb import PersistentClient
chroma_client = PersistentClient(path=str(chroma_path))

Collection

When we have the ChromaDB client, the next thing we need to do is create a collection. A collection is a set of vectors, in our case the chunks of the documentation.

We create it by indicating the embedding function we are going to use

collection_name = "document_qa_collection"
collection = chroma_client.get_or_create_collection(name=collection_name, embedding_function=huggingface_ef)

Document Loading

Now that we have created the vector database, we need to split the documentation into chunks and save them in the vector database.

Document Loading Function

First we create a function to load all the .txt documents from a directory

def load_one_document_from_directory(directory, file):
    with open(os.path.join(directory, file), "r") as f:
        return {"id": file, "text": f.read()}

def load_documents_from_directory(directory):
    documents = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            documents.append(load_one_document_from_directory(directory, file))
    return documents

Function to split the documentation into chunks

Once we have the documents, we divide them into chunks

def split_text(text, chunk_size=1000, chunk_overlap=20):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - chunk_overlap
    return chunks
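As a quick sanity check, this is what the function does on a toy string, with chunk_size and chunk_overlap reduced so the overlap between consecutive chunks is easy to see (these values are chosen just for illustration; we will keep the defaults for the real documents):

sample_text = "abcdefghijklmnopqrstuvwxyz"

for chunk in split_text(sample_text, chunk_size=10, chunk_overlap=3):
    print(chunk)

# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz

Each chunk starts with the last 3 characters of the previous one, which helps avoid cutting a sentence (or an idea) exactly at a chunk boundary.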

Function to generate embeddings for a chunk

Now that we have the chunks, we generate the embeddings for each of them.

We will see why later, but we are going to generate the embeddings locally rather than through the Hugging Face API. For this, we need to have PyTorch and sentence-transformers installed, which we do with

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = SentenceTransformer(EMBEDDING_MODEL).to(device)

def get_embeddings(text):
    try:
        embedding = embedding_model.encode(text, device=device)
        return embedding
    except Exception as e:
        print(f"Error: {e}")
        exit(1)

Let's test this embedding function locally now

text = "Hello, how are you?"
embedding = get_embeddings(text)
embedding.shape

(384,)

We see that we obtain an embedding of the same dimension as when we did it with the Hugging Face API.
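As an optional check, we can also verify that the API embedding function and the local model produce essentially the same vector. This is just a sketch, assuming the huggingface_ef and get_embeddings functions defined above:

import numpy as np

text = "Hello, how are you?"

api_embedding = np.array(huggingface_ef([text])[0])
local_embedding = get_embeddings(text)

# A cosine similarity close to 1.0 means both functions return (nearly) the same embedding
cosine = np.dot(api_embedding, local_embedding) / (np.linalg.norm(api_embedding) * np.linalg.norm(local_embedding))
print(f"Cosine similarity between API and local embeddings: {cosine:.4f}")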

The model sentence-transformers/all-MiniLM-L6-v2 has only 22M parameters, so you will be able to run it on any GPU. Even if you don't have a GPU, you will be able to run it on a CPU.

The LLM we will use to generate the responses, Qwen2.5-72B-Instruct, is, as its name suggests, a 72B-parameter model, so it cannot run on just any GPU, and running it on a CPU is out of the question because of how slow it would be. Therefore, we will use that LLM through the API, but the embeddings we can generate locally without any problem.

Documents We Will Be Testing With

To run all these tests, I downloaded the aws-case-studies-and-blogs dataset and placed it in the rag_txt_dataset folder. The following commands show how to download and extract it.

We create the folder where we are going to download the documents

!mkdir rag_txt_dataset

We download the .zip with the documents

!curl -L -o ./rag_txt_dataset/archive.zip https://www.kaggle.com/api/v1/datasets/download/harshsinghal/aws-case-studies-and-blogs

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 1430k 100 1430k 0 0 1082k 0 0:00:01 0:00:01 --:--:-- 2440k

We decompress the .zip

!unzip rag_txt_dataset/archive.zip -d rag_txt_dataset

Archive: rag_txt_dataset/archive.zip
inflating: rag_txt_dataset/23andMe Case Study _ Life Sciences _ AWS.txt
inflating: rag_txt_dataset/36 new or updated datasets on the Registry of Open Data_ AI analysis-ready datasets and more _ AWS Public Sector Blog.txt
inflating: rag_txt_dataset/54gene _ Case Study _ AWS.txt
inflating: rag_txt_dataset/6sense Case Study.txt
inflating: rag_txt_dataset/ADP Developed an Innovative and Secure Digital Wallet in a Few Months Using AWS Services _ Case Study _ AWS.txt
inflating: rag_txt_dataset/AEON Case Study.txt
inflating: rag_txt_dataset/ALTBalaji _ Amazon Web Services.txt
inflating: rag_txt_dataset/AWS Case Study - Ineos Team UK.txt
inflating: rag_txt_dataset/AWS Case Study - StreamAMG.txt
inflating: rag_txt_dataset/AWS Case Study_ Creditsafe.txt
inflating: rag_txt_dataset/AWS Case Study_ Immowelt.txt
inflating: rag_txt_dataset/AWS Customer Case Study _ Kepler Provides Effective Monitoring of Elderly Care Home Residents Using AWS _ AWS.txt
inflating: rag_txt_dataset/AWS announces 21 startups selected for the AWS generative AI accelerator _ AWS Startups Blog.txt
inflating: rag_txt_dataset/AWS releases smart meter data analytics _ AWS for Industries.txt
inflating: rag_txt_dataset/Accelerate Time to Business Value Using Amazon SageMaker at Scale with NatWest Group _ Case Study _ AWS.txt
inflating: rag_txt_dataset/Accelerate Your Analytics Journey on AWS with DXC Analytics and AI Platform _ AWS Partner Network (APN) Blog.txt
...
inflating: rag_txt_dataset/Zomato Saves Big by Using AWS Graviton2 to Power Data-Driven Business Insights.txt
inflating: rag_txt_dataset/Zoox Case Study _ Automotive _ AWS.txt
inflating: rag_txt_dataset/e-banner Streamlines Its Contact Center Operations and Facilitates a Fully Remote Workforce with Amazon Connect _ e-banner Case Study _ AWS.txt
inflating: rag_txt_dataset/iptiQ Case Study.txt
inflating: rag_txt_dataset/mod.io Provides Low Latency Gamer Experience Globally on AWS _ Case Study _ AWS.txt
inflating: rag_txt_dataset/myposter Case Study.txt

We delete the .zip

!rm rag_txt_dataset/archive.zip

Let's see what we have got

!ls rag_txt_dataset

'23andMe Case Study _ Life Sciences _ AWS.txt'
'36 new or updated datasets on the Registry of Open Data_ AI analysis-ready datasets and more _ AWS Public Sector Blog.txt'
'54gene _ Case Study _ AWS.txt'
'6sense Case Study.txt'
'Accelerate Time to Business Value Using Amazon SageMaker at Scale with NatWest Group _ Case Study _ AWS.txt'
'Accelerate Your Analytics Journey on AWS with DXC Analytics and AI Platform _ AWS Partner Network (APN) Blog.txt'
'Accelerating customer onboarding using Amazon Connect _ NCS Case Study _ AWS.txt'
'Accelerating Migration at Scale Using AWS Application Migration Service with 3M Company _ Case Study _ AWS.txt'
'Accelerating Time to Market Using AWS and AWS Partner AccelByte _ Omeda Studios Case Study _ AWS.txt'
'Achieving Burstable Scalability and Consistent Uptime Using AWS Lambda with TiVo _ Case Study _ AWS.txt'
'Acrobits Uses Amazon Chime SDK to Easily Create Video Conferencing Application Boosting Collaboration for Global Users _ Acrobits Case Study _ AWS.txt'
'Actuate AI Case study.txt'
'ADP Developed an Innovative and Secure Digital Wallet in a Few Months Using AWS Services _ Case Study _ AWS.txt'
'Adzuna doubles its email open rates using Amazon SES _ Adzuna Case Study _ AWS.txt'
'AEON Case Study.txt'
'ALTBalaji _ Amazon Web Services.txt'
'Amanotes Stays on Beat by Delivering Simple Music Games to Millions Worldwide on AWS.txt'
'Amazon OpenSearch Services vector database capabilities explained _ AWS Big Data Blog.txt'
'Anghami Case Study.txt'
'Announcing enhanced table extractions with Amazon Textract _ AWS Machine Learning Blog.txt'
...
'What Will Generative AI Mean for Your Business_ _ AWS Cloud Enterprise Strategy Blog.txt'
'Which Recurring Business Processes Can Small and Medium Businesses Automate_ _ AWS Smart Business Blog.txt'
Windsor.txt
'Wireless Car Case Study _ AWS IoT Core _ AWS.txt'
'Yamato Logistics (HK) case study.txt'
'Zomato Saves Big by Using AWS Graviton2 to Power Data-Driven Business Insights.txt'
'Zoox Case Study _ Automotive _ AWS.txt'

Time to create the chunks!

We load the documents with the function we created earlier

dataset_path = "rag_txt_dataset"
documents = load_documents_from_directory(dataset_path)

We check that it worked

for document in documents[0:10]:
    print(document["id"])

Run Jobs at Scale While Optimizing for Cost Using Amazon EC2 Spot Instances with ActionIQ _ ActionIQ Case Study _ AWS.txt
Recommend and dynamically filter items based on user context in Amazon Personalize _ AWS Machine Learning Blog.txt
Windsor.txt
Bank of Montreal Case Study _ AWS.txt
The Mill Adventure Case Study.txt
Optimize software development with Amazon CodeWhisperer _ AWS DevOps Blog.txt
Announcing enhanced table extractions with Amazon Textract _ AWS Machine Learning Blog.txt
THREAD _ Life Sciences _ AWS.txt
Deep Pool Optimizes Software Quality Control Using Amazon QuickSight _ Deep Pool Case Study _ AWS.txt
Upstox Saves 1 Million Annually Using Amazon S3 Storage Lens _ Upstox Case Study _ AWS.txt

Now we create the chunks.

chunked_documents = []
for document in documents:
    chunks = split_text(document["text"])
    for i, chunk in enumerate(chunks):
        chunked_documents.append({"id": f"{document['id']}_{i}", "text": chunk})

len(chunked_documents)

3611

As we can see, there are 3611 chunks. Since the daily limit of the Hugging Face API on the free account is 1000 calls, if we tried to create the embeddings for all the chunks through the API we would run out of calls before finishing.

Remember that this embedding model is very small, only 22M parameters, so it can run on almost any computer, faster or slower, but it will run.

Since we are only going to create the embeddings of the chunks once, even if we don't have a very powerful computer and it takes a long time, it will only run once. Later, when we want to ask questions about the documentation, that's when we will generate the embeddings of the prompt using the Hugging Face API and use the LLM with the API. So we will only have to go through the process of generating the embeddings of the chunks once.
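Moreover, since the embeddings will end up in a persistent ChromaDB collection, on later runs we could skip this step entirely when the collection is already populated. A minimal sketch of that guard, assuming the collection and chunked_documents defined above:

# If the persistent collection already holds all the chunks, there is no need to
# generate the embeddings (or upsert them) again on a later run
if collection.count() >= len(chunked_documents):
    print("Collection already populated, skipping embedding generation")
else:
    print(f"{collection.count()} of {len(chunked_documents)} chunks stored, generating embeddings")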

We generate the embeddings of the chunks

One last library to install: since generating the embeddings for the chunks will be slow, we will install tqdm to show a progress bar. We can install it with conda or pip, whichever you prefer.

conda install conda-forge::tqdm
      

or

pip install tqdm
      

We generate the embeddings of the chunks

import tqdm

progress_bar = tqdm.tqdm(chunked_documents)

for chunk in progress_bar:
    embedding = get_embeddings(chunk["text"])
    if embedding is not None:
        chunk["embedding"] = embedding
    else:
        print(f"Error with document {chunk['id']}")

100%|██████████| 3611/3611 [00:16<00:00, 220.75it/s]

We see an example

from random import randint

# randint is inclusive on both ends, so we subtract 1 to avoid going out of range
idx = randint(0, len(chunked_documents) - 1)

print(f"Chunk id: {chunked_documents[idx]['id']},\n\ntext: {chunked_documents[idx]['text']},\n\nembedding shape: {chunked_documents[idx]['embedding'].shape}")

Chunk id: BNS Group Case Study _ Amazon Web Services.txt_0,
text: Reducing Virtual Machines from 40 to 12
The founders of BNS had been contemplating a migration from the company’s on-premises data center to the public cloud and observed a growing demand for cloud-based operations among current and potential BNS customers.
Français
Configures security according to cloud best practices
Clive Pereira, R&D director at BNS Group, explains, “The database that records Praisal’s SMS traffic resides in Praisal’s AWS environment. Praisal can now run complete analytics across its data and gain insights into what’s happening with its SMS traffic, which is a real game-changer for the organization.” 
Español
AWS ISV Accelerate Program
Receiving Strategic, Foundational Support from ISV Specialists
Learn More
The value that AWS places on the ISV stream sealed the deal in our choice of cloud provider.”
日本語
Contact Sales
BNS is an Australian software provider focused on secure enterprise SMS and fax messaging. Its software runs on the Windows platform and is l,
embedding shape: (384,)

Load the chunks into the vector database

Once we have generated the embeddings for all the chunks, we load them into the vector database. We use tqdm again to show a progress bar, because this is also going to be slow.

import tqdm

progress_bar = tqdm.tqdm(chunked_documents)

for chunk in progress_bar:
    collection.upsert(
        ids=[chunk["id"]],
        documents=chunk["text"],
        embeddings=chunk["embedding"],
    )

100%|██████████| 3611/3611 [00:59<00:00, 60.77it/s]

Questions

Now that we have the vector database, we can ask questions about the documentation. For this, we need a function that returns the correct chunks.

Get the correct chunk

Now we need a function that returns the most relevant chunks, so let's create it

def get_top_k_documents(query, k=5):
    results = collection.query(query_texts=query, n_results=k)
    return results

Finally, we create a query.

To generate the query, I randomly selected the document Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt, passed it to an LLM, and asked it to generate a question about the document. The question it generated is

How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?

So we get the most relevant chunks in response to that question

query = "How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?"
top_chunks = get_top_k_documents(query=query, k=5)

Let's see which chunks it has returned

for i in range(len(top_chunks["ids"][0])):
    print(f"Rank {i+1}: {top_chunks['ids'][0][i]}, distance: {top_chunks['distances'][0][i]}")

Rank 1: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_0, distance: 0.29233667254447937
Rank 2: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_5, distance: 0.4007825255393982
Rank 3: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_1, distance: 0.4317566752433777
Rank 4: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_6, distance: 0.43832334876060486
Rank 5: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_4, distance: 0.44625571370124817

As I had mentioned, the document I randomly selected was Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt, and as you can see, the chunks it returned are from that document. That is, out of more than 3000 chunks in the database, it was able to return the most relevant chunks in response to that question. It seems this works!

Generate the response

Since we already have the most relevant chunks, we pass them to the LLM, along with the question, so that it can generate an answer.

def generate_response(query, relevant_chunks, temperature=0.5, max_tokens=1024, top_p=0.7, stream=False):
    context = "\n\n".join([chunk for chunk in relevant_chunks])
    prompt = f"You are an assistant for question-answering. You have to answer the following question:\n\n{query}\n\nAnswer the question with the following information:\n\n{context}"
    message = [
        { "role": "user", "content": prompt }
    ]
    stream = client.chat.completions.create(
        messages=message,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        stream=stream,
    )
    response = stream.choices[0].message.content
    return response

We test the function

response = generate_response(query, top_chunks["documents"][0])
print(response)

Neeva, a cloud-native, ad-free search engine founded in 2019, has leveraged Karpenter and Amazon EC2 Spot Instances to significantly improve its infrastructure management and cost optimization. Here’s how:
### Early Collaboration with Karpenter
In late 2021, Neeva began working closely with the Karpenter team, experimenting with and contributing fixes to an early version of Karpenter. This collaboration allowed Neeva to integrate Karpenter with its Kubernetes dashboard, enabling the company to gather valuable metrics on usage and performance.
### Combining Spot Instances and On-Demand Instances
Neeva runs its jobs on a large scale, which can lead to significant costs. To manage these costs effectively, the company adopted a combination of Amazon EC2 Spot Instances and On-Demand Instances. Spot Instances allow Neeva to bid on unused EC2 capacity, often at a fraction of the On-Demand price, while On-Demand Instances provide the necessary reliability for critical pipelines.
### Flexibility and Instance Diversification
According to Mohit Agarwal, infrastructure engineering lead at Neeva, Karpenter's adoption of best practices for Spot Instances, including flexibility and instance diversification, has been crucial. This approach ensures that Neeva can dynamically adjust its compute resources to meet varying workloads while minimizing costs.
### Improved Scalability and Agility
By using Karpenter to provision infrastructure resources for its Amazon EKS clusters, Neeva has achieved several key benefits:
- **Scalability**: Neeva can scale its compute resources up or down as needed, ensuring that it always has the necessary capacity to handle its workloads.
- **Agility**: The company can iterate quickly and democratize infrastructure changes, reducing the time spent on systems administration by up to 100 hours per week.
### Enhanced Development Cycles
The integration of Karpenter and Spot Instances has also accelerated Neeva's development cycles. The company can now launch new features and improvements more rapidly, which is essential for maintaining a competitive edge in the search engine market.
### Cost Savings and Budget Control
Using Spot Instances, Neeva has been able to stay within its budget while meeting its performance requirements. This cost optimization is critical for a company that prioritizes user-first experiences and has no competing incentives from advertising.
### Future Plans
Neeva is committed to continuing its innovation and expansion. The company plans to launch in new regions and further improve its search engine, all while maintaining cost efficiency. As Mohit Agarwal notes, "The bulk of our compute is or will be managed using Karpenter going forward."
### Conclusion
By leveraging Karpenter and Amazon EC2 Spot Instances, Neeva has not only optimized its infrastructure costs but also enhanced its scalability, agility, and development speed. This strategic approach has positioned Neeva to deliver high-quality, ad-free search experiences to its users while maintaining a strong focus on cost control and innovation.

When I asked the LLM to generate a question about the document, I also asked it to generate the correct answer. This is the answer that the LLM gave me.

Neeva used Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization in several ways:
      
      Simplified Instance Management:
      
      Karpenter: By adopting Karpenter, Neeva simplified the process of provisioning and managing compute resources for its Amazon EKS clusters. Karpenter automatically provisions and de-provisions instances based on the workload, eliminating the need for manual configurations and reducing the complexity of understanding different compute instances.
      Spot Instances: Neeva leveraged Amazon EC2 Spot Instances, which are unused EC2 capacity available at a significant discount (up to 90% cost savings). This allowed the company to control costs while meeting its performance requirements.
      Enhanced Scalability:
      
      Karpenter: Karpenter's ability to dynamically scale resources enabled Neeva to spin up new instances quickly, allowing the company to iterate at a higher velocity and run more experiments in less time.
      Spot Instances: The use of Spot Instances provided flexibility and instance diversification, making it easier for Neeva to scale its compute resources efficiently.
      Improved Productivity:
      
      Karpenter: By democratizing infrastructure changes, Karpenter allowed any engineer to modify Kubernetes configurations, reducing the dependency on specialized expertise. This saved the Neeva team up to 100 hours per week of wait time on systems administration.
      Spot Instances: The ability to quickly provision and de-provision Spot Instances reduced delays in the development pipeline, ensuring that jobs did not get stuck due to a lack of available resources.
      Cost Efficiency:
      
      Karpenter: Karpenter's best practices for Spot Instances, including flexibility and instance diversification, helped Neeva use these instances more effectively, staying within budget.
      Spot Instances: The cost savings from using Spot Instances allowed Neeva to run large-scale jobs, such as indexing, for nearly the same cost but in a fraction of the time. For example, Neeva reduced its indexing jobs from 18 hours to just 3 hours.
      Better Resource Utilization:
      
      Karpenter: Karpenter provided better visibility into compute resource usage, allowing Neeva to track and optimize its resource consumption more closely.
      Spot Instances: The combination of Karpenter and Spot Instances enabled Neeva to run large language models more efficiently, enhancing the search experience for its users.
      In summary, Neeva's adoption of Karpenter and Amazon EC2 Spot Instances significantly improved its infrastructure management, cost optimization, and overall development efficiency, which allowed the company to offer better ad-free search experiences to its users.
      

And this has been the response generated by our RAG

Neeva, a cloud-native, ad-free search engine founded in 2019, has leveraged Karpenter and Amazon EC2 Spot Instances to significantly improve its infrastructure management and cost optimization. Here’s how:
      
      ### Early Collaboration with Karpenter
      In 2021, Neeva began working closely with the Karpenter team, experimenting with and contributing fixes to an early version of Karpenter. This collaboration allowed Neeva to integrate Karpenter with its Kubernetes dashboard, enabling the company to gather valuable metrics on usage and performance.
      
      ### Combining Spot Instances and On-Demand Instances
      Neeva runs its jobs on a large scale, which can lead to significant costs. To manage these costs effectively, the company adopted a combination of Amazon EC2 Spot Instances and On-Demand Instances. Spot Instances allow Neeva to bid on unused EC2 capacity, often at a fraction of the On-Demand price, while On-Demand Instances provide the necessary reliability for critical pipelines.
      
      ### Flexibility and Instance Diversification
      According to Mohit Agarwal, infrastructure engineering lead at Neeva, Karpenter's adoption of best practices for Spot Instances, including flexibility and instance diversification, has been crucial. This approach ensures that Neeva can dynamically adjust its compute resources to meet varying workloads while minimizing costs.
      
      ### Improved Scalability and Agility
      By using Karpenter to provision infrastructure resources for its Amazon EKS clusters, Neeva has achieved several key benefits:
      - **Scalability**: Neeva can scale its compute resources up or down as needed, ensuring that it always has the necessary capacity to handle its workloads.
      - **Agility**: The company can iterate quickly and democratize infrastructure changes, reducing the time spent on systems administration by up to 100 hours per week.
      
      ### Enhanced Development Cycles
      The integration of Karpenter and Spot Instances has also accelerated Neeva's development cycles. The company can now launch new features and improvements more rapidly, which is essential for maintaining a competitive edge in the search engine market.
      
      ### Cost Savings and Budget Control
      Using Spot Instances, Neeva has been able to stay within its budget while meeting its performance requirements. This cost optimization is critical for a company that prioritizes user-first experiences and has no competing incentives from advertising.
      
      ### Future Plans
      Neeva is committed to continuing its innovation and expansion. The company plans to launch in new regions and further improve its search engine, all while maintaining cost efficiency. As Mohit Agarwal notes, "The bulk of our compute is or will be managed using Karpenter going forward."
      
      ### Conclusion
      By leveraging Karpenter and Amazon EC2 Spot Instances, Neeva has not only optimized its infrastructure costs but also enhanced its scalability, agility, and development speed. This strategic approach has positioned Neeva to deliver high-quality, ad-free search experiences to its users while maintaining a strong focus on cost control and innovation.
      

So we can conclude that the RAG has worked correctly!!!
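Putting the pieces together, the whole naive RAG loop can be wrapped in a single convenience function. This is just a sketch that reuses the get_top_k_documents and generate_response functions defined above; the name ask_rag is purely illustrative:

def ask_rag(question, k=5):
    # Retrieval: get the k most relevant chunks from the vector database
    top_chunks = get_top_k_documents(query=question, k=k)
    # Augmentation and generation: pass the question and the chunks to the LLM
    return generate_response(question, top_chunks["documents"][0])

answer = ask_rag("How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?")
print(answer)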

Limits of Naive RAG

As we have said, today we have explained naive RAG, which is the simplest RAG architecture, but it has its limitations.

Naive RAG architecture

The limitations of this architecture are:

Limits in Information Retrieval (retriever)

  • Limited knowledge of context and documentation: When the naive RAG system searches for chunks, it looks for those that have a semantically similar meaning to the prompt, but it is not capable of knowing which ones are the most relevant to the user's question, or which ones have the most up-to-date information, or whose information is more accurate than that of other chunks. For example, if a user asks about the problems of sweeteners in the digestive system, the naive RAG may return documents about sweeteners or about the digestive system, but it is not capable of knowing that the documents about the digestive system are the most relevant to the user's question. Another example is if the user asks about the latest advances in AI, but the naive RAG is not able to know which are the most recent papers in the database.

  • There is no synchronization between the retriever and the generator. As we have seen, they are two independent systems: the retriever searches for the documents most similar to the user's question, and those documents are passed to the generator, which generates a response.

  • Inefficient scaling with large databases. Since retrieval searches for documents with the highest semantic similarity across the entire database, when it becomes very large, we can experience very long search times.

  • Limited adaptation to the user's question. If the user asks a question that involves several documents, meaning there is no single document containing all the information from the user's question, the system will retrieve all those documents and pass them to the generator, which may or may not use them. In a worst-case scenario, it might overlook some relevant documents when generating the response.

Limits in Response Generation (Generator)

  • The model could hallucinate answers even when provided with relevant information.

  • The model could be limited by issues related to hate, discrimination, etc.

To overcome these limitations, more advanced RAG techniques are used, such as:

  • Pre-retrieval: techniques to improve indexing so that information retrieval is more efficient, or techniques such as enhancing the user's query to help the retriever find the most relevant documents.

  • Post-retrieval: here techniques such as document re-ranking are used to reorder the retrieved chunks by relevance before passing them to the generator, as shown in the sketch below.
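As a taste of what post-retrieval looks like, here is a minimal re-ranking sketch using a cross-encoder from sentence-transformers. The model cross-encoder/ms-marco-MiniLM-L-6-v2 is a common choice for this kind of re-ranking, but it is not something used elsewhere in this post:

from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, document) pairs jointly, which is usually more accurate
# than the embedding similarity used by the retriever, but slower, so it is applied
# only to the handful of chunks that were already retrieved
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

retrieved_chunks = top_chunks["documents"][0]  # the chunks returned by get_top_k_documents

scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked_chunks = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), key=lambda pair: pair[0], reverse=True)]

# The generator then receives the re-ordered (and possibly truncated) list of chunks
response = generate_response(query, reranked_chunks[:3])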
