Vector databases
We have seen in the embeddings post that embeddings are a way of representing words in a vector space. In this post we are going to see how we can store those embeddings in vector databases and how we can make queries on them.
When we have a query, we can create the embedding of the query, search the vector database for the embeddings that most closely match the query and return the documents that correspond to those embeddings or an explanation of those documents.
That is to say, we are going to generate a database of information, we are going to create embeddings of that information and we are going to store it in a vector database. Then when a user makes a query, we will convert the query to embeddings, search the database for the embeddings with the highest similarity and return the documents that correspond to those embeddings.
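The flow described above can be sketched without any library. This is only an illustration: `toy_embed`, `VOCAB` and `ToyVectorStore` are hypothetical names, and the word-count "embedding" is a stand-in for a real embedding model.

```python
from collections import Counter

# Hypothetical toy vocabulary; a real system would use a learned embedding model.
VOCAB = ["python", "javascript", "docs", "query", "rust"]


def toy_embed(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyVectorStore:
    """Stores documents, metadata and their embeddings in a plain list."""

    def __init__(self):
        self.records = []

    def add(self, doc_id, document, metadata):
        self.records.append({
            "id": doc_id,
            "document": document,
            "metadata": metadata,
            "embedding": toy_embed(document),
        })

    def query(self, text, n_results=1):
        # Embed the query and return the records with the smallest distance.
        query_embedding = toy_embed(text)
        ranked = sorted(
            self.records,
            key=lambda record: squared_l2(query_embedding, record["embedding"]),
        )
        return ranked[:n_results]


store = ToyVectorStore()
store.add("id1", "This is a python docs", {"source": "Python source"})
store.add("id2", "This is JavaScript docs", {"source": "JavaScript source"})

best = store.query("This is a query of Python", n_results=1)[0]
print(best["id"], best["metadata"])
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index, but the steps (embed, compare, return documents and metadata) are the same.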
In addition to the documents, additional information can be stored in the database, which we will call metadata. For example, if we are working with a set of news, we can store the title, date, author, etc. of the news.
Chroma
In this post we are going to look at Chroma, as it is the most widely used vector database according to the LangChain State of AI 2023 report.
Installation
To install Chroma with conda, run the following:
```bash
conda install conda-forge::chromadb
```
Or, if you prefer to install it with pip:
```bash
pip install chromadb
```
Quick use
For a quick application, we first import chroma
import chromadb
Next we create a chroma client
chroma_client = chromadb.Client()
We create a collection. A collection is the place where documents, embeddings and metadata will be stored.
collection = chroma_client.create_collection(name="my_collection")
As we can see, a message appears indicating that no embedding function has been provided, so it will default to all-MiniLM-L6-v2, which is similar to the paraphrase-MiniLM-L6-v2 model we used in the embeddings post.
We will see this later, but we can choose how we are going to generate the embeddings.
Now we add documents, ids and metadata to the collection
collection.add(
    documents=["This is a python docs", "This is JavaScript docs"],
    metadatas=[{"source": "Python source"}, {"source": "JavaScript source"}],
    ids=["id1", "id2"]
)
Now we can make a query
results = collection.query(
    query_texts=["This is a query of Python"],
    n_results=2
)
results
{'ids': [['id1', 'id2']],'distances': [[0.6205940246582031, 1.4631636142730713]],'metadatas': [[{'source': 'Python source'}, {'source': 'JavaScript source'}]],'embeddings': None,'documents': [['This is a python docs', 'This is JavaScript docs']],'uris': None,'data': None}
As we can see, the distance to id1 is smaller than the distance to id2, so document 1 appears to be the better match for the query.
Persistent databases
The database we created before is temporary: as soon as we close the notebook, it disappears. To create a persistent database, we have to give Chroma the path where it should be saved.
First we are going to create the folder where to save the database
from pathlib import Path

chroma_path = Path("chromadb")
chroma_path.mkdir(exist_ok=True)
Now we create a client in the folder we created
chroma_client_persistent = chromadb.PersistentClient(path=str(chroma_path))
Collections
Create collections
When creating a collection, a name must be specified. The name must meet the following requirements:
- The length of the name must be between 3 and 63 characters.
- The name must begin and end with a lowercase letter or digit and may contain periods, hyphens and underscores in the middle.
- The name must not contain two consecutive periods.
- The name must not be a valid IP address.
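The rules above can be checked with a few lines of standard-library Python. This is a sketch of the rules as listed in this post, not Chroma's own validation code: `is_valid_collection_name` is a hypothetical helper, and it reads the third rule as "no two consecutive periods".

```python
import ipaddress
import re

# A lowercase letter or digit at both ends; periods, hyphens and underscores
# are allowed in between. Rejecting uppercase letters everywhere is an
# assumption of this sketch.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")


def is_valid_collection_name(name):
    """Return True if the name satisfies the four rules listed above."""
    if not 3 <= len(name) <= 63:
        return False
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:
        return False
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True  # not an IP address, so the name is acceptable
    return False


for candidate in ["my_collection", "ab", "a..b", "192.168.0.1"]:
    print(candidate, is_valid_collection_name(candidate))
```

Checking names up front like this gives a clearer message than waiting for the database to reject them.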
We can also give it an embedding function. If we don't give it one, it will use the all-MiniLM-L6-v2 function by default.
collection = chroma_client.create_collection(name="my_other_collection")
As you can see, a second collection has been created for the same client, chroma_client, so a single client can have several collections.
Retrieve collections
If we want to retrieve a collection from a client, we can do it with the get_collection method.
collection = chroma_client.get_collection(name="my_collection")
Retrieve or create collections
We can obtain collections and, if they do not exist, create them with the get_or_create_collection method.
collection = chroma_client.get_or_create_collection(name="my_tird_collection")
Delete collections
We can delete a collection with the delete_collection method.
chroma_client.delete_collection(name="my_tird_collection")
Get items from collections
We can obtain the first 10 items of the collection with the peek method.
collection = chroma_client.get_collection(name="my_collection")
collection.peek()
{'ids': ['id1', 'id2'],'embeddings': [[-0.06924048811197281,0.061624377965927124,-0.090973399579525,0.013923337683081627,0.006247623357921839,-0.1078396588563919,-0.012472339905798435,0.03485661745071411,-0.06300634145736694,-0.00880391988903284,0.06879935413599014,0.0564003586769104,0.07040536403656006,-0.020754728466272354,-0.04048658534884453,-0.006666888482868671,-0.0953674241900444,0.049781784415245056,0.021780474111437798,-0.06344643980264664,0.06119797006249428,0.0834411084651947,-0.034758951514959335,0.0029120452236384153,...-0.013378280214965343]],'metadatas': [{'source': 'Python source'}, {'source': 'JavaScript source'}],'documents': ['This is a python docs', 'This is JavaScript docs'],'uris': None,'data': None}
In this case only two documents have been returned, because our collection only has two documents.
If you want to obtain a different number of items, you can specify it with the limit argument.
collection.peek(limit=1)
{'ids': ['id1'],'embeddings': [[-0.06924048811197281,0.061624377965927124,-0.090973399579525,0.013923337683081627,0.006247623357921839,-0.1078396588563919,-0.012472339905798435,0.03485661745071411,-0.06300634145736694,-0.00880391988903284,0.06879935413599014,0.0564003586769104,0.07040536403656006,-0.020754728466272354,-0.04048658534884453,-0.006666888482868671,-0.0953674241900444,0.049781784415245056,0.021780474111437798,-0.06344643980264664,0.06119797006249428,0.0834411084651947,-0.034758951514959335,0.0029120452236384153,...0.012315398082137108]],'metadatas': [{'source': 'Python source'}],'documents': ['This is a python docs'],'uris': None,'data': None}
Obtain the total number of items in the collections
We can obtain the total number of items in the collection with the count method.
collection.count()
2
Change similarity function
Earlier, when we made a query, we got the distance between the stored embeddings and our query. By default a collection uses squared L2 as its distance function, but we can specify which function we want to use. The possibilities are:
- Squared L2 (l2)
- Inner product (ip)
- Cosine similarity (cosine)
In the post Measurement of similarity between embeddings we saw L2 and cosine similarity, if you want to go deeper into them.
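As a quick reference, the three options can be written in plain Python. This sketch assumes the definitions given in the Chroma documentation, where all three are returned as distances (smaller means more similar); the function names are mine.

```python
import math


def squared_l2(a, b):
    """l2: sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """ip: one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """cosine: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]
print(squared_l2(a, b))              # 2.0
print(inner_product_distance(a, b))  # 1.0
print(cosine_distance(a, b))         # 1.0
```

Cosine distance ignores vector length, so `cosine_distance([2.0, 0.0], [1.0, 0.0])` is 0.0 even though the squared L2 distance between those vectors is 1.0.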
So we can create collections with another distance function using the argument metadata={"hnsw:space": <function>}.
collection = chroma_client.create_collection(name="colection_cosine", metadata={"hnsw:space": "cosine"})
Add data to the collection
Add documents
Let's look again at the data we have in the collection with the peek method:
collection.peek()
{'ids': [],'embeddings': [],'metadatas': [],'documents': [],'uris': None,'data': None}
As we can see, it is empty. That is because the last collection we created is the one with the cosine function, and we have not added any data to it. Let's confirm this by getting the name of the collection:
collection.name
'colection_cosine'
So we bring back the first collection we have created, which we have entered data for.
collection = chroma_client.get_collection(name = "my_collection")
Now we can add data to the collection with the add method:
collection.add(
    documents=["This is a Mojo docs", "This is Rust docs"],
    metadatas=[{"source": "Mojo source"}, {"source": "Rust source"}],
    ids=["id3", "id4"]
)
As you can see, the IDs are consecutive and do not repeat the earlier values, since IDs have to be unique.
If we try to add data with repeated IDs, Chroma will indicate that data with those IDs already exists.
collection.add(
    documents=["This is a Pytorch docs", "This is TensorRT docs"],
    metadatas=[{"source": "Pytorch source"}, {"source": "TensorRT source"}],
    ids=["id3", "id4"]
)
Add of existing embedding ID: id3
Add of existing embedding ID: id4
Insert of existing embedding ID: id3
Insert of existing embedding ID: id4
We have not been able to add the Pytorch and TensorRT documents.
Let's take a look at the collection data
collection.peek()
{'ids': ['id1', 'id2', 'id3', 'id4'],'embeddings': [[-0.06924048811197281,0.061624377965927124,-0.090973399579525,0.013923337683081627,0.006247623357921839,-0.1078396588563919,-0.012472339905798435,0.03485661745071411,-0.06300634145736694,-0.00880391988903284,0.06879935413599014,0.0564003586769104,0.07040536403656006,-0.020754728466272354,-0.04048658534884453,-0.006666888482868671,-0.0953674241900444,0.049781784415245056,0.021780474111437798,-0.06344643980264664,0.06119797006249428,0.0834411084651947,-0.034758951514959335,0.0029120452236384153,...{'source': 'JavaScript source'},{'source': 'Mojo source'},{'source': 'Rust source'}],'documents': ['This is a python docs','This is JavaScript docs','This is a Mojo docs','This is Rust docs'],'uris': None,'data': None}
As we can see, the original contents of id3 and id4 have been kept.
Add embeddings
We can add embeddings directly without adding documents. This does not make much sense, though, because if we only add the embeddings there will be no documents to retrieve when we make a query.
We obtain some embeddings to create others with the same dimensions.
embedding1 = collection.peek(1)['embeddings']
len(embedding1), len(embedding1[0])
(1, 384)
We create a new embedding filled with ones, so that we can recognize it as the one we created.
new_embedding = [1] * len(embedding1[0])
new_embedding = [new_embedding]
len(new_embedding), len(new_embedding[0])
(1, 384)
Now we add the new embeddings
collection.add(embeddings=new_embedding,metadatas=[{"source": "Only embeddings"}],ids=["id5"])
Let's take a look at the collection data
collection.peek()['embeddings'][-1]
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]
The last element of the collection has the embedding that we added.
Note: If we try to add embeddings with a different size than the ones already in the collection, we will get an error.
new_embedding_differetn_size = [1] * (len(embedding1[0]) - 1)
new_embedding_differetn_size = [new_embedding_differetn_size]
len(new_embedding_differetn_size), len(new_embedding_differetn_size[0])
(1, 383)
As you can see the embedding dimension is 383, instead of 384.
collection.add(embeddings=new_embedding_differetn_size,metadatas=[{"source": "New embeddings different size"}],ids=["id6"])
---------------------------------------------------------------------------InvalidDimensionException Traceback (most recent call last)Cell In[28], line 1----> 1 collection.add(2 embeddings=new_embedding_differetn_size,3 metadatas=[{"source": "New embeddings different size"}],4 ids=["id6"]5 )File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/api/models/Collection.py:168, in Collection.add(self, ids, embeddings, metadatas, documents, images, uris)163 raise ValueError(164 "You must set a data loader on the collection if loading from URIs."165 )166 embeddings = self._embed(self._data_loader(uris))--> 168 self._client._add(ids, self.id, embeddings, metadatas, documents, uris)File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py:127, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)125 global tracer, granularity126 if trace_granularity < granularity:--> 127 return f(*args, **kwargs)128 if not tracer:129 return f(*args, **kwargs)File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/api/segment.py:375, in SegmentAPI._add(self, ids, collection_id, embeddings, metadatas, documents, uris)365 records_to_submit = []366 for r in _records(367 t.Operation.ADD,368 ids=ids,(...)373 uris=uris,374 ):--> 375 self._validate_embedding_record(coll, r)376 records_to_submit.append(r)377 self._producer.submit_embeddings(coll["topic"], records_to_submit)File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py:127, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)125 global tracer, granularity126 if trace_granularity < granularity:--> 127 return f(*args, **kwargs)128 if not tracer:129 return f(*args, **kwargs)File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/api/segment.py:799, in SegmentAPI._validate_embedding_record(self, collection, record)797 add_attributes_to_current_span({"collection_id": 
str(collection["id"])})798 if record["embedding"]:--> 799 self._validate_dimension(collection, len(record["embedding"]), update=True)File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py:127, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)125 global tracer, granularity126 if trace_granularity < granularity:--> 127 return f(*args, **kwargs)128 if not tracer:129 return f(*args, **kwargs)File ~/miniforge3/envs/crhomadb/lib/python3.11/site-packages/chromadb/api/segment.py:814, in SegmentAPI._validate_dimension(self, collection, dim, update)812 self._collection_cache[id]["dimension"] = dim813 elif collection["dimension"] != dim:--> 814 raise InvalidDimensionException(815 f"Embedding dimension {dim} does not match collection dimensionality {collection['dimension']}"816 )817 else:818 returnInvalidDimensionException: Embedding dimension 383 does not match collection dimensionality 384
Add documents and embeddings
Chroma also allows us to add documents and embeddings at the same time. In that case it does not compute the embeddings of the document; it stores the ones we provide.
collection.add(documents=["This is a Pytorch docs"],embeddings=new_embedding,metadatas=[{"source": "Pytorch source"}],ids=["id6"])
If we look at the embeddings of the last element of the collection, we will see that these are the ones we added
collection.peek()['embeddings'][-1]
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]
Queries
Queries by documents
To make a query we use the query method. With the n_results parameter we can specify how many results we want to get.
collection.query(query_texts=["python"],n_results=1,)
{'ids': [['id1']],'distances': [[0.5389559268951416]],'metadatas': [[{'source': 'Python source'}]],'embeddings': None,'documents': [['This is a python docs']],'uris': None,'data': None}
If instead of n_results=1 we set a higher value, it will return more results.
collection.query(
query_texts=["python"],
n_results=10,
)
We can filter by a metadata value with the where argument.
collection.query(
query_texts=["python"],
n_results=10,
where={"source": "Python source"}
)
We can see that now only 1 result is returned.
We can also filter by document content with the where_document argument.
collection.query(
query_texts=["python"],
n_results=10,
where_document={"$contains": "python"}
)
We will see the possibilities we have here later on.
When we make a query we can choose which data we want returned, for example only the embeddings, only the metadata, or several fields, by listing them in the include argument.
collection.query(
query_texts=["python"],
n_results=10,
include=["documents", "distances"]
)
We see that now metadatas is None.
Several queries at once
We can make several queries to the collection at the same time. To do this, we pass a list to the query_texts parameter.
collection.query(
    query_texts=["programming language", "high level", "multi propuse"],
    n_results=1,
)
{'ids': [['id1'], ['id1'], ['id3']],'distances': [[1.152251958847046], [1.654376745223999], [1.6786067485809326]],'metadatas': [[{'source': 'Python source'}],[{'source': 'Python source'}],[{'source': 'Mojo source'}]],'embeddings': None,'documents': [['This is a python docs'],['This is a python docs'],['This is a Mojo docs']],'uris': None,'data': None}
For each query it returned a result
This is very useful when the database is hosted on a server and we are charged for each query we make. So instead of making a query for each question we have, we make a query with all the questions we have.
Queries by embeddings
When we make a query by documents, what Chroma does is calculate the embedding of the query_texts and look for the documents that most resemble that embedding. But if we already have the embedding, we can make the query directly with it.
Let's first get the embedding of a query with the same embedding function of the collections
query_texts = ["python language"]
query_embeddings = collection._embedding_function(query_texts)
query_embeddings
[[-0.04816831275820732,0.014662696048617363,-0.031021444126963615,0.008308809250593185,-0.07176128774881363,-0.10355626791715622,0.06690476089715958,0.04229631647467613,-0.03681119903922081,-0.04993892088532448,0.03186540678143501,0.015252595767378807,0.0642094686627388,0.018130118027329445,0.016300885006785393,-0.028082313016057014,-0.03994889184832573,0.023195551708340645,0.004547565709799528,-0.11764183640480042,0.019792592152953148,0.0496944822371006,-0.013253907673060894,0.03610404208302498,0.030529780313372612,-0.01815914921462536,-0.009753326885402203,0.03412770479917526,0.03020440600812435,...0.02079579420387745,-0.00972712505608797,0.13462257385253906,0.15277136862277985,-0.028574923053383827]]
Now we can make the query with embedding
collection.query(query_embeddings=query_embeddings,n_results=1,)
{'ids': [['id1']],'distances': [[0.6297433376312256]],'metadatas': [[{'source': 'Python source'}]],'embeddings': None,'documents': [['This is a python docs']],'uris': None,'data': None}
As before, we can get more results by increasing the value of the n_results parameter, and we can filter with the where and where_document parameters. We can also make several queries at once, and we can specify which data we want returned with the include parameter.
Note: If we try to make a query with an embedding of a different dimension than the ones already in the collection, we will get an error.
Retrieve documents by ID
If we know the ID of a document, we can retrieve the document with the get method.
collection.get(ids=["id1"],)
{'ids': ['id1'],'embeddings': None,'metadatas': [{'source': 'Python source'}],'documents': ['This is a python docs'],'uris': None,'data': None}
Multiple documents can also be retrieved at once.
collection.get(ids=["id1", "id2", "id3"],)
{'ids': ['id1', 'id2', 'id3'],'embeddings': None,'metadatas': [{'source': 'Python source'},{'source': 'JavaScript source'},{'source': 'Mojo source'}],'documents': ['This is a python docs','This is JavaScript docs','This is a Mojo docs'],'uris': None,'data': None}
As before, we can filter with the where and where_document arguments, and we can specify which data we want returned with the include parameter.
Filtering
As we have seen, it is possible to filter by metadata with the where parameter, and by document content with the where_document parameter.
Filtering by metadata
Recall that we entered the metadata as a dictionary:
```python
collection.add(
    documents=["This is a python docs", "This is JavaScript docs"],
    metadatas=[{"source": "Python source"}, {"source": "JavaScript source"}],
    ids=["id1", "id2"]
)
```
The first thing we have to do is indicate the metadata key by which we want to filter. Then we add an operator and the value:
```python
{
    "metadata_field": {
        <Operator>: <Value>
    }
}
```
The possible values of the operator are:
- $eq - equal to (string, int, float)
- $ne - not equal to (string, int, float)
- $gt - greater than (int, float)
- $gte - greater than or equal to (int, float)
- $lt - less than (int, float)
- $lte - less than or equal to (int, float)
Let's see now a query
collection.query(query_texts=["python"],n_results=1,where={"source":{"$eq": "Python source"}})
{'ids': [['id1']],'distances': [[0.5389559268951416]],'metadatas': [[{'source': 'Python source'}]],'embeddings': None,'documents': [['This is a python docs']],'uris': None,'data': None}
If we do not set an operator, the default is $eq. That is, this
```python
{
    "metadata_field": {
        "$eq": <Value>
    }
}
```
is the same as this
```python
{
    "metadata_field": <Value>
}
```
Note: Chroma will only search items that have the metadata key used in the filter. For example, if you search with where={"version": {"$ne": 1}}, it will only return items that have a version key in their metadata and whose value is not 1.
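To make these semantics concrete, here is a small pure-Python sketch of how a where filter could be evaluated against one item's metadata. It follows the behaviour described in this post (default $eq, items without the key are skipped); `matches_where` is a hypothetical helper, not part of Chroma.

```python
import operator

# Map each operator name from the list above to a Python comparison.
_OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_where(metadata, where):
    """Return True if the metadata dictionary satisfies every condition."""
    for field, condition in where.items():
        # Items that lack the key never match, not even for $ne.
        if field not in metadata:
            return False
        # A bare value is shorthand for {"$eq": value}.
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op_name, expected in condition.items():
            if not _OPERATORS[op_name](metadata[field], expected):
                return False
    return True


print(matches_where({"source": "Python source"}, {"source": "Python source"}))
print(matches_where({"source": "Python source"}, {"version": {"$ne": 1}}))
```

The first call prints True because the bare value is treated as $eq, and the second prints False because the item has no version key.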
Filtering by document content
When filtering by document content we have two possible keys: $contains and $not_contains.
For example, we look for the items in the collection whose document contains the word python:
collection.query(
query_texts=["python"],
n_results=10,
where_document={"$contains": "python"}
)
And all the items in the collection whose document does not contain the word python:
collection.query(
query_texts=["python"],
n_results=10,
where_document={"$not_contains": "python"}
)
We can also use the logical operators $and and $or to make more complex queries.
```python
{
    "$and": [
        {
            <Operator>: <Value>
        },
        {
            <Operator>: <Value>
        }
    ]
}
```
```python
{
    "$or": [
        {
            <Operator>: <Value>
        },
        {
            <Operator>: <Value>
        }
    ]
}
```
For example, we search for all documents containing the words python and docs.
collection.query(
query_texts=["python"],
n_results=10,
where_document=
{
"$and": [
{"$contains": "python"},
{"$contains": "docs"},
],
},
)
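The way these document filters combine can be sketched in plain Python. This is an illustration only: `matches_document` is a hypothetical helper, and it assumes $contains is a case-sensitive substring check.

```python
def matches_document(document, where_document):
    """Return True if the document text satisfies the filter."""
    results = []
    for key, value in where_document.items():
        if key == "$contains":
            results.append(value in document)
        elif key == "$not_contains":
            results.append(value not in document)
        elif key == "$and":
            # Every nested filter must match.
            results.append(all(matches_document(document, sub) for sub in value))
        elif key == "$or":
            # At least one nested filter must match.
            results.append(any(matches_document(document, sub) for sub in value))
        else:
            raise ValueError(f"Unsupported operator: {key}")
    return all(results)


documents = ["This is a python docs", "This is JavaScript docs"]
both_words = {"$and": [{"$contains": "python"}, {"$contains": "docs"}]}
print([doc for doc in documents if matches_document(doc, both_words)])
```

Because $and and $or take a list of filters, they can be nested to express conditions such as "contains docs and either python or JavaScript".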
Update data
Any item of data can be updated with the update method
collection.update(
    ids=["id1"],
    documents=["This is a updated Python docs"]
)
Let's see if it has been updated
collection.query(
query_texts=["python"],
n_results=10,
where_document={"$contains": "Python"}
)
Note: If you try to update an ID that does not exist, you will get an error.
Note: If we try to update an embedding with another embedding of a different size, we will get an error.
Update or add data
With the upsert method we can update an item if it already exists, or add it if it does not.
collection.upsert(
    ids=["id6"],
    documents=["This is a Pytorch docs"],
    metadatas=[{"source": "Pytorch source"}],
)
Let's see if it has been added to the collection
collection.peek()
{'ids': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],'embeddings': [[-0.08374718576669693,0.01027572900056839,-0.04819200187921524,0.01758415624499321,0.013158757239580154,-0.11435151100158691,-0.024248722940683365,-0.01319972239434719,-0.09626100957393646,-0.010561048053205013,0.09369225800037384,0.06017905846238136,0.031283188611269,0.014855983667075634,-0.0015984248602762818,0.023238031193614006,-0.04709107056260109,-0.007838696241378784,0.012870412319898605,-0.028354981914162636,-0.007653804495930672,0.09018168598413467,0.060235824435949326,0.0005205210763961077,...0.014388148672878742]],'metadatas': [{'source': 'Python source'},{'source': 'JavaScript source'},{'source': 'Mojo source'},{'source': 'Rust source'},{'source': 'Only embeddings'},{'source': 'Pytorch source'}],'documents': ['This is a updated Python docs','This is JavaScript docs','This is a Mojo docs','This is Rust docs',None,'This is a Pytorch docs'],'uris': None,'data': None}
We can see that it has.
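The difference between add, update and upsert can be summarized with a toy dictionary-backed store. It mirrors the behaviour described in this post (add keeps the existing item, update fails on a missing ID, upsert does either); `ToyCollection` is a hypothetical class, and real Chroma also handles embeddings and metadata.

```python
class ToyCollection:
    """Minimal stand-in for a collection that only stores documents by ID."""

    def __init__(self):
        self.documents = {}

    def add(self, doc_id, document):
        # Adding an existing ID leaves the original content untouched.
        if doc_id in self.documents:
            print(f"Add of existing ID: {doc_id}")
            return
        self.documents[doc_id] = document

    def update(self, doc_id, document):
        # Updating requires the ID to exist already.
        if doc_id not in self.documents:
            raise KeyError(f"ID not found: {doc_id}")
        self.documents[doc_id] = document

    def upsert(self, doc_id, document):
        # Update if present, add otherwise.
        self.documents[doc_id] = document


toy = ToyCollection()
toy.add("id1", "This is a python docs")
toy.add("id1", "This will be ignored")
toy.update("id1", "This is a updated Python docs")
toy.upsert("id6", "This is a Pytorch docs")
print(toy.documents)
```

In practice, upsert is the convenient choice when re-running an ingestion script, since it does not depend on whether the IDs were already stored.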
Delete data
We can delete data from a collection with the delete method.
We are going to delete the item with ID id5, which is the one we added with an embedding of all ones.
collection.delete(ids=["id5"])
Let's see if it has been removed
collection.peek()
{'ids': ['id1', 'id2', 'id3', 'id4', 'id6'],'embeddings': [[-0.08374718576669693,0.01027572900056839,-0.04819200187921524,0.01758415624499321,0.013158757239580154,-0.11435151100158691,-0.024248722940683365,-0.01319972239434719,-0.09626100957393646,-0.010561048053205013,0.09369225800037384,0.06017905846238136,0.031283188611269,0.014855983667075634,-0.0015984248602762818,0.023238031193614006,-0.04709107056260109,-0.007838696241378784,0.012870412319898605,-0.028354981914162636,-0.007653804495930672,0.09018168598413467,0.060235824435949326,0.0005205210763961077,...0.07033486664295197,0.014388148672878742]],'metadatas': [{'source': 'Python source'},{'source': 'JavaScript source'},{'source': 'Mojo source'},{'source': 'Rust source'},{'source': 'Pytorch source'}],'documents': ['This is a updated Python docs','This is JavaScript docs','This is a Mojo docs','This is Rust docs','This is a Pytorch docs'],'uris': None,'data': None}
We can see that it is no longer there.
Embeddings
As we have said, we can use different embedding functions, and if none is specified it will use all-MiniLM-L6-v2. On the Chroma embeddings documentation page we can see the different embedding functions available. As this can change, and some of them are paid and require an API key, we are going to explain only how to use the HuggingFace ones.
First we set the embedding function
import chromadb.utils.embedding_functions as embedding_functions

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="sentence-transformers/all-mpnet-base-v2"
)
In my case I use sentence-transformers/all-mpnet-base-v2, which is the most downloaded sentence-transformers model at the time of writing this post.
To use this embedding function in a collection, we pass it through the embedding_function argument when creating the collection.
collection = chroma_client.create_collection(
    name="colection_huggingface",
    embedding_function=huggingface_ef
)
We can check that we have added the new embedding function by calculating the embeddings of a word
embedding = collection._embedding_function(["python"])
len(embedding), len(embedding[0])
(1, 768)
The embedding length is 768.
If we now calculate the embedding with the embedding function of the previous collection:
collection = chroma_client.get_collection(name = "my_collection")
embedding = collection._embedding_function(["python"])
len(embedding), len(embedding[0])
(1, 384)
We see that the length of the embedding is now 384, which confirms that the new collection was using a different embedding function.
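The difference in length matters in practice: vectors produced by different embedding functions live in different spaces and cannot be compared directly, so a collection should be written and queried with the same function. The following is a minimal sketch in plain Python, not Chroma code; `cosine_similarity` and the short vectors are hypothetical stand-ins for the 768- and 384-dimensional embeddings above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings from two different models (assumption)
vector_model_a = [0.1, 0.3, 0.5, 0.7]  # pretend this came from the 768-dim model
vector_model_b = [0.2, 0.4]            # pretend this came from the 384-dim model

# Comparing a vector with itself works and gives the maximum similarity
same_model_score = cosine_similarity(vector_model_a, vector_model_a)

# Comparing vectors of different sizes is rejected
try:
    cosine_similarity(vector_model_a, vector_model_b)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```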
Multimodality
We can add image embeddings since chroma has built-in support for OpenCLIP. OpenCLIP is an open source implementation of CLIP (Contrastive Language-Image Pre-Training), an OpenAI neural network that embeds images and text in the same vector space, so that an image can be matched with the text that best describes it.
In order to use OpenCLIP, we have to install it with pip
```bash
pip install open-clip-torch
```
Once installed, we can use it to create embeddings of the following picture, which I have in my local path ../images/chromadb_dalle3.webp.
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
embedding_function = OpenCLIPEmbeddingFunction()
image = "../images/chromadb_dalle3.webp"
embedding = embedding_function(image)
len(embedding), len(embedding[0])
(30, 512)
As we can see, it creates an embedding of size 30x512.
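It is worth noting that 30 is exactly the number of characters in the path string. A plausible explanation, which is an assumption about the embedding function's internals and not something this post verifies, is that the function iterates over its input, so a bare string is treated as 30 one-character items instead of as one image. The sketch below reproduces that behaviour with a hypothetical `toy_embedding_function`; wrapping the input in a list gives one row per item.

```python
def toy_embedding_function(items, dim=4):
    """Hypothetical stand-in: returns one dummy vector per element of `items`."""
    return [[float(len(str(item)))] * dim for item in items]


image_path = "../images/chromadb_dalle3.webp"

# Passing the bare string: Python iterates over its characters
rows_from_string = toy_embedding_function(image_path)

# Passing a list with one element: a single item is embedded
rows_from_list = toy_embedding_function([image_path])

shape_from_string = (len(rows_from_string), len(rows_from_string[0]))
shape_from_list = (len(rows_from_list), len(rows_from_list[0]))
```

If the real function behaves the same way, passing a list with the loaded image array, rather than the path string, would be the way to obtain a single 512-dimensional image embedding; this is untested here.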
Chroma also comes with an image loader
from chromadb.utils.data_loaders import ImageLoader
data_loader = ImageLoader()
data = data_loader._load_image(image)
type(data), data.shape
(numpy.ndarray, (1024, 1024, 3))
So we can create a multimodal collection with this embedding function and the image loader
collection = chroma_client.create_collection(name="multimodal_collection",embedding_function=embedding_function,data_loader=data_loader)
And we can add the embeddings of the images
collection.add(ids=['id1'], images=[image])
Let's see what it has stored
collection.peek()
{'ids': ['id1'],'embeddings': [[-0.014372998848557472,0.0063015008345246315,-0.03794914484024048,-0.028725482523441315,-0.014304812066257,-0.04323698952794075,0.008670451119542122,-0.016066772863268852,-0.02365742437541485,0.07881983369588852,0.022775636985898018,0.004407387692481279,0.058205753564834595,-0.02389293536543846,-0.027586588636040688,0.05778728798031807,-0.2631031572818756,0.044124454259872437,0.010588622651994228,-0.035578884184360504,-0.041719693690538406,-0.0033654430881142616,-0.04731074720621109,-0.0019943572115153074,...0.04397008568048477,0.04396628588438034]],'metadatas': [None],'documents': [None],'uris': None,'data': None}
Chroma does not store the images, only the embeddings, so in order not to lose the relationship between the embeddings and the images, we can save the path to each image in the metadata. Let's use the update method to add the path to the image
collection.update(ids=['id1'],images=[image],metadatas=[{"source": image}])
If we look again at what the collection has stored, the metadata now contains the path.
collection.peek()
{'ids': ['id1'],'embeddings': [[-0.014372998848557472,0.0063015008345246315,-0.03794914484024048,-0.028725482523441315,-0.014304812066257,-0.04323698952794075,0.008670451119542122,-0.016066772863268852,-0.02365742437541485,0.07881983369588852,0.022775636985898018,0.004407387692481279,0.058205753564834595,-0.02389293536543846,-0.027586588636040688,0.05778728798031807,-0.2631031572818756,0.044124454259872437,0.010588622651994228,-0.035578884184360504,-0.041719693690538406,-0.0033654430881142616,-0.04731074720621109,-0.0019943572115153074,...0.04397008568048477,0.04396628588438034]],'metadatas': [{'source': '../images/chromadb_dalle3.webp'}],'documents': [None],'uris': None,'data': None}
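Since only the embeddings are stored, the metadata is what lets us get back from an id to the image file. Here is a minimal sketch, assuming a result shaped like the dictionary above, with parallel `ids` and `metadatas` lists. `source_by_id` is a hypothetical helper, not part of Chroma, and the dictionary is a hand-written copy with the embeddings omitted.

```python
def source_by_id(result):
    """Map each id to its 'source' metadata value (None when there is no metadata)."""
    mapping = {}
    for item_id, metadata in zip(result["ids"], result["metadatas"]):
        mapping[item_id] = metadata.get("source") if metadata else None
    return mapping


# Hand-written copy of the structure shown above, without the embeddings
peek_result = {
    "ids": ["id1"],
    "metadatas": [{"source": "../images/chromadb_dalle3.webp"}],
    "documents": [None],
}

sources = source_by_id(peek_result)
```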
As the collection is multimodal, we can add documents to it as before.
collection.add(
    ids=['id2', 'id3'],
    documents=["This is a python docs", "This is JavaScript docs"],
    metadatas=[{"source": "Python source"}, {"source": "JavaScript source"}]
)
collection.peek()
{'ids': ['id1', 'id2', 'id3'],'embeddings': [[-0.014372998848557472,0.0063015008345246315,-0.03794914484024048,-0.028725482523441315,-0.014304812066257,-0.04323698952794075,0.008670451119542122,-0.016066772863268852,-0.02365742437541485,0.07881983369588852,0.022775636985898018,0.004407387692481279,0.058205753564834595,-0.02389293536543846,-0.027586588636040688,0.05778728798031807,-0.2631031572818756,0.044124454259872437,0.010588622651994228,-0.035578884184360504,-0.041719693690538406,-0.0033654430881142616,-0.04731074720621109,-0.0019943572115153074,...-0.061795610934495926,-0.02433035336434841]],'metadatas': [{'source': '../images/chromadb_dalle3.webp'},{'source': 'Python source'},{'source': 'JavaScript source'}],'documents': [None, 'This is a python docs', 'This is JavaScript docs'],'uris': None,'data': None}
Finally, we can make queries with text
collection.query(
    query_texts=["persona trabajando en una mesa"],
)
With text we do not get the image as the first result, but the Python document instead.
But we can also query with images. In this case I am going to do it with this image
query_image = "https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/chromadb_elegant.webp"
collection.query(
    query_images=[query_image],
)
Now the first result is the image that we had stored.
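Conceptually, a query embeds the input and ranks the stored items by how close their embeddings are to it. The sketch below illustrates the idea with a brute-force search in plain Python. The three-dimensional vectors are made up for illustration and `rank_by_distance` is a hypothetical helper; this is not how Chroma is implemented internally.

```python
import math


def rank_by_distance(query_vector, stored):
    """Return the stored ids sorted from closest to farthest (Euclidean distance)."""
    distances = {
        item_id: math.dist(query_vector, vector)
        for item_id, vector in stored.items()
    }
    return sorted(distances, key=distances.get)


# Made-up embeddings: id1 plays the image, id2 and id3 play the text documents
stored_embeddings = {
    "id1": [0.9, 0.1, 0.0],
    "id2": [0.1, 0.9, 0.0],
    "id3": [0.0, 0.2, 0.9],
}

text_query = [0.2, 0.8, 0.1]   # made up so that it lands near a text document
image_query = [0.8, 0.2, 0.1]  # made up so that it lands near the stored image

text_ranking = rank_by_distance(text_query, stored_embeddings)
image_ranking = rank_by_distance(image_query, stored_embeddings)
```

With these toy values the text query ranks a text document first and the image query ranks the image first, which mirrors the behaviour seen in the two queries above.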