AMET-SOLID: Vector Data Base - Primer using Chroma DB

Sunday, 7 April 2024

Vector Data Base - Primer using Chroma DB

What is a vector database?

A vector database is a kind of database that is designed to store, index and retrieve data points with multiple dimensions, which are commonly referred to as vectors.

Each dimension captures a particular feature of the data. The two key concepts are vectors and similarity search. A vector is a numerical array. Similarity search uses indexing and search algorithms to find the stored vectors that are most relevant to a query.

A vector database stores data as embeddings, which are simply numerical codes that encapsulate the key characteristics of the object of interest.
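To make the idea of vectors and similarity search concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made toy values (an assumption for illustration, not real embeddings), and the helper names cosine_similarity and most_similar are hypothetical. A real vector database uses an index instead of comparing the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values). The dimensions loosely mean
# [is_animal, is_machine, has_wheels].
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "car": [0.0, 0.9, 0.8],
}
query = [0.1, 0.8, 0.9]  # toy vector standing in for the word "vehicle"

def most_similar(query_vec, vectors):
    """Brute-force search: return the key whose vector is closest to the query."""
    return max(vectors, key=lambda name: cosine_similarity(query_vec, vectors[name]))

best = most_similar(query, toy_vectors)
print(best)
```

With these toy values the "car" vector points in almost the same direction as the query, so it is returned as the best match.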

Difference between traditional and vector databases

A traditional database is based on scalar data types and a two-dimensional, table-like structure of rows and columns. It is optimized for transactional data and for exact matching through search queries.

A vector database is based on vector data types arranged in a multi-dimensional space. It is optimized for AI and ML applications and for search methods such as semantic search (for example, using cosine similarity) and similarity search.
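The exact-match side of this comparison can be sketched with Python's built-in sqlite3 module. The table name docs and the helper keyword_search are hypothetical. The two sentences are the same ones used in the hands-on section of this post.

```python
import sqlite3

# In-memory relational table with rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, category TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        ("id1", "animal", "This is a document about cat"),
        ("id2", "vehicle", "This is a document about car"),
    ],
)

def keyword_search(term):
    """Return ids of rows whose body contains the literal term."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE body LIKE ?", (f"%{term}%",)
    ).fetchall()
    return [row[0] for row in rows]

print(keyword_search("car"))      # the literal word is present in one row
print(keyword_search("vehicle"))  # no body text contains the word "vehicle"
```

The keyword search finds nothing for "vehicle" because the word never appears in the text, even though one sentence is about a car. This is the gap that semantic search in a vector database is meant to close.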

Benefits:

Fast and accurate search
Mostly correct responses
Semantic Understanding

Hands-on

Sign in to Colab using your Google account.

Create a new notebook named demo_chromadb.

Click +Code to add a code cell.

Add this code to install chromadb and sentence-transformers:

!pip install chromadb -q
!pip install sentence-transformers -q

Click +Code, add this code to create a client

import chromadb

client = chromadb.Client()
collection = client.create_collection("aucse_demo")

Click +Code, add this code to add documents

collection.add(
    documents=["This is a document about cat", "This is a document about car"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],
    ids=["id1", "id2"]
)

Click +Code, add this code to make a semantic query 

results = collection.query(
    query_texts=["vehicle"],
    n_results=1
)

print(results)

You will get output like this:

{'ids': [['id2']], 'distances': [[0.8069301843643188]],
 'metadatas': [[{'category': 'vehicle'}]],
 'embeddings': None,
 'documents': [['This is a document about car']],
 'uris': None, 'data': None
}
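The result is a dictionary of nested lists: the outer list has one entry per query text, and the inner list has one entry per returned document. A smaller distance means a closer match. The sketch below, using the output copied from above, shows one way to turn those parallel lists into one record per hit. The helper name flatten_hits is hypothetical.

```python
# Result dictionary copied from the output shown in this post.
results = {
    "ids": [["id2"]],
    "distances": [[0.8069301843643188]],
    "metadatas": [[{"category": "vehicle"}]],
    "embeddings": None,
    "documents": [["This is a document about car"]],
    "uris": None,
    "data": None,
}

def flatten_hits(results, query_index=0):
    """Combine the parallel lists for one query into a list of per-hit dicts."""
    hits = []
    for doc_id, distance, metadata, document in zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["metadatas"][query_index],
        results["documents"][query_index],
    ):
        hits.append(
            {"id": doc_id, "distance": distance, "metadata": metadata, "document": document}
        )
    return hits

for hit in flatten_hits(results):
    print(f"{hit['id']}  distance={hit['distance']:.4f}  {hit['document']}")
```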

The database above is held in memory, so it is quick and simple to set up.
To build a collection from your own documents, create a folder named 'vbdir' in your Colab session (the code below reads from /content/vbdir) and add about five .txt files to it.

Click +Code and add this code to read the files from the folder:

import os

def read_files_from_folder(folder_path):
    file_data = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".txt"):
            with open(os.path.join(folder_path, file_name), 'r') as file:
                content = file.read()
                file_data.append({"file_name": file_name, "content": content})

    return file_data

folder_path = "/content/vbdir"  # your folder path
file_data = read_files_from_folder(folder_path)

for data in file_data:
    print(f"File Name: {data['file_name']}")
    print(f"Content: {data['content']}\n")
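If you want to check the reader before uploading your own files, the following sketch writes a few placeholder files into a temporary folder and reads them back. The file names and text are made up for the test. The reader here is a slight variation of the one above: it sorts the file names and sets the encoding explicitly, which are my own additions.

```python
import os
import tempfile

def read_files_from_folder(folder_path):
    """Read every .txt file in the folder, in sorted order."""
    file_data = []
    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith(".txt"):
            path = os.path.join(folder_path, file_name)
            with open(path, "r", encoding="utf-8") as file:
                file_data.append({"file_name": file_name, "content": file.read()})
    return file_data

# Placeholder files (hypothetical names and text, for testing only).
sample_files = {
    "sample_a.txt": "Placeholder text A.",
    "sample_b.txt": "Placeholder text B.",
    "notes.md": "This file should be skipped because it is not .txt.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in sample_files.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    loaded = read_files_from_folder(tmp_dir)

print([item["file_name"] for item in loaded])
```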

Click +Code and add this code to build a collection from the files:

documents = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

pet_collection = client.create_collection("pet_collection")

pet_collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
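The three lists passed to add() have to line up: one metadata entry and one id per document, with every id a unique string. A small check like the sketch below can catch mistakes before the collection is built. The helper check_collection_inputs is hypothetical, and the file_data list here is a made-up stand-in for the one read from the folder.

```python
def check_collection_inputs(documents, metadatas, ids):
    """Raise ValueError if the three lists do not line up."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas and ids must have the same length")
    if not all(isinstance(doc_id, str) for doc_id in ids):
        raise ValueError("ids must be strings")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")

# Hypothetical stand-in for the file_data list built earlier.
file_data = [
    {"file_name": "sample_a.txt", "content": "Placeholder text A."},
    {"file_name": "sample_b.txt", "content": "Placeholder text B."},
]

documents = [data["content"] for data in file_data]
metadatas = [{"source": data["file_name"]} for data in file_data]
ids = [str(index + 1) for index, _ in enumerate(file_data)]

check_collection_inputs(documents, metadatas, ids)  # passes silently
print(ids)
```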

Click +Code and add this code to query the collection:

results = pet_collection.query(
    query_texts=["What are the different kinds of pets people commonly own?"],
    n_results=1
)
print(results)

Click +Code and add this code to display the metadata list:

metadatas

Click +Code and add this code to run a new query with filters:

# Only consider documents whose text contains the word "reptiles"
results = pet_collection.query(
    query_texts=["What are the emotional benefits of owning a pet?"],
    n_results=1,
    where_document={"$contains":"reptiles"}
)
print(results)


results = pet_collection.query(
    query_texts=["What are the emotional benefits of owning a pet?"],
    n_results=1,
    where={"source": "Training and Behaviour of Pets.txt"}
)
print(results)
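Conceptually, the two filters narrow the set of candidate documents, and the nearest match is then chosen from what remains: where compares metadata values, and where_document with $contains checks for a substring in the document text. The sketch below imitates only these two simple forms in plain Python. It is an illustration, not how Chroma implements filtering. The helper apply_filters, the second file name and the placeholder texts are all hypothetical.

```python
def apply_filters(records, where=None, where_document=None):
    """Keep records that match an equality `where` and a `$contains` text filter."""
    kept = []
    for record in records:
        # Metadata filter: every key in `where` must match exactly.
        if where is not None and any(
            record["metadata"].get(key) != value for key, value in where.items()
        ):
            continue
        # Document filter: the substring must appear in the text (case-sensitive here).
        if where_document is not None and where_document["$contains"] not in record["document"]:
            continue
        kept.append(record)
    return kept

records = [
    {
        "id": "1",
        "metadata": {"source": "Training and Behaviour of Pets.txt"},
        "document": "Placeholder text about training dogs and cats.",
    },
    {
        "id": "2",
        "metadata": {"source": "hypothetical_other_file.txt"},
        "document": "Placeholder text about reptiles and birds.",
    },
]

by_text = apply_filters(records, where_document={"$contains": "reptiles"})
by_source = apply_filters(records, where={"source": "Training and Behaviour of Pets.txt"})
print([record["id"] for record in by_text])
print([record["id"] for record in by_source])
```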

Click +Code and add this code to use a different embedding model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L3-v2')

documents = []
embeddings = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    embedding = model.encode(data['content']).tolist()
    embeddings.append(embedding)
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

Click +Code and add this code to create a new collection with the precomputed embeddings:

pet_collection_emb = client.create_collection("pet_collection_emb")

pet_collection_emb.add(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=ids
)

Click +Code and add this code to search again:

query = "What are the different kinds of pets people commonly own?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
print(results)

Click +Code and add this code to ask which foods are recommended for dogs:

query = "foods that are recommended for dogs?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
print(results)

Some more references: https://blog.futuresmart.ai/chromadb-an-open-source-vector-embedding-database
