AMET-SOLID

Monday, 8 April 2024

ANOVA - Primer

Analysis of Variance

What is ANOVA

Analysis of Variance (ANOVA) is a statistical test for detecting differences in group means when there is one parametric dependent variable and one or more independent variables.

We are interested in determining whether differences exist between the population means.

Types of ANOVA

1-way and 2-way

1-way:

- 1 dependent variable and 1 independent variable (factor)

2-way:

- 1 dependent variable and 2 independent variables (factors)

Key terms

Null Hypothesis H0

- General statement that there is no relationship between two measured phenomena, or no association among groups. In ANOVA, it states that all group means are equal.

Alternative Hypothesis HA

- Contrary to the null hypothesis, it states that there is an effect or a difference. In ANOVA, it states that at least one group mean differs from the others.

P Value

- The probability of finding the observed, or more extreme, results when the null hypothesis of a study question is true.

Alpha Value

- Criterion for determining whether a test statistic is statistically significant. The null hypothesis is rejected when the p value is below alpha (commonly 0.05).

F - statistic

- Ratio of the variance between the group means to the variance within the groups.

Sum of Squares

- Sum of squared deviations from a mean. It measures the variation in the data, for example across different medical trials.

Mean

- Average of all the results, for example from medical trials.

How we can use ANOVA

ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the different samples differ from the overall mean of the dependent variable.

If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.

F - Statistic

Value you get when you run the ANOVA test to find out if the means of two or more populations are significantly different.

If the between-group variance is large relative to the within-group variance, the F statistic will be large. When it exceeds the critical value, the group means are significantly different.

ANOVA formula

The ANOVA formula is made up of numerous parts. The best way to tackle an ANOVA test problem is to organize the formulae inside an ANOVA table. Below are the ANOVA formulae.

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Squares	F Value
Between Groups	SSB = Σ nj(X̄j – X̄)²	df1 = k – 1	MSB = SSB / (k – 1)	f = MSB / MSE (also written F = MST / MSE)
Error	SSE = ΣΣ (Xij – X̄j)²	df2 = N – k	MSE = SSE / (N – k)
Total	SST = SSB + SSE	df3 = N – 1
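With explicit indices, where $X_{ij}$ is observation i in group j, the standard one-way ANOVA quantities can be written as below. This is only a notational restatement, assuming k groups of sizes $n_j$ and N observations in total.

```latex
\begin{aligned}
SSB &= \sum_{j=1}^{k} n_j \left(\bar{X}_j - \bar{X}\right)^2, \\
SSE &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2, \\
SST &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}\right)^2 = SSB + SSE, \\
F &= \frac{MSB}{MSE} = \frac{SSB / (k - 1)}{SSE / (N - k)}.
\end{aligned}
```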

where,

F = ANOVA coefficient (F statistic)
MSB = Mean sum of squares between the groups (also written MST, mean square due to treatment)
MSW = Mean sum of squares within the groups (same as MSE)
MSE = Mean sum of squares due to error
SST = Total sum of squares
k = Total number of groups (populations), written p in Example 2
n = Number of observations in a group (nj for group j)
X̄j = Mean of group j
X̄ = Overall mean of all observations
SSW = Sum of squares within the groups (same as SSE)
SSB = Sum of squares between the groups
SSE = Sum of squares due to error
s = Standard deviation of the samples
N = Total number of observations
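The quantities defined above can be sketched in plain Python. This is a minimal illustration, not a replacement for a statistics package. The function name one_way_anova and the small data set are made up for the example.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of groups of numbers."""
    k = len(groups)                          # number of groups
    N = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: n_j * (group mean - grand mean)^2
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error (within-group) sum of squares: (observation - its group mean)^2
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df1, df2 = k - 1, N - k
    msb, mse = ssb / df1, sse / df2
    return {"SSB": ssb, "SSE": sse, "SST": ssb + sse,
            "df1": df1, "df2": df2, "MSB": msb, "MSE": mse, "F": msb / mse}


# Illustrative data only (not taken from the examples in this post)
demo = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(demo)
for name, value in result.items():
    print(name, "=", round(value, 4))
```

The returned F value is then compared with the critical value from an F table at the chosen alpha, using df1 and df2.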

Worked Examples

Example 1: Three different kinds of food are tested on three groups of rats for 5 weeks. The objective is to check the difference in mean weight (in grams) of the rats per week. Apply one-way ANOVA using a 0.05 significance level to the following data:

Food I	Food II	Food III
8	4	11
12	5	8
19	4	7
8	6	13
6	9	7
11	7	9

Solution:

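H0: μ1 = μ2 = μ3
H1: The means are not equal
Group totals are 64, 35 and 55, so X̄1 = 10.67, X̄2 = 5.83, X̄3 = 9.17
Total mean = X̄ = 154/18 = 8.56
n1 = n2 = n3 = 6, k = 3, N = 18
SSB = 6(10.67 – 8.56)² + 6(5.83 – 8.56)² + 6(9.17 – 8.56)² = 73.44 (using unrounded means)
SSE = 155
df1 = k – 1 = 2, df2 = N – k = 15
MSB = SSB/df1 = 36.72
MSE = SSE/df2 = 10.33
f = MSB/MSE = 36.72/10.33 = 3.55
The critical value is F(0.05, 2, 15) = 3.68. Since f < F, the null hypothesis cannot be rejected at the 0.05 level.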

Example 2: Calculate the ANOVA coefficient for the following data:

Plant	Number	Average span	s
Hibiscus	5	12	2
Marigold	5	16	1
Rose	5	20	4

Solution:

Plant	n	x	s	s²
Hibiscus	5	12	2	4
Marigold	5	16	1	1
Rose	5	20	4	16

p = 3
n = 5
N = 15
Overall mean x̄ = (12 + 16 + 20)/3 = 16
Here SST and MST denote the treatment (between-group) sum of squares and mean square, i.e. SSB and MSB.
SST = Σn(x − x̄)²
SST = 5(12 − 16)² + 5(16 − 16)² + 5(20 − 16)² = 160
MST = SST/(p − 1) = 160/(3 − 1) = 80
SSE = Σ(n − 1)s² = 4(4) + 4(1) + 4(16) = 84
MSE = SSE/(N − p) = 84/12 = 7
F = MST/MSE = 80/7

F = 11.429
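When only the group sizes, means and standard deviations are available, as in Example 2, the F value can still be computed. The sketch below assumes the standard deviations are sample standard deviations (divisor n − 1). The function name anova_from_summary is illustrative.

```python
def anova_from_summary(sizes, means, sds):
    """One-way ANOVA F statistic from per-group size, mean and standard deviation."""
    k = len(sizes)
    N = sum(sizes)
    # Grand mean is the size-weighted average of the group means
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / N
    # Between-group (treatment) sum of squares
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Within-group (error) sum of squares, recovered from the sample variances
    sse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    msb = ssb / (k - 1)
    mse = sse / (N - k)
    return ssb, sse, msb / mse


# Plant data from Example 2: Hibiscus, Marigold, Rose
ssb, sse, f_value = anova_from_summary([5, 5, 5], [12, 16, 20], [2, 1, 4])
print(ssb, sse, round(f_value, 3))
```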

Example 3: The following data show the number of worms quarantined from the GI areas of four groups of muskrats in a carbon tetrachloride anthelmintic study. Conduct an ANOVA test.

I	II	III	IV
338	412	124	389
324	387	353	432
268	400	469	255
147	233	222	133
309	212	111	265

Solution:

The data contain a single factor (the group), so a one-way ANOVA across the four groups applies.

Group means: X̄I = 277.2, X̄II = 328.8, X̄III = 255.8, X̄IV = 294.8. Overall mean X̄ = 289.15, k = 4, N = 20.

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Square
Between the groups	14295.35	3	4765.12
Within the groups	212865.20	16	13304.08
Total	227160.55	19

Since F = MST / MSE
= 4765.12 / 13304.08 = 0.36

The critical value is F(0.05, 3, 16) = 3.24. As F < 3.24, the null hypothesis of equal group means cannot be rejected.

Example 4: Three types of fertilizers are used on three groups of plants for 5 weeks. We want to check if there is a difference in the mean growth of each group. Using the data given below, apply a one-way ANOVA test at the 0.05 significance level.

Fertilizer 1	Fertilizer 2	Fertilizer 3
6	8	13
8	12	9
4	9	11
5	11	8
3	6	7
4	8	12

Solution:

$H_{0}$ : $μ_{1}$ = $μ_{2}$ = $μ_{3}$

$H_{1}$ : The means are not equal

Fertilizer 1	Fertilizer 2	Fertilizer 3
6	8	13
8	12	9
4	9	11
5	11	8
3	6	7
4	8	12
${\bar{X}}_{1}$ = 5	${\bar{X}}_{2}$ = 9	${\bar{X}}_{3}$ = 10

Total mean, $\bar{X}$ = 8

$n_{1}$ = $n_{2}$ = $n_{3}$ = 6, k = 3

SSB = 6(5 - 8)² + 6(9 - 8)² + 6(10 - 8)²

= 84

df1 = k - 1 = 2

Fertilizer 1	(X - 5)²	Fertilizer 2	(X - 9)²	Fertilizer 3	(X - 10)²
6	1	8	1	13	9
8	9	12	9	9	1
4	1	9	0	11	1
5	0	11	4	8	4
3	4	6	9	7	9
4	1	8	1	12	4
${\bar{X}}_{1}$ = 5	Total = 16	${\bar{X}}_{2}$ = 9	Total = 24	${\bar{X}}_{3}$ = 10	Total = 28

SSE = 16 + 24 + 28 = 68

N = 18

df2 = N - k = 18 - 3 = 15

MSB = SSB / df1 = 84 / 2 = 42

MSE = SSE / df2 = 68 / 15 = 4.53

ANOVA test statistic, f = MSB / MSE = 42 / 4.533 = 9.26

Using the f table at $α$ = 0.05 the critical value is given as F(0.05, 2, 15) = 3.68

As f > F, the null hypothesis is rejected and it can be concluded that there is a difference in the mean growth of the plants.

Answer: Reject the null hypothesis
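The same decision can be reached by comparing a p value with alpha. As a minimal sketch: when df1 = 2 (three groups), the upper-tail probability of the F distribution has the closed form P(F > f) = (1 + 2f/df2)^(−df2/2), so no statistics library is needed. The function name is illustrative, and the shortcut holds only for df1 = 2. For other cases a library routine such as scipy.stats.f.sf would be the usual choice.

```python
def f_pvalue_df1_equals_2(f_stat, df2):
    """Upper-tail probability P(F > f_stat) for an F(2, df2) distribution."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


alpha = 0.05
ssb, sse = 84, 68          # sums of squares from Example 4
df1, df2 = 2, 15
f_stat = (ssb / df1) / (sse / df2)
p_value = f_pvalue_df1_equals_2(f_stat, df2)

print(f"f = {f_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

A p value below alpha corresponds to f exceeding the tabulated critical value, so both routes give the same conclusion.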

Sunday, 7 April 2024

Vector Data Base - Primer using Chroma DB

What is a Vector database

A vector database is a kind of database that is designed to store, index and retrieve data points with multiple dimensions, which are commonly referred to as vectors.

Each dimension captures a particular feature. The key concepts are vectors and similarity search. Vectors are numerical arrays. Similarity search uses indexing and search algorithms to find the vectors most relevant to the query.

A vector database stores embeddings, which are numerical codes that encapsulate the key characteristics of the object of interest.

Difference between traditional and Vector database

A traditional database is based on scalar data types and a two-dimensional, table-like structure of rows and columns. It is optimized for transactional data and exact-match search queries.

A vector database is based on vector data types in a multi-dimensional space. It is optimized for AI and ML applications and uses similarity search, including semantic search (for example with cosine similarity).
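To make the idea of similarity search concrete, here is a minimal brute-force sketch using cosine similarity in plain Python. The three-dimensional vectors and the document ids are invented for illustration. Real embeddings have hundreds of dimensions and come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, store):
    """Return the (id, similarity) pair of the stored vector closest to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 3-dimensional embeddings, invented for illustration
store = {
    "cat_doc": [0.9, 0.1, 0.0],
    "car_doc": [0.1, 0.9, 0.2],
}
query_vector = [0.0, 1.0, 0.1]   # a made-up query embedding close to "car_doc"

best_id, best_score = most_similar(query_vector, store)
print(best_id, round(best_score, 3))
```

Vector databases typically replace this linear scan with an index so that search stays fast on large collections.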

Benefits:

Fast and Accurate Search
Mostly correct Response
Semantic Understanding

Hands-on

Sign in to Colab using your Google account

Create a new notebook with the name demo_chromadb

Click +Code

Add this code to install chromadb and sentence-transformers

!pip install chromadb -q
!pip install sentence-transformers -q

Click +Code, add this code to create a client

import chromadb

client = chromadb.Client()
collection = client.create_collection("aucse_demo")

Click +Code, add this code to add documents

collection.add(
    documents=["This is a document about cat", "This is a document about car"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],
    ids=["id1", "id2"]
)

Click +Code, add this code to make a semantic query

results = collection.query(
    query_texts=["vehicle"],
    n_results=1
)
print(results)

You will get output like

{'ids': [['id2']], 'distances': [[0.8069301843643188]],
 'metadatas': [[{'category': 'vehicle'}]],
 'embeddings': None,
 'documents': [['This is a document about car']],
 'uris': None, 'data': None
}

The above DB is in memory, so it is very easy and simple.
To query your own documents, create a folder called 'vbdir' in your notebook and add some .txt files (five, for example).

Click +Code, add this code to read the files from the folder

import os

def read_files_from_folder(folder_path):
    file_data = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".txt"):
            with open(os.path.join(folder_path, file_name), 'r') as file:
                content = file.read()
                file_data.append({"file_name": file_name, "content": content})

    return file_data

folder_path = "/content/vbdir"  # your folder path
file_data = read_files_from_folder(folder_path)

for data in file_data:
    print(f"File Name: {data['file_name']}")
    print(f"Content: {data['content']}\n")

Click +Code, add this code to build a collection from the files

documents = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

pet_collection = client.create_collection("pet_collection")

pet_collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

Click +Code, add this code to get results

results = pet_collection.query(
    query_texts=["What are the different kinds of pets people commonly own?"],
    n_results=1
)
print(results)

Click +Code, add this code to get metadata

metadatas

Click +Code, add this code to get results for a new query with filters

# Keep only documents whose text contains the word "reptiles"
results = pet_collection.query(
    query_texts=["What are the emotional benefits of owning a pet?"],
    n_results=1,
    where_document={"$contains":"reptiles"}
)
print(results)

# Keep only documents whose 'source' metadata matches this file name
results = pet_collection.query(
    query_texts=["What are the emotional benefits of owning a pet?"],
    n_results=1,
    where={"source": "Training and Behaviour of Pets.txt"}
)
print(results)

Click +Code, add this code to use a different embedding model

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L3-v2')

documents = []
embeddings = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    embedding = model.encode(data['content']).tolist()
    embeddings.append(embedding)
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

Click +Code, add this code to create a new collection with the precomputed embeddings

pet_collection_emb = client.create_collection("pet_collection_emb")

pet_collection_emb.add(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=ids
)

Click +Code, add this code to search again

query = "What are the different kinds of pets people commonly own?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
print(results)

Code to make a query about what foods are recommended for dogs

query = "foods that are recommended for dogs?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
print(results)

Some more references: https://blog.futuresmart.ai/chromadb-an-open-source-vector-embedding-database
