[MUSIC]

In this session, we're going to introduce cosine

similarity as a proximity measure between two vectors:

how we look at the cosine similarity between two vectors, and how it is defined.

So we can take a text document as an example.

A text document can be represented by a bag of words or

more precisely, a bag of terms.

Each document can be represented as a long vector,

each attribute recording the frequency of a particular term.

The term can be a word or it can be a phrase.
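
As a minimal sketch of this representation, the Python below counts how often each term of a fixed term list occurs in a document. The function name `term_frequency_vector`, the term list, and the sample text are all hypothetical, and splitting on whitespace is a simplifying assumption that handles single words only, not phrases.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Return the count of each vocabulary term in text, in vocabulary order."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    counts = Counter(text.lower().split())
    # A Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "hockey", "score"]   # hypothetical term list
document = "team coach team score hockey team"      # hypothetical document
print(term_frequency_vector(document, vocabulary))  # [3, 1, 1, 1]
```

Each position in the resulting list is one attribute of the long vector, so two documents counted against the same term list can be compared position by position.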

For example, a 5 here means the term "team" occurs in Document 1 five times.

So, we want to compare the similarity between Document 1 and Document 2.

So how similar are they?

We could use cosine similarity to do that.

Other vector objects, like gene features in microarrays, can be

represented in a similar way, as a long vector.

Information retrieval, biological taxonomy,

and gene feature mapping, such as microarray analysis,

are all good applications for comparing the similarity between two vectors.

The cosine measure is defined as follows.

For example,

we can take two term-frequency vectors and look at their similarity.

It is defined as the dot product of these two

vectors divided by the product of their lengths.
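
The definition can be sketched directly in Python. The function name `cosine_similarity` is chosen here for illustration, and raising an error for an all-zero vector is an assumption, since the measure is undefined when either length is zero.

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)"""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same number of attributes")
    # Dot product: sum of the products of matching attributes.
    dot = sum(a * b for a, b in zip(d1, d2))
    # Length of a vector: square root of the sum of its squared attributes.
    len_d1 = math.sqrt(sum(a * a for a in d1))
    len_d2 = math.sqrt(sum(b * b for b in d2))
    if len_d1 == 0 or len_d2 == 0:
        # Assumption: treat an all-zero vector as an error rather than return 0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (len_d1 * len_d2)
```

For term-frequency vectors, whose attributes are never negative, the result lies between 0 (no terms in common) and 1 (same direction), regardless of how long the documents are.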

So let's look at the cosine similarity definition and work through an example.

Suppose d sub 1 and d sub 2 are the vectors of

these two documents. Then we can calculate

their dot product as follows.

Then we can calculate the length of each vector. d sub 1's length

is calculated using this formula.

d sub 2's length

can also be calculated as the square root of the sum of its squared components.

Then the cosine similarity can be calculated using the formula given above.

We can see their cosine similarity is 0.94,

which simply says these two documents are quite similar.
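
The three steps of the calculation can be traced in a short Python sketch. The two vectors are shown on the slide and do not appear in this transcript, so the values below are assumptions, chosen so that the first attribute of Document 1 is 5 (the term "team") and the result rounds to 0.94.

```python
import math

# Assumed term-frequency vectors (not given in the transcript).
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

# Step 1: dot product of the two vectors.
dot = sum(a * b for a, b in zip(d1, d2))       # 25

# Step 2: length of each vector (square root of the sum of squares).
len_d1 = math.sqrt(sum(a * a for a in d1))     # sqrt(42), about 6.48
len_d2 = math.sqrt(sum(b * b for b in d2))     # sqrt(17), about 4.12

# Step 3: dot product divided by the product of the lengths.
similarity = dot / (len_d1 * len_d2)
print(round(similarity, 2))  # 0.94
```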

[MUSIC]