0:00

[MUSIC]

In this session, we're going to introduce cosine similarity as a measure of how close two vectors are, and look at how the cosine similarity between two vectors is defined.

We can take a text document as an example. A text document can be represented by a bag of words, or more precisely, a bag of terms. Each document becomes a long vector, with each attribute recording the frequency of a particular term. A term can be a single word or a phrase. For example, a value of 5 means the term "team" occurs in Document 1 five times.
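As a quick sketch of this representation, here is a minimal Python function that builds a term-frequency vector over a fixed vocabulary. The vocabulary and the sample document are made-up illustrations, not the lecture's slide data:

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocab = ["team", "coach", "hockey", "ball", "score"]
doc1 = "team coach team hockey team ball team score team"
print(term_frequency_vector(doc1, vocab))  # [5, 1, 1, 1, 1]
```

Here "team" appears five times in the sample text, so the first attribute of the vector is 5, matching the kind of entry described above.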

Now suppose we want to compare Document 1 and Document 2: how similar are they? We can use cosine similarity to do that.

Other vector objects, like gene features in microarrays, can be represented in the same way as long vectors. Information retrieval, biological taxonomy, and gene feature mapping (as in microarray analysis) are all good applications for comparing the similarity of two vectors.

The cosine measure is defined as follows. For example, we can compare two documents by looking at their term-frequency vectors. Their cosine similarity is the dot product of the two vectors divided by the product of their lengths.
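This definition can be written directly in Python. The following is a minimal sketch of the formula, not part of the lecture materials:

```python
import math

def cosine_similarity(d1, d2):
    # Dot product of the two term-frequency vectors.
    dot = sum(x * y for x, y in zip(d1, d2))
    # Euclidean length (L2 norm) of each vector.
    len1 = math.sqrt(sum(x * x for x in d1))
    len2 = math.sqrt(sum(x * x for x in d2))
    return dot / (len1 * len2)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```

Identical directions give a similarity of 1, and vectors with no terms in common give 0; the example above falls in between at 0.5.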

Let's take the cosine similarity definition and work through an example. Suppose d sub 1 and d sub 2 are the term-frequency vectors of the two documents. First, we calculate their dot product. Then we calculate the length of each vector: d sub 1's length is the square root of the sum of the squares of its components, and d sub 2's length is calculated the same way. Finally, the cosine similarity is obtained from the formula given above.

We can see their cosine similarity is 0.94, which tells us these two documents are quite similar.
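To make the arithmetic concrete, here is a sketch of the calculation in Python. The exact term-frequency vectors are an assumption (the lecture shows them only on the slide), chosen here so that the computation reproduces the 0.94 result:

```python
import math

# Hypothetical term-frequency vectors for Document 1 and Document 2;
# the specific values are an assumption, not taken from the transcript.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(x * y for x, y in zip(d1, d2))   # 5*3 + 3*2 + 2*1 + 2*1 = 25
len1 = math.sqrt(sum(x * x for x in d1))   # sqrt(42)
len2 = math.sqrt(sum(x * x for x in d2))   # sqrt(17)
print(round(dot / (len1 * len2), 2))       # 0.94
```

The dot product rewards terms the documents share, while the lengths normalize away document size, so a long document and a short one on the same topic can still score close to 1.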

Â [MUSIC]
