This course provides an introduction to the field of Natural Language Processing. It includes relevant background material in Linguistics, Mathematics, Probability, and Computer Science. Some of the topics covered in the class are Text Similarity, Part of Speech Tagging, Parsing, Semantics, Question Answering, Sentiment Analysis, and Text Summarization.
The course includes quizzes, programming assignments in Python, and a final exam.
Course Syllabus
Week One (Introduction 1/2) (1:35:31)
Week Two (Introduction 2/2) (1:36:26)
Week Three (NLP Tasks and Text Similarity) (1:42:52)
Week Four (Syntax and Parsing, Part 1) (1:48:14)
Week Five (Syntax and Parsing, Part 2) (1:50:29)
Week Six (Language Modeling and Word Sense Disambiguation) (1:40:33)
Week Seven (Part of Speech Tagging and Information Extraction) (1:33:21)
Week Eight (Question Answering) (1:16:59)
Week Nine (Text Summarization) (1:33:55)
Week Ten (Collocations and Information Retrieval) (1:29:40)
Week Eleven (Sentiment Analysis and Semantics) (1:09:38)
Week Twelve (Discourse, Machine Translation, and Generation) (1:30:57)
The course assignments will all be in Python.
Course Format
The class will consist of lecture videos, which are typically between 10 and 25 minutes in length. The lectures contain 1-2 integrated quiz questions per video. Grading is based on three programming assignments, weekly quizzes, and a final exam.

From the lesson

Week Three: NLP Tasks and Text Similarity

Week Three will cover Vector Semantics, Text Similarity, and Dimensionality Reduction. I will also go through a long list of sample NLP tasks (e.g., Information Extraction, Text Summarization, and Semantic Role Labeling) and introduce each of them briefly.

Professor of Information, School of Information; Professor of Electrical Engineering and Computer Science, College of Engineering; and Professor of Linguistics, College of Literature, Science, and the Arts

So the next topic in text similarity is called dimensionality reduction.

The motivation behind dimensionality reduction is that a document is often not about many different topics but about a relatively small number of topics, certainly smaller than the number of words that appear in it. So if we can somehow collapse some of the words into semantic categories, we may be able to find better similarity measures.

For example, if we can collapse all the documents about patients, doctors, and hospitals into one place, we may consider them to be semantically similar, even though they contain different words from that set.

So the simpler vector approaches to similarity that we have looked at so far have a few problems.

In many cases there is polysemy; that's why the actual similarity between the words is smaller than what the cosine similarity would make you believe. That includes words with multiple senses, such as bar, bank, jaguar, or hot.

The opposite effect is due to synonymy; that's when the actual similarity between the words is larger than what the cosine similarity would make you believe. A word like building is a synonym of edifice. They're different words, and therefore they're going to appear in different contexts, but even though they appear in different contexts, they have the same meaning. So we want somehow to recognize those synonyms so that we don't underestimate their cosine similarity.

The matrices between words and documents are, in general, very sparse, and they need to be processed through dimensionality reduction so that we can find some hidden semantic dimensions in them.

Let's look at some examples from the natural language processing literature. The example on the left shows a multiple-choice question from the TOEFL test: you're given a word, specifically levied, and you're asked which word is most similar to it. The choices are A, imposed; B, believed; C, requested; and D, correlated. The answer is that the most similar word to levied is imposed.

So this is the kind of semantic relationship that we want to discover in text, and this is where latent semantic analysis, the technique that we're going to introduce today, is going to help us. The same technique is also used to identify similar analogies.

An example from the SAT test: you're given the pair mason to stone, and you're asked which of the following five choices represents the most similar relationship to the one between mason and stone. The choices are A, teacher to chalk; B, carpenter to wood; C, soldier to gun; D, photograph to camera; and E, book to word. The correct answer here is carpenter to wood, because just as a mason is a person who works stone, a carpenter is a person who works wood. All the other analogies are different.

So it turns out that this problem of analogy similarity can also be solved by dimensionality reduction techniques, and a lot of this work was actually done by Peter Turney in his papers from the last ten years.

So let's now consider dimensionality reduction in more detail. The purpose of this method is to look for hidden similarities in data.

It is based on matrix decomposition, and I'm going to introduce it with an example from a high school where people measured the heights and weights of the students. The scatter plot on the left shows the different students who were measured; the x-axis represents, let's say, the height, and the y-axis represents the weight.

Now, we can find the regression line that explains this data best; it is shown in the middle as a red line that runs diagonally. It turns out that there is a third variable, in addition to height and weight, that can explain the differences in height and weight between the different students, and that is exactly what our regression line shows. This line corresponds to the dimension of age.

So it turns out that the sample of students in that high school were not all of the same age; there were students from across all the different classes and age groups. Obviously, the students in the lower grades were both lighter and shorter than the students in the upper grades.

But if we collapse each of those points onto the diagonal axis that represents the regression line, you will see that there is actually a very nice trend: students who are older are both taller and heavier than younger students.

So in this process we're not losing much information. It turns out that we can replace height and weight with age and retain most of the information that appears in the data set on the left. And whatever differs between the two representations tells us how a particular student deviates from the trend: a person who is too tall for their age, or somebody who is too short for their age.

So what we did here was to reduce the dimensionality of our data set from two dimensions to one.

So how is this done in practice? Well, let's go back to a little bit of linear algebra. We need to remember how vectors and matrices work in order to understand dimensionality reduction.

A matrix is an m x n table of objects; in our case, those objects are numbers. Each row and each column of a matrix is a vector. Matrices of compatible dimensions (there is a special definition of compatibility in this context) can be multiplied together.

If you don't remember this kind of math, you should go and visit a website that explains how to multiply matrices, then come back here and try to multiply the two matrices in the example below. On the next slide, I'm going to show you the answer.

The way that matrices are multiplied is very simple. You multiply the values in the first row by the values in the first column, and then you add them up, and the result goes into the cell of the product that corresponds to the first row and the first column. So the first entry of the product is 1 x 2 + 2 x 1 + 4 x (-1), for a total of 0. The second is 2 x 2 + 5 x 1 + 7 x (-1), which is 2. And finally, 4 x 2 + 9 x 1 + 14 x (-1) is equal to 3. So the product of those two matrices is the vector (0, 2, 3).
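The arithmetic above can be checked in a few lines of Python (the language used for the course assignments); the matrix and vector below are taken directly from the worked example:

```python
# Multiply the 3x3 matrix from the worked example by the column vector (2, 1, -1).
A = [[1, 2, 4],
     [2, 5, 7],
     [4, 9, 14]]
v = [2, 1, -1]

# Each entry of the product is the dot product of one row of A with v.
product = [sum(a * b for a, b in zip(row, v)) for row in A]
print(product)  # [0, 2, 3]
```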

Now, one other important concept of linear algebra related to dimensionality reduction is that of eigenvectors and eigenvalues.

An eigenvector is an implicit direction for a matrix: if we multiply a vector v, which is the eigenvector, on the right-hand side of a matrix A, we obtain the same result as if we had multiplied v by lambda, a scalar called an eigenvalue; that is, Av = lambda v. In principle, the eigenvalue lambda can be any complex number, but for the examples that we are going to look at, it will always be real.

To compute the eigenvalues of a matrix, we use the following equation: we find the determinant of A minus lambda I, where I is an identity matrix of the same size as A, and we set this determinant equal to 0. (We want A to be a square matrix, obviously, so that we can compute the determinant.)

So let's look at an example. If our matrix A is (-1, 3) in the first row and (2, 0) in the second, then A minus lambda I is the matrix shown on the right: -1 minus lambda in the first row, followed by a 3, and then 2 and minus lambda in the second row.

If we want to compute the determinant of this matrix, we take the product of the numbers on the forward diagonal and subtract the product of the numbers on the backward diagonal. That gives us (-1 - lambda)(-lambda) - 3 x 2 = 0. If we solve this quadratic equation, which is lambda squared + lambda - 6 = 0, we see that it has two roots: lambda 1 = 2 and lambda 2 = -3.

So if we pick one of those eigenvalues, say lambda 1 = 2, and substitute it into the equation at the top, we get a new matrix, (-3, 3; 2, -2), which then has to be multiplied by the eigenvector (x1, x2), and we want the result to be equal to 0. If you solve this system of equations, you will find that its answer is x1 = x2. So any two-dimensional vector whose x and y coordinates are the same will satisfy this equation.
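The hand computation can be cross-checked numerically; here is a small sketch using numpy (an assumption of this sketch, not a tool used in the lecture):

```python
import numpy as np

# The 2x2 matrix from the example: A = (-1, 3; 2, 0).
A = np.array([[-1.0, 3.0],
              [2.0, 0.0]])

# numpy solves det(A - lambda*I) = 0 for us.
eigenvalues, eigenvectors = np.linalg.eig(A)

# The roots of lambda^2 + lambda - 6 = 0 are 2 and -3.
print(sorted(float(x) for x in np.round(eigenvalues.real)))  # [-3.0, 2.0]

# The eigenvector for lambda = 2 has equal coordinates (x1 = x2).
v = eigenvectors[:, int(np.argmax(eigenvalues.real))]
print(bool(np.isclose(v[0], v[1])))  # True
```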

So if a matrix S is square, it can be decomposed into U lambda U inverse, where U is its matrix of eigenvectors and lambda is a diagonal matrix of eigenvalues. You can write this in different, mathematically equivalent ways: S U = U lambda, U inverse S U = lambda, and S = U lambda U inverse.

So here's an example. If the original matrix S = (2, 1; 1, 2), its two eigenvalues are lambda 1 = 1 and lambda 2 = 3. If we do this decomposition, we get that U is equal to the matrix (1, 1; -1, 1), and its inverse U^-1 is equal to the matrix (1/2, -1/2; 1/2, 1/2). You can verify for yourselves that we can indeed recover the original matrix S by multiplying U with lambda and then with U inverse: do this multiplication in sequence and you will obtain the original matrix S.
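The decomposition S = U lambda U inverse can be verified the same way; again, numpy is an assumption of this sketch rather than part of the lecture:

```python
import numpy as np

# The symmetric matrix from the example.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns the eigenvalues and a matrix U whose columns are eigenvectors.
eigenvalues, U = np.linalg.eig(S)
Lambda = np.diag(eigenvalues)

# Multiplying U, Lambda, and U inverse in sequence recovers S.
S_rebuilt = U @ Lambda @ np.linalg.inv(U)
print(bool(np.allclose(S, S_rebuilt)))  # True
print(sorted(float(x) for x in np.round(eigenvalues)))  # [1.0, 3.0]
```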

So what happened here is that we are now able to convert the original matrix into a new space of eigenvectors, and each point in the original space is represented as a point in this new representation. In the case of weight and height, you would have one new dimension that corresponds to age, and another dimension that corresponds to the deviation a certain person has, given their weight, from the trend line defined by age.

Now, if the matrix is not square, we have to use a different technique called singular value decomposition.

So why do we care about non-square matrices? Well, because in most NLP and information retrieval tasks, we have matrices of either documents and terms, or words and their context features, and those matrices are not necessarily square. We can have, for example, a vocabulary of size 1 million and a set of 10 million documents, in which case we have a matrix of 1 million by 10 million, which is clearly not square.

So in that case, we use another technique called singular value decomposition, in which our matrix A is represented as the product U sigma V transpose, where U is the matrix of orthogonal eigenvectors of A A transpose, V is the matrix of orthogonal eigenvectors of A transpose A, and the components of sigma are the singular values, the square roots of the eigenvalues of A transpose A. This decomposition exists for all matrices, whether they are dense or sparse.

We can also work out the dimensionality of the different matrices. For example, if A has 5 rows and 3 columns, then the matrix U of the orthogonal eigenvectors of A A transpose will be a 5x5 matrix, and V, the matrix of orthogonal eigenvectors of A transpose A, will have a dimensionality of 3x3.

Now, in Matlab and Octave, you can use a very simple function, svd: give it the matrix as input, and it will return a tuple consisting of U, S, and V, where those match exactly the U, sigma, and V in our example.
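In Python, numpy provides the equivalent of Matlab's svd; here is a minimal sketch with a made-up non-square matrix (the values and shapes are illustrative, not the lecture's example):

```python
import numpy as np

# An arbitrary 3x5 (non-square) matrix.
A = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 1.0]])

# Like Matlab's svd, this returns the three factors of A = U * Sigma * V^T.
# full_matrices=False gives the compact form: U is 3x3, s has 3 values, Vt is 3x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The three factors multiply back to the original matrix.
print(bool(np.allclose(A, U @ np.diag(s) @ Vt)))  # True
```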

Let's look now at a specific example. We have a collection of seven documents and nine terms, and we can look at which terms appear in which documents. So, for example, document 1 contains terms T6 and T9, document 2 contains terms T1 and T2, and so on.

We can also take this posting list and represent the data in the form of a bipartite graph. A bipartite graph has two components, one on the left and one on the right in this example, that correspond to different types of objects. The left-hand side, or left-hand mode, corresponds to the documents, and the right-hand side corresponds to the terms. So we have, for example, a connection between D1 and T6, and between D1 and T9, but we don't have a connection between D1 and T7.

Okay, so now let's see how we can compute the singular value decomposition of an arbitrary matrix. We first represent it in binary form, like the example on the left, where a one means that a certain term appears in a certain document, and a zero means it is absent from that document. Then we need to normalize this matrix by dividing all the values by the length of the column vector they are part of.

That means that in the first column we have two ones; therefore, the length of the vector that corresponds to the first column is going to be the square root of 2. So if we divide each of the values in that column by the square root of 2, we get 0.71 and 0.71. The second column vector has three 1s, so its length is going to be the square root of 3, and if we divide 1 by the square root of 3, we get 0.58. This is the technique we use to compute all the other normalized values. So we have to normalize the matrix before we can compute its singular value decomposition.
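The column normalization step looks like this in Python; the matrix below is an illustrative stand-in, not the exact matrix from the slide:

```python
import numpy as np

# A small binary term-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Divide every value by the Euclidean length of its column vector.
A_norm = A / np.linalg.norm(A, axis=0)

# The first column has two ones, so its entries become 1/sqrt(2), about 0.71;
# the second column has three ones, so its entries become 1/sqrt(3), about 0.58.
print(round(float(A_norm[0, 0]), 2))  # 0.71
print(round(float(A_norm[0, 1]), 2))  # 0.58
```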

Once we do this, we can just enter the matrix in MATLAB or our other favorite software and run the SVD function. We're going to get something like this: U is going to be a 9x9 matrix, V is going to be a 7x7 matrix, and S is going to be a 9x7 matrix that corresponds to the spread along the different axes of the data. The first dimension here has a spread of 1.58; that is the most important dimension in the lower-dimensional representation. The second value is 1.27, and so on; those are the singular values that appear in the sigma matrix.

One thing that we can do here is reconstitute the original matrix A by multiplying U by sigma and V transpose. But we can also produce a different version of A, specifically A*, which is a lower-dimensional version of A: instead of multiplying U with sigma and V transpose, we multiply U with sigma* and V transpose, where sigma* keeps only the largest singular values of the original sigma matrix.

So, going back to the previous slide, we can essentially delete the 5th and 6th lines, the 0.5692 and the 0.1977, and maybe the 0.71, without losing much of the information stored in this matrix, and just keep the four most significant dimensions, the ones that correspond to the four largest values in sigma. This is the Rank-4 approximation of sigma; as you can see, it has a few more zeros than the original matrix. And now, if we multiply U with sigma 4 and V transpose, we get a different representation of A, which is not going to be exactly what we had at the beginning, but it will not be significantly different from A.
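The rank-k truncation can be sketched as follows; the small matrix here is an illustrative stand-in for the slide's 9x7 example:

```python
import numpy as np

def rank_k_approximation(A, k):
    """Rebuild A keeping only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = np.zeros_like(s)
    s_trunc[:k] = s[:k]          # numpy returns s in decreasing order
    return U @ np.diag(s_trunc) @ Vt

# A small binary term-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])

A_star = rank_k_approximation(A, 2)
print(int(np.linalg.matrix_rank(A_star)))  # 2
# A_star stays close to A: the entry-wise error is bounded by the
# discarded singular value.
print(bool(np.abs(A - A_star).max() < 1.0))  # True
```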

Going further, we can now compute the representations of the terms and the documents in the new semantic space, and this is done by the appropriate matrix multiplications, for example U times S4, or S4 times V prime. This gives us the representations of the terms and the documents, respectively, in the new semantic space.

Now we can also take this a step further and compute the Rank-2 approximation of the original matrix by preserving only the two largest values of the sigma matrix, 1.58 and 1.27. Then, if we do the same operations as before, we can compute a new value for A, which is its Rank-2 approximation.

In the Rank-2 approximation, we can use U multiplied by S2 to get the word vector representations in the concept space, now with two concepts. And we can also use S2 times V transpose to find the new concept representations of the documents. So in this case, we again have two dimensions.
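Projecting terms and documents into a two-concept space can be sketched like this; the matrix and variable names are illustrative assumptions, not the slide's data:

```python
import numpy as np

# Illustrative binary term-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep the two largest singular values

# Term vectors in concept space: rows of U_k * S_k.
term_vectors = U[:, :k] * s[:k]
# Document vectors in concept space: columns of S_k * Vt_k, transposed to rows.
doc_vectors = (s[:k, None] * Vt[:k, :]).T

# Terms 1 and 2 have identical rows in A, so they land on the same point.
print(bool(np.allclose(term_vectors[0], term_vectors[1])))  # True
print(term_vectors.shape, doc_vectors.shape)  # (4, 2) (4, 2)
```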

And now here's a slide that summarizes the entire singular value decomposition method. We have the documents shown on the top-left and the terms on the top-right in this new semantic space. As you can see, there are two clusters of both documents and terms; I'm going to focus on one of them. In the bottom-left area of the screen, you can see that T3 and T7 appear near each other, and D6 and D7 also appear near each other.

So we essentially killed two birds with one stone. First, we found that there is some latent semantic similarity between the terms T3 and T7, and also a similarity between the documents D6 and D7. But we were also able to achieve something that we couldn't do in the original space: namely, to find that documents D6 and D7 are similar to terms T3 and T7 in this new concept space.

If you look at the smallest circle, in purple, you will see that D6 is the closest match to T3 and T7, and then the entire cluster of elements in this quadrant, including T3, T7, D6, and T1, are all semantically connected.

This is actually something that you can understand better if you look at the original example, and specifically at its graph representation. You can see that D6, which is one of the documents that appears in the purple cluster here, is in fact represented by T3 and T7. D7 is similar to D6 because it has a lot of terms in common with D6, and if we continue expanding this recursively, you will see that D2 is the next most similar document, followed by D4 and so on. This is exactly the intuition that you get by looking at the graph representation on the right.

And now, for those who have more patience, there are a few more formulas in MATLAB that allow you to translate individual documents and terms to concepts: you just have to multiply either an entire column vector or an entire row vector by the singular value matrix.

So now I have a question for you: if A is a term-to-document matrix, what are A A transpose and A transpose A? Let's start with A A transpose. As you can see, A A transpose is a 9 by 9 matrix, which is not surprising given that the original matrix was 9 by 7 and we multiplied it by its transpose, which is 7 by 9; therefore, the result should be 9 by 9.

Similarly, for A transpose A, we have a matrix that is 7 by 7. Again, this is not surprising, because we multiplied a 7 by 9 matrix by a 9 by 7 matrix. But the dimensionalities of those two products should give you a hint as to what they mean.

It turns out that A transpose A is the document-to-document similarity matrix. For example, it tells you that the similarity between document 1 and document 2 is 0, whereas the similarity between document 1 and document 4 is 0.639. And if you go back to the previous slide, you can see that A A transpose is the term-to-term similarity matrix: the similarity between term 1 and term 2 is 0.3364, the similarity between term 2 and term 3 is zero, and so on. This is all based on the contexts in which those documents and terms appear.
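Both similarity matrices can be computed directly from the normalized matrix; the values below are an illustrative stand-in for the slide's 9x7 example:

```python
import numpy as np

# A column-normalized term-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
A = A / np.linalg.norm(A, axis=0)

term_term = A @ A.T  # term-to-term similarities (9x9 for the slide's matrix)
doc_doc = A.T @ A    # document-to-document similarities (7x7 for the slide's)

# With unit-length document columns, every document has similarity 1 with itself.
print(bool(np.allclose(np.diag(doc_doc), 1.0)))  # True
```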

So let's now wrap up the section. We discussed a technique for dimensionality reduction called latent semantic indexing, or LSI, which in some papers is called LSA, as in latent semantic analysis; the two mean pretty much the same thing.

Dimensionality reduction is used to identify hidden, or latent, concepts in textual spaces, and it is used for a variety of NLP and information retrieval tasks, including, but not limited to, query matching in latent space.

And here are two external pointers about latent semantic indexing or, if you prefer, latent semantic analysis, from two of the sites that have been most active in this field: the University of Colorado and the University of Tennessee, Knoxville.

So this concludes the section on text similarity using singular value decomposition, or dimensionality reduction. The next topic is going to be text similarity using text kernels.