
In this video we are going to talk about support vector machines.

Let's think again about the classification task we first saw. This is a passage about some medical condition, and we need to decide whether it is nephrology, neurology, or podiatry. Looking at the words, like athlete's foot, infection of the foot, and so on, we would know that this is podiatry.

Let's take another example, and this is about sentiment analysis. This is somebody's review of a movie, and you could guess which movie it is, Sense and Sensibility. You can see that the author, the reviewer here, has given some stars and so on, but has also used words like wow and great movie. Using those words we could identify that this particular review is positive. If you had seen words like boring, or lame, or worst, and so on, it would be a negative review. So again this is a sentiment classification task, where, looking at the words, we had to determine whether the review is positive or negative.

So in general, a classifier can be seen as a function on the input data. You have some function f that works on medical passages and decides what the class label should be: nephrology, neurology, or podiatry. Or another function looks at a movie review and decides it is +1, which stands for a positive review, or -1, which stands for a negative review. It is typical to have numbers like +1 or -1 assigned to classes, and that is what support vector machines, or any other linear classifier, typically use.

When we are talking about support vector machines, we need to talk about decision boundaries. What is a decision boundary? Let's take this example. There are just two dimensions, X1 and X2, and there are data points in this two-dimensional space. Some of them are marked star, others are marked circle. Let's call the circle ones Class A, and the stars Class B. Now you need to somehow draw a boundary around one of the classes to separate it from the other class. You could use any decision boundary: one that is a circle, one that is a square, or any other shape; in this case let's take the shape of a heart.

A classification function is represented by this decision surface, or decision boundary. The choice of what shape you should use is completely dependent on the application. If you think that points in the two-dimensional space are well separated by a circle, you would use that. Whenever you use a decision boundary, the main purpose is that when you get a new point that is unlabeled, depending on the decision boundary you have learned, you can label it as one of the classes. So in this particular example, this unlabeled point will be marked as a star, or Class B, because it is outside the boundary that was learned around Class A. But when you are using a decision boundary, there are other factors that should be considered.

Let's take another example where you have positive and negative points, so positive is one class and negative is the other, and all the red positives and negatives form the training data that we have. Looking at this data, you could learn a boundary like this. It is an irregularly shaped pentagon, or in general any other polygon or closed surface, that puts all the positive points inside the polygon and all the negative points outside. This is a perfectly good decision boundary: the accuracy on the training data is 100%; every point is labeled correctly. But then, when we look at the test data, represented by black positives and negatives here, you would see that, in general, positive test points are near the positive points you saw earlier, and negative points are near negative points. But because of the way we defined the decision boundary, we have made many more mistakes here. There are many positive points outside of the polygon, where the outside region is supposed to be negative, and there are points within the polygon that are negative. So on unseen test data, this decision boundary has not worked very well. This problem is called overfitting: you have learned a decision boundary that works very well on training data, but does not generalize to test data.

So instead of an irregularly shaped boundary, let's consider a linear boundary, a straight line. In a two-dimensional space it is a straight line; in a three-dimensional space it would be a plane; and in general, for n-dimensional data, it would be a hyperplane. So in this particular case, you have a line that separates the space: one half of this two-dimensional space is positive, and the other half is negative. As you can see, it is not 100% accurate. You have three negative points on the positive side, and one positive point on the negative side. So yes, it has made some mistakes on the training data, but it is very easy to find, because you basically have to learn a line that separates the two classes while making as few errors as possible, and you can see that no single line could have separated the positives from the negatives completely. So this is as good as you can get. It is also very easy to evaluate, because now you have a very simple rule: any point in the half-region on the positive side is going to be called positive, and any point on the other side is going to be called negative.

In general, this idea of using a simple classifier, or a simple model, to fit the data is called Occam's razor, and the rule of thumb here is: simple models generalize well. You want to use something that is simple to explain and that will work well on the test data. And as you can see, if you overlay the test points, you would notice that all test points have been correctly classified, except for one positive that is on the negative side. So the error is really just one point on the test data, and of course there are four points in the training data that were misclassified as well. But overall this is still a much better model than the irregularly shaped boundary, where we saw earlier that quite a few positive points were misclassified, and two of the negative points were misclassified in the test data.

So how do you find this linear boundary? There are many algorithms and methods to do so. Typically, when you are finding a linear boundary, you are finding W, the slope of the line. You basically want to find the coefficients, the weights associated with the dimensions X1 and X2, so W is typically of the same dimension, the same size, as your data points. And you are going to use a linear function to decide which points will be called positive or negative. There are many methods you could use: the perceptron, linear discriminant analysis, linear least squares, and so on. In this particular case, we are going to talk about support vector machines.

The problem is that whenever there is one linear boundary that separates the positive points from the negative points, there are in fact infinitely many such lines. So you could have a line like this, or this one, or this one; they all separate the positive points from the negative points. Then the choice becomes: what is the best line here?

So let's take another example, very similar to what we have seen before. Again two dimensions, positives and negatives. And consider this line. This line passes between the positive and negative points that are the most confusing, the ones closest to the boundary, and it splits the data as well as it can. It is not 100% accurate, as you can see: there are negative points on the positive side and positive points on the negative side. But if you can learn this line, you could equally well have learned any other line. And the question becomes: what is a reasonable boundary?

So instead of the blue line, if you consider this red band, you would see that there is much less confusion about the points. What do I mean by that? In the previous example, when you had just the blue line, any small change to a positive or negative point would create an error. If you make a small perturbation to a positive point, let's say this point is moved a little bit here, then suddenly you are either making a mistake or this line has to change. So you can see that a small change in a data point can lead to a big change in the classifier. However, if you use a line like this, where you have a big band separating the points, you would notice that the same small change would still leave the point on the same side of the hyperplane, where you would expect it to be. So a small change in your data points would still not change the label that this classifier assigns: it is more resistant to noise, or perturbations.

This idea is called maximum margin. You want to learn the thickest band you can fit that separates the positive points from the negative points. The thickness of this band is called the margin, and the middle line is the actual hyperplane, so this is the maximum-margin hyperplane. The idea here is that you are not basing your classifier on two points, one positive and one negative, but on a set of points called support vectors. In general, a support vector machine is just a maximum-margin classifier, where it is not just one or two points that make a big difference: you have a lot of support to learn this margin, or band, and the support vectors are the points closest to the boundary, the ones most sensitive to shifts. But even then, small changes or perturbations to the support vectors would still not change the classification, because classification happens with respect to this maximum-margin hyperplane.
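To make the geometry concrete, here is a minimal sketch (the function name and numbers are my own, not from the video) of a point's distance to a hyperplane W·x + b = 0; the margin is determined by the closest points, the support vectors.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0:
    |w.x + b| divided by the length of w."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

# For w = [3, 4] and b = 0, the point [1, 0] has score 3 and ||w|| = 5,
# so its distance to the hyperplane is 3 / 5 = 0.6.
```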

So now the question becomes: how do you find it? Support vector machines use optimization techniques to do so. These are linear classifiers that find a hyperplane to separate the two classes of data, the positive and the negative class. Basically, given training data of the form (X_1, Y_1), (X_2, Y_2), and so on, where the Xs are the representations of the data in terms of its features, in an n-dimensional feature representation, and the Ys are the class labels, one of the two labels -1 and +1, the SVM will find a linear function W, a weight vector, such that the function f(X_i) is the dot product of W and X_i, so W·X_i, plus some bias term b. The idea is that for a new X_i, if f(X_i) is greater than zero, then the label is +1; otherwise the label is -1.
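As a rough sketch of this decision rule (the function and variable names here are my own, not from the video):

```python
def svm_predict(w, b, x):
    """Label a point by the sign of the linear function f(x) = w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# With w = [1.0, -2.0] and b = 0.5, the point [3.0, 1.0] scores
# 3.0 - 2.0 + 0.5 = 1.5 > 0, so it is labeled +1.
```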

You have seen that SVMs as described work only for binary classification problems, and we have seen an example of this, +1 and -1, so far. But what happens when you have multiple classes? Let's take the example of the three classes: podiatry, nephrology, neurology. In general, when you use an SVM in a multiclass classification setup, you would learn it in one of two scenarios. One is called one versus rest, where you learn binary classifiers between one class and all the other classes together.

For example, this is the classifier between nephrology and all documents that are not nephrology. Assume that d1 to d9 are all nephrology documents. So this classifier perfectly learns nephrology as these nine documents, and anything that is not nephrology is on the other side. Then we learn another classifier, this time for neurology, where the documents d10 to d15 are neurology documents, and everything else, d1 to d9 and d16 to d21, is not neurology. And then we do the same thing with podiatry: the documents d16 to d21 are labeled podiatry, and everything else, d1 to d15, is not podiatry. So in general, we are learning three classifiers, or for an n-class setup, n classifiers, such that all the points in one region, let's take the region with d16 to d21, are labeled not nephrology, not neurology, and yes podiatry. So this region is going to be podiatry.
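A hedged sketch of one-versus-rest prediction: assuming each class has its own binary linear classifier (the weights below are made up for illustration), we label a point with the class whose classifier gives the highest score.

```python
def ovr_predict(classifiers, x):
    """One-vs-rest: classifiers maps each class to its (w, b) pair;
    pick the class whose linear score w.x + b is highest."""
    def score(wb):
        w, b = wb
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(classifiers, key=lambda c: score(classifiers[c]))

# Toy per-class linear classifiers for the three medical classes
# (made-up weights, purely illustrative):
clfs = {
    "nephrology": ([1.0, 0.0], 0.0),
    "neurology":  ([0.0, 1.0], 0.0),
    "podiatry":   ([-1.0, -1.0], 1.0),
}
```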

Right? Let's take another setup, where instead of learning one versus rest, we are going to do one versus one. That means you are going to learn classifiers between, let's say, nephrology and neurology. So in this case you look only at documents that were labeled nephrology or neurology, that's d1 to d9 and d10 to d15, and separate them. So the red line separates just these two classes. Then you learn another classifier, this one between neurology and podiatry, that works on a smaller set: d10 to d15 as the neurology documents, and d16 to d21 as the podiatry documents. And again a third one, between nephrology and podiatry. When you have these three classifiers together, you can see that in general, for an n-class setup, there are n-choose-2 classifiers, a combinatorial number on the order of n-squared. It so happens that for three classes it is three classifiers, but for four classes it would be six classifiers, and so on.

But then, you can still use the same idea. Take these points: between the class labels podiatry and neurology, they are podiatry; between the classes podiatry and nephrology, they are podiatry; and between the classes nephrology and neurology, it is basically a toss-up, because they sit right in the middle. Sometimes such a point is going to be called nephrology, sometimes neurology, but the overall votes are still in favor of podiatry, because each of these points gets two votes for podiatry from the classifiers involving the other two classes, and at most one additional vote for nephrology or neurology. In any case, all of these points are still going to be classified correctly as podiatry. So both approaches, on this particular toy dataset, label the points the right way, but the idea and the approach are very different: in one case you are learning one class against all the other classes, and in the other case it is one versus one, with on the order of n-squared classifiers.
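The voting just described can be sketched as follows (a minimal illustration; the pairwise classifiers here are a made-up stand-in, not trained SVMs):

```python
from itertools import combinations

def ovo_predict(pairwise_winner, classes, x):
    """One-vs-one: run all n-choose-2 pairwise classifiers on x and
    return the class with the most votes."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b, x)] += 1
    return max(votes, key=votes.get)

def toy_winner(a, b, x):
    # Toy pairwise rule standing in for trained classifiers:
    # podiatry wins any pair it appears in, mimicking a point
    # deep inside the podiatry region.
    if "podiatry" in (a, b):
        return "podiatry"
    return a
```

With three classes this runs three pairwise classifiers; podiatry collects two votes, so it wins even though the nephrology-versus-neurology vote goes elsewhere.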

Okay, so let's look at the parameters we set when we learn a support vector machine. One of the most critical parameters is the parameter C, which controls regularization. Regularization here denotes how important it is for an individual data point to be labeled correctly, as compared to fitting all the points with a simpler general model. So how much importance should we give to individual data points? A larger value of C means less regularization: you are fitting the training data as much as possible, giving individual points a lot of importance. Every data point becomes important and you want to get it right, even if the overall accuracy is lower, or the generalization error is higher. Smaller values of C mean more regularization: you are tolerant of errors on individual data points, as long as in the overall scheme you are getting simpler models, so the generalization error is expected to be low. The default value in most packages is one, but you can change this parameter depending on how important you want your individual data point classifications to be.
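As a hedged sketch of the role C plays, here is the standard soft-margin SVM objective (not code from the video): C multiplies the per-point hinge losses, so a larger C penalizes individual training errors more heavily, while a smaller C favors a wider margin, that is, a simpler model.

```python
def soft_margin_objective(w, b, data, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses.
    data is a list of (x, y) pairs with y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    )
    return margin_term + C * hinge

# With w = [1.0, 0.0], b = 0.0, and one point violating the margin
# with hinge loss 1.5, the objective is 0.5 + 1.5 = 2.0 at C = 1,
# but 0.5 + 15.0 = 15.5 at C = 10: the violation dominates.
```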

The other parameters typically concern the type of decision boundary you want to learn. We have talked a lot about linear kernels here, that is, linear decision boundaries, but you could have a polynomial kernel, or a radial basis function kernel, and so on. In fact we have talked about these in course three of this specialization, so I would refer you to the slides and discussion in that previous course. Another parameter for SVMs is whether the problem is multiclass or not, and if it is multiclass, which scheme you would use: OVR would be the parameter setting for one versus rest, and other options are possible. As you can see, with one versus rest you are learning fewer classifiers, so that is preferred over one versus one.

The other important parameter is the class weight: different classes can get different weights. For example, suppose you are classifying spam versus not spam, and you know that spam is usually around 80% of the e-mails somebody gets. Because it is such a skewed distribution, where one class is 80% and the other is 20%, you would want to give different weights to these two classes. In general these parameters can be set in Python models, and we will talk about exact parameter settings in another video.

But to conclude, the big take-home messages about support vector machines are these. They tend to be among the most accurate classifiers for text, especially when we are talking about high-dimensional data, as text data typically is. There are strong theoretical foundations for support vector machines: they are based on optimization theory, which we have not talked about in any detail in this video, but I would encourage interested folks to go and check it out; we will add some references to the reading list for this video. Support vector machines handle only numeric data, so what would you do with other data? You typically convert categorical features into numeric features. As you can see, this is a dot-product-based algorithm: it uses numbers to decide whether a point should be on one side of the boundary or the other, so it works only on numeric features.

You would also want to normalize the features. That means you don't want some dimensions to have very high values and other dimensions very low ones. You would want to put them all in a zero-to-one range, so that each feature gets enough importance and you can learn the relative importance that way.
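One simple way to do this, sketched here (the video does not prescribe a specific method), is min-max scaling of each feature column into [0, 1]:

```python
def minmax_scale(values):
    """Rescale one feature column so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# minmax_scale([0.0, 5.0, 10.0]) -> [0.0, 0.5, 1.0]
```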

But in general, even though the hyperplanes you learn are very easy to train and easy to test, they are hard to interpret, which means you don't really know why a particular point has been labeled positive or negative; it is a combination of multiple factors. But if explanation is not a critical requirement for the classifier, support vector machines do give you very accurate classifiers, very similar to naive Bayes. In fact, for a lot of text classification problems, support vector machines should be one of the first ones you try.
