In this part of the lecture we'll introduce support vector machines. Now, support vector machines is a term that commonly refers to a family of related classifiers, and we'll look at three of them in this lecture: the maximal margin classifier, the support vector classifier, and what's called the support vector machine, which is an extension of the support vector classifier. These are three different types of classifier that rely on very similar concepts. The first two, the maximal margin classifier and the support vector classifier, are linear classifiers. Representing points in two dimensions, we might assume that there's a linear decision boundary that allows us to say that points on one side of that boundary are of one class and points on the other side are of another class.

In general, support-vector-machine-type classifiers attempt to separate data points in p dimensions into classes using hyperplanes that divide the points into those classes. Now, in p dimensions, a hyperplane is a flat affine subspace of dimension p minus 1. In two-dimensional space a hyperplane is simply a line; in three dimensions it's a plane. This is hard to visualize in higher dimensions, but the notion is still well-defined. The hyperplane essentially divides the p-dimensional space into two halves, so that we can say that points on one side of the decision boundary are of one class and points on the other side are of the other class.

Here's an example hyperplane in two dimensions, which we could represent with the following equation, and the classifier is fairly straightforward. For any point where 1 plus 2 times x_1 plus 3 times x_2 is less than 0, that point lies below the green line; similarly, if that computation results in a positive quantity, the point lies above the line. Now, we can formalize what we just saw in multiple dimensions: if that computation on a point in p dimensions is positive, the point is assigned to one class; similarly, if the same computation is negative, the point is assigned to the other class. These two relationships can be expressed in closed form as follows, such that the class y is simply determined by the side of the hyperplane on which the point lies.

Now, given a set of points, there are many possible ways to separate those points with hyperplanes, and ultimately we need to pick just one of those separating hyperplanes. The question then is: given multiple choices for how to separate these points using a hyperplane, which separator is the best? One way of defining the best separator, or the best hyperplane, is what's called maximum margin. The idea here is that you want a separating hyperplane that is farthest from the training observations; that is, you don't want any points that are close to that decision boundary. Intuitively, this creates the widest straight band that separates the two classes. Now, the maximum margin hyperplane has some interesting properties. In particular, the plane itself depends only on a small set of the points: the three points that I've circled here are the points closest to that separating hyperplane in terms of perpendicular distance, and we call those the support vectors. Interestingly, the precise values of the remaining points in this dataset don't affect the positioning of that separating hyperplane.
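As a concrete sketch of the equations referenced above (the exact slide notation may differ), the two-dimensional example and the general p-dimensional classification rule can be written as:

```latex
% Example hyperplane in two dimensions, from the lecture's example:
%   1 + 2 x_1 + 3 x_2 = 0
% A new point x is classified by the sign of the left-hand side.
%
% In general, for coefficients \beta_0, \ldots, \beta_p and labels y_i \in \{-1, +1\}:
f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad \hat{y} = \operatorname{sign}\bigl(f(x)\bigr)
% and a hyperplane separates the training data exactly when every point
% lies on the correct side, i.e.
y_i \bigl(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\bigr) > 0
\quad \text{for all } i = 1, \ldots, n.
```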
Formally, here is the optimization that the maximum margin hyperplane attempts to solve. We're trying to maximize the margin, that is, the band that separates the two classes, subject to some constraints on our coefficients, such that the points themselves lie on either side of the hyperplane but are at least a distance of the margin M away from it. These constraints guarantee two properties: one, that the points with class y lie on the correct side of the hyperplane, and two, that each of those points is at least a distance M from the hyperplane. Now, I won't derive how to solve this optimization in this part of the lecture, but it's a lot simpler than it looks.

Now, the max margin classifier has some problems. One is that a separating hyperplane may not exist; that is, any hyperplane I create in this p-dimensional space through these points may have points on the wrong side of the decision boundary. Second, even if such a hyperplane exists, it may have a very small margin, and I may be interested in a hyperplane with a wider margin between the hyperplane and the data points, even if that means some small set of points lies on the wrong side of the hyperplane. The solution to these problems is to create what's called a support vector classifier, or soft margin classifier. The idea here is that we're going to relax the constraint we previously saw, allowing some of the training samples to be on the wrong side of the decision boundary. Note that a support vector classifier is unavoidable when no separating hyperplane exists.

To review, the goal of a support vector classifier is to find the optimal separating hyperplane that maximizes the margin of the training data, where the margin here is shown by the green arrow that spans the decision boundary. To relax the optimization we've already seen, we add a multiplier to the margin that allows points within some fraction of that margin to lie on the other side of the hyperplane. The parameter C that bounds this slack is sometimes called the slack budget of a support vector classifier, and it is a parameter that needs to be tuned. Here's how tuning that parameter C affects the bias-variance tradeoff: larger values of C create wider margins, allowing more points to lie on the wrong side of the hyperplane, whereas smaller values of C give narrower margins, resulting in a decision boundary that's more closely fit to the training data. Now, as I've said before, an interesting property of these classifiers is that points lying strictly on the correct side of the margin do not affect the separating hyperplane; only the support vectors do. Here are some illustrations of the effect of C on the hyperplanes one might get with the same dataset: large values of C result in a wider margin, and smaller values of C result in narrower margins, and thus a lower tolerance for points that lie on the wrong side of the boundary.

Now, what if a linear decision boundary isn't what you want? What if there's some more complex decision boundary that you'd like to express? That's where the more general concept of a support vector machine comes into play. Here we're going to extend the linear support vector classifier to allow for non-linear boundaries between classes.
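For reference, here is a sketch of the two optimizations just described, written in the notation commonly used for this formulation, where M is the margin and C is the slack budget (the slide notation may differ slightly):

```latex
% Maximal margin classifier: maximize the margin M subject to every
% training point lying on the correct side, at least M away from the plane.
\max_{\beta_0, \ldots, \beta_p,\, M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1, \qquad
y_i \Bigl(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\Bigr) \ge M
\quad \text{for all } i.

% Support vector (soft margin) classifier: slack variables \epsilon_i let
% points violate a fraction of the margin, with total violation bounded by C.
\max_{\beta_0, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n,\, M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1, \qquad
y_i \Bigl(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\Bigr) \ge M (1 - \epsilon_i), \qquad
\epsilon_i \ge 0, \;\; \sum_{i=1}^{n} \epsilon_i \le C.
```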
It turns out that we can express the linear support vector classifier as an intercept plus a weighted sum of dot products between a new point and each of the training observations. In the solution to the support vector classifier problem, most of those weights, the alphas, are zero, and the observations with non-zero alphas are exactly the support vectors. Once we recognize that the dot product between two observations, which is how we've quantified their similarity here, can be generalized, we arrive at what's called a kernel function. We'll replace the dot-product computation with a kernel function, which is simply a different quantification of the similarity between two observations. Two common kernels are the polynomial kernel, which is expressed as follows and has a tuning parameter d, the degree of the polynomial, and the radial kernel. Here you can see a couple of examples where kernels of different shapes allow separation along more complex decision boundaries, with an SVM trained with a polynomial kernel shown on the left and one trained with a radial kernel shown on the right.

In summary, support vector machines represent a set of classifiers: the maximum margin classifier, the support vector classifier, and the support vector machine. The goal of all of these classifiers is to compute a hyperplane that maximizes the separation between the classes in the training set. The support vector classifier involves a linear decision boundary, but it's possible, using support vector machines, to represent complex relationships and decision boundaries through different types of kernels.
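To make this concrete, here is a minimal sketch (not from the lecture) using scikit-learn's SVC, comparing a linear support vector classifier with polynomial- and radial-kernel SVMs on data that is not linearly separable. Note that scikit-learn's C parameter penalizes margin violations, so it behaves inversely to the slack budget C described earlier: larger values of scikit-learn's C give narrower margins.

```python
# A minimal sketch comparing a linear support vector classifier with
# polynomial- and radial-kernel SVMs on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric-circle classes: no separating hyperplane exists in 2D.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [
    ("linear", {}),                 # support vector classifier (linear boundary)
    ("poly", {"degree": 3}),        # polynomial kernel with degree d = 3
    ("rbf", {"gamma": "scale"}),    # radial (RBF) kernel
]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X_train, y_train)
    print(f"{kernel:>6} kernel: test accuracy = {clf.score(X_test, y_test):.2f}, "
          f"support vectors = {clf.support_vectors_.shape[0]}")
```

On this kind of data, the linear classifier performs near chance, while both kernelized SVMs recover the circular decision boundary, which is the point of the kernel extension described above.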