In this video, we will learn about a machine learning method called Support Vector Machines (or SVM), which is used for classification. So let's get started. In this video, we will discuss kernels and the maximum margin. Kernels. A dataset is linearly separable if we can use a plane to separate each class, but not all datasets are linearly separable.
Consider the following data points. The data is not linearly separable; we can see both colors overlap into each other. The same is true in this example: the data is not linearly separable. We can "transform the data" to create a space where it is linearly separable. For the sake of simplicity, imagine that our dataset is 1-dimensional; this means we have only one feature, x. As you can see, it is not linearly separable. We can transform it into a 2-dimensional space. For example, you can increase the dimension of the data by mapping x into a new space using a function with outputs x and x-squared (a short code sketch of this mapping follows the list of kernel types below). Now the data is linearly separable, right? Notice that, as we are in a two-dimensional space, the hyperplane is a line dividing the plane into two parts, where each class lies on either side. Now we can use this line to classify the dataset. Sometimes it is difficult to calculate the mapping itself, so we use a shortcut called a kernel. There are different types, such as:
- Linear,
- Polynomial,
- Radial basis function (or RBF).
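As a rough illustration of the mapping just described, here is a minimal sketch using scikit-learn and a small synthetic 1-dimensional dataset; the specific data values are made up for illustration, not taken from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 1-D dataset: points near zero belong to class 0,
# points far from zero belong to class 1. No single threshold on x separates them.
x = np.array([-4.0, -3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0, 4.0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# Map each point x into two dimensions: (x, x squared).
X_mapped = np.column_stack([x, x ** 2])

# In the new 2-D space a linear SVM can separate the classes with a straight line.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_mapped, y)
print(clf.score(X_mapped, y))  # expected 1.0: the mapped data is linearly separable
```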
The RBF kernel is the most widely used. Each of these functions has its own characteristics, its pros and cons. The RBF kernel measures the similarity between two inputs X and X', where X' is called a support vector. The RBF kernel has a parameter, gamma; let's see how we select gamma.

Consider the following dataset of cats and dogs: anything in this region is a dog, anything in this region is a cat, anything in this region is a dog, and anything in this region is a cat. Therefore any sample in the red or blue region should be classified accordingly. Unfortunately, you cannot find a plane that separates the data. Here we use a plane to separate a similar dataset; it does not separate the data, so let's try the RBF kernel. Using a gamma value of 0.01 increases the flexibility of the classifier. Using a gamma value of 0.1 seems to classify more points correctly. A value of 0.5 classifies almost all the points correctly, but it does not seem to match the regions in the original slide. This is known as overfitting, where the classifier fits the data points rather than the actual pattern: the higher the gamma, the more likely we are to overfit. Let's clarify this with an example. The following images of cats look like dogs; they could be mislabeled, the photo could be taken at a bad angle, or the cat could simply look like a dog. As a result, these image points will appear in the incorrect region. Fitting the model with a high value of gamma, we get the following results: the model performs almost perfectly on the training points. We can represent the classifier with the following decision region, where every point is classified by its color accordingly. This does not match our original decision regions; this is called overfitting, where we do well on the training samples but may do poorly when we encounter new data.

To avoid this, we find the best value of gamma by using validation data: we split the data into training and validation sets, and we use the validation samples to find the hyperparameters. We test the model for a gamma of 0.5 and get the following misclassified samples; we see that a gamma of 0.1 performs better, so as a result we use a value of 0.1. In practice, we try several different values of gamma and select the value that does best on the validation data.

SVMs work by finding the maximum margin. Which of the three planes do you think performs better in classifying the data? Intuitively, you would say the green line, and you would be correct, as changes in the dataset caused by noise would not affect the classification accuracy of the green line. How do we find the best line? Basically, SVMs are based on the idea of finding a plane that best divides a dataset into two classes, as shown here. As we're in a 2-dimensional space, you can think of the hyperplane as a line that linearly separates the blue points from the red points. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So, the goal is to choose a hyperplane with as big a margin as possible. Examples closest to the hyperplane are support vectors. It is intuitive that only the support vectors matter for achieving our goal, and thus other training examples can be ignored. We try to find the hyperplane in such a way that it has the maximum distance to the support vectors. Please note that the hyperplane and the decision boundary lines have their own equations.
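Returning to the gamma selection described above, here is a minimal sketch of trying several gamma values on a validation set, assuming scikit-learn and a synthetic dataset standing in for the cats-and-dogs example; the gamma values mirror the ones discussed (0.01, 0.1, 0.5) but are illustrative only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data in place of the cats-and-dogs dataset.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_gamma, best_score = None, -1.0
for gamma in [0.01, 0.1, 0.5]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    score = clf.score(X_val, y_val)  # accuracy measured on the validation samples only
    print(f"gamma={gamma}: validation accuracy={score:.3f}")
    if score > best_score:
        best_gamma, best_score = gamma, score

print("selected gamma:", best_gamma)
```

A very large gamma will typically score well on the training points but worse on the validation points, which is the overfitting behavior described above.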
So, finding the optimal hyperplane can be formalized using an equation that involves quite a bit more math, so we are not going to go through it here in detail. That said, the hyperplane is learned from the training data using an optimization procedure that maximizes the margin, and like many other problems, this optimization problem can also be solved by gradient descent. When the classes are not separable, the soft-margin SVM can be used. This is controlled by the regularization parameter, which allows some samples to be misclassified. We select gamma and the regularization parameter C by using the values that do best on the validation data; check out the labs for more.
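As a final sketch, here is one way to search over both C and gamma, assuming scikit-learn; the grid values and the use of cross-validation (rather than a single validation split) are illustrative choices, not the only way to do it.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Smaller C tolerates more misclassified training samples (a softer margin);
# larger C penalizes them more heavily.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 0.5]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```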