Hello everyone. In this video, we're going to talk about Support Vector Machines. Let's review briefly. In machine learning, we have different learning tasks. In this class, we focus on supervised learning; that means given the data, we would like to predict the labels. This prediction task has two categories: regression and classification. Regression means that the predicted value is real-valued, whereas in classification, the predicted value is a category. We talked about binary classification and multi-class classification, and according to these different tasks, there are different models that we can apply. For example, linear regression applies to regression problems, and logistic regression, although the name says regression, is for binary classification. We talked about how we can generalize logistic regression using softmax to do multi-class classification, or we can apply a logistic regression model to do multi-class classification if we choose one class versus the others. Then we moved on to non-parametric models, such as K-nearest neighbors (KNN) and decision trees. KNN doesn't have parameters, unlike linear regression or logistic regression. It is one of the simplest models in machine learning, and it can do both regression and classification. Decision trees are weak learners, but they are very flexible and easy to interpret. They can also do regression and classification. We also talked about ensemble methods, which can apply to any model. However, they are most beneficial for decision trees, because decision trees are weak learners, and by ensembling them, they can become a strong learner. For example, we talked about a parallel ensemble method, random forest, in which we grow the trees in a decorrelated way and then average them. Another method that we talked about was a serial ensembling method, which is boosting.
Instead of growing the full tree, we let the trees grow very slowly, one small one at a time. We talked about adding a stump, which has just one or a few decision splits, and then adding them one at a time with some learning rate. For the rest of the class, we'll talk about SVM, which is another powerful non-parametric model. There are some other supervised learning models that can perform well, such as neural networks. However, we won't have time to go deeply into neural networks in this course, so we'll skip that. Let's briefly review each model's hyperparameters, parameters, and loss function, a little bit in depth. Linear regression has no hyperparameters, but we need to design the feature space: how many features we want to include, how many higher-order terms we want to include. That is more the domain of feature engineering, but it can be a design consideration. Linear regression has parameters: in w1*x1 + w2*x2 + intercept, all these w's are parameters. As the loss function for linear regression, we talked about MSE loss and, similarly, RSS. Logistic regression is very similar to linear regression, except that it has a sigmoid function that maps the output to a probability at the end. There is no hyperparameter, and again, there are design considerations such as how many features and how many higher-order terms we want to include. The parameters are the same: we have the same linear form, and then there is a sigmoid at the end, but the parameters are very much the same as in linear regression. For its loss function, logistic regression uses binary cross-entropy. In KNN, K is the hyperparameter. K means the number of neighbors we want to consider when deciding which class a point belongs to. There are no parameters, because KNN is a non-parametric model. There's no loss function, because there is no optimization going on.
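To make the two loss functions mentioned above concrete, here is a small sketch in Python. This is my own illustration, not from the lecture, and the function names and numbers are made up:

```python
import math

def mse_loss(y_true, y_pred):
    # Mean squared error, the loss used for linear regression.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def bce_loss(y_true, y_prob):
    # Binary cross-entropy, the loss used for logistic regression.
    # y_prob holds the sigmoid outputs, i.e. predicted probabilities of class 1.
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)

print(mse_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # 0.4166...
print(bce_loss([1, 0], [0.5, 0.5]))                # 0.6931... (= ln 2)
```

Note that binary cross-entropy penalizes confident wrong probabilities much more heavily than MSE would, which is why it pairs naturally with the sigmoid output.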
However, there is a rule for how to decide: when a point has more neighbors of the x class around it, like this, you will classify it as x. To determine which neighbors are close by, KNN uses a distance metric such as Euclidean distance. KNN doesn't have a loss function, since there is no optimization; however, it uses a distance metric in order to make a decision. Decision trees are, again, non-parametric models. There are no parameters, therefore there is no optimization. However, decision trees have hyperparameters, such as the max depth, the minimum number of samples in a terminal node, and things like that. Optionally, if you were to do some pruning, there was something called ccp_alpha, which is the threshold for the pruning criterion. There are no parameters for decision trees because there is no explicit optimization process. However, they require some criterion for splitting. If you remember, when splitting the samples in one box, the decision tree model goes through all the features and picks the feature and split value that minimize this criterion function. The criterion function was something like the Gini index or entropy for classification, and MSE or RSS for regression tasks. Then we also talked about ensembling methods that derive from decision trees. Ensembling methods all share similar hyperparameters with decision trees, and on top of that, they have additional hyperparameters such as the number of trees, because they are going to ensemble several trees. For boosting, there can also be a learning rate. Again, there are no parameters for these ensembling methods. For the split criteria, they have the same criterion functions as decision trees.
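The Gini index and entropy criteria mentioned above can be sketched in a few lines of Python. This is my own toy example, not from the lecture:

```python
import math

def gini(labels):
    # Gini index of one node: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    # Entropy of one node: -sum of p * log2(p) over classes.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# A pure node has Gini 0; a 50/50 node has Gini 0.5 and entropy 1 bit.
print(gini(['a', 'a', 'a', 'a']))     # 0.0
print(gini(['a', 'a', 'b', 'b']))     # 0.5
print(entropy(['a', 'a', 'b', 'b']))  # 1.0
```

A split is scored by the weighted average of this criterion over the two child nodes; the tree picks the split that makes that average smallest.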
In SVM, which we're going to talk about, there is one hyperparameter called the C parameter; we'll talk about what the role of the C parameter is. There are no parameters, because SVM is also a non-parametric method. However, SVM internally has an optimization process. Neural networks, although we're not going to talk about them deeply here, have parameters, hyperparameters, and loss functions as well. Let's talk about support vector machines. Here are a few facts about support vector machines: they use a hyperplane as a decision boundary, and we'll talk about that more later in this lecture. They use a kernel, which is a function that applies to the feature space and is especially useful when we deal with high-dimensional feature spaces such as images or text. For example, instead of doing feature engineering on image pixels, we can apply some function such as finding the similarity between pixel patches, and that way we can save some computation. Because of that, support vector machines were widely used and developed during the '90s, before neural networks became very popular. They use some mathematical kernel tricks to deal with high-dimensional data such as images. SVM is one of the high-performing, off-the-shelf machine learning methods: tree ensemble methods, support vector machines, and neural networks are all popular high-performing methods. Support vector machines can do regression and classification, and they work natively on binary classification. However, we can also use the one-versus-the-rest method to do multi-class classification. Let's talk about binary classification. It is essentially a yes-or-no problem. For example, it could be a problem like whether a credit user will pay back the debt or not, whether an insurance claim is fraudulent or not, or whether an email is spam or not.
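As a rough sketch of what a kernel does (my own illustration; the gamma value and data points are made up), the popular RBF kernel turns two feature vectors into a single similarity score without ever constructing the high-dimensional feature map explicitly:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    # RBF (Gaussian) kernel: a similarity score based on squared
    # Euclidean distance. Identical vectors score 1.0; far-apart
    # vectors approach 0.0.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0 (identical points)
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # ~1.4e-11 (distant points)
```

This is the computational saving mentioned above: the kernel gives us the similarity directly, instead of engineering and comparing a huge number of explicit features.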
It can be a medical diagnosis problem: whether this patient has a certain disease or not, whether the patient will survive or not, whether this customer will continue with the service or not. As you know already, binary classification can take any data format, as long as the label is yes or no. For example, image recognition can be binary classification: whether the object in the driving scene is a pedestrian or not, something like that. We can also do binary classification on text data, such as sentiment analysis. Previously, we talked about logistic regression as the simplest model to do binary classification. As you know, this curve is a representation of a probability, which is actually a sigmoid function of z. This z is called the logit, and it is described by a linear combination of the features x with the weights and bias, like in linear regression. When z is 0, the sigmoid function's probability becomes 0.5. Therefore, it becomes the decision boundary. Previously, we talked about how this decision boundary can be a point when it's a one-dimensional feature space, a line like this when it's a two-dimensional feature space, a plane in three-dimensional space, or a hyperplane when it's a multidimensional space. Now you know what the hyperplane is. The question is, how do we find this hyperplane that becomes the decision boundary using SVM? We would like to find the hyperplane that separates the data points according to the right class, like this. But depending on how the data points are distributed, there could be more than one way to separate them. For example, this can be a perfect choice, but this can also be a good choice. This hyperplane can also separate the data perfectly. The question is: which hyperplane should we choose? Now, we're going to introduce a classifier called the maximum margin classifier, sometimes just called hard-margin SVM.
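The sigmoid, the logit z, and the z = 0 decision boundary described above can be sketched in a few lines. The weights and inputs here are made-up numbers for illustration, not from the lecture:

```python
import math

def sigmoid(z):
    # Maps the logit z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    # z = w . x + b is the logit; z = 0 is the decision boundary,
    # where sigmoid(z) = 0.5.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0

print(sigmoid(0.0))                             # 0.5, the decision boundary
print(predict([2.0, 1.0], [1.0, -1.0], -0.5))   # z = 0.5, so class 1
print(predict([0.0, 1.0], [1.0, -1.0], -0.5))   # z = -1.5, so class 0
```

Note that the class decision only depends on the sign of z, which is why the set of points with z = 0 (a point, line, plane, or hyperplane) is the decision boundary.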
One thing that we can consider is that we want to train our model such that it generalizes better. That means if we have a new data point like this, our model should be able to classify it correctly. In other words, we would like a hyperplane that's less likely to misclassify new data, and how can we achieve that? We can select the hyperplane that has the biggest margin. Let's see what that means. Here is the data again. Let's say this is the hyperplane, and these points are closest to the hyperplane. Those are called support vectors. The distance from the hyperplane to those closest support vectors is called the margin. These are the margins. The maximum margin classifier learns to maximize the distance between the hyperplane and these support vectors. Let's talk about how the maximum margin classifier finds a hyperplane. Initially, because it doesn't know the right hyperplane, it's going to look like this: a randomly chosen hyperplane, which puts these points on the wrong side of the margin. When data points are on the wrong side of the margin, they make the loss function bigger. The optimizer in the SVM will try to reduce this error, so it will adjust the coefficients of the hyperplane equation. Now the hyperplane looks like this. We still find data points on the wrong side of the margin, but it is a smaller error compared to the previous one, a smaller loss. Again, the optimizer will try to reduce the error and update its hyperplane, which looks like this. When we go through this iteration over and over again, finally the hyperplane will be optimized such that the margin to the support vectors is maximized. Here is a short quiz: what happens to the separating hyperplane if we add new data points? The answer is that it depends on where the data points are added. For example, if new data points like these are added outside of the margin, nothing will happen to the hyperplane.
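To make the margin concrete, here is a small sketch (my own toy hyperplane and points, not the SVM optimizer itself) that computes each point's distance to a given hyperplane w.x + b = 0; the support vectors are the points with the smallest distance, and that smallest distance is the margin:

```python
import math

def distance_to_hyperplane(x, w, b):
    # Geometric distance from point x to the hyperplane w . x + b = 0:
    # |w . x + b| divided by the norm of w.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(z) / norm

# Toy hyperplane x1 + x2 - 3 = 0, i.e. w = (1, 1), b = -3.
w, b = [1.0, 1.0], -3.0
points = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [2.0, 2.0]]
dists = [distance_to_hyperplane(p, w, b) for p in points]

# The margin is set by the closest points (the support vectors).
margin = min(dists)
print(margin)  # ~0.707, achieved by [1, 1] and [2, 2]
```

The maximum margin classifier searches over w and b to make this minimum distance as large as possible while keeping every point on its correct side, which is why only the support vectors matter: points farther away, like the ones added outside the margin in the quiz, leave the solution unchanged.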
However, if the data points are added inside the margin, or even on the wrong side of the margin, the hyperplane must change. Let's say we have a new data point like this, and obviously it's on the wrong side of the margin. The blue points should be above the hyperplane; however, this new data point is on the wrong side, below the hyperplane. In that case, the hard-margin classifier will try to fix this, so it will have to change the hyperplane like this. But not only is the hard-margin classifier sensitive to new data points, sometimes it's impossible to use. As you see in this graph, the data points are inseparable. When we have inseparable data, the hard-margin classifier, in other words the maximum margin classifier, won't work, so we need to relax the condition of having a hard margin. Therefore, another method called the soft-margin classifier can be useful in this case. We'll talk about that in the next video.