[MUSIC] Moving on from linear methods, we can now begin to describe a non-linear method for regression, and a notable example of this kind of approach is the Artificial Neural Network. This is an approach inspired by the workings of the brain, and it reflects this in a sense: the model includes a layer of hidden variables. Just as a neuron in the brain takes a number of inputs and then decides whether to fire or not, so it's kind of binary, these hidden variables take a number of inputs and decide whether to be on or off. And it's from these hidden variables that we eventually make a prediction of the output.

So the overall structure of a neural network is illustrated schematically on the right. It's made up of several layers: on the bottom we have the input variables; in terms of these we define the hidden variables, which are the neurons; and it's from the state of the neurons that we finally make our predictions of the output variables. So at the very bottom we have our input variables, our input data. Then, by taking linear combinations of these variables and applying a sigmoidal function, which is a kind of smooth version of the on-or-off state, we arrive at the values of the hidden variables, or neurons. These typically sit in an off or an on state, depending on the linear combination of the input variables. Then, finally, from this state of the neurons, we apply a non-linear function to make a prediction of the output variables. Now, the parameters of this model are the parameters of the linear combinations here and the parameters of the non-linear function here, and these parameters are fitted by a numerical process called back-propagation. This overall regression scheme is very successful, and very accurate in a wide range of situations.

The Support Vector Machine is perhaps the most widely used method of classification. In its essence it's a linear method, though it can be extended into a non-linear method.
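As a minimal sketch of the neural-network forward pass described above (linear combinations of the inputs, a sigmoid to give the smooth on-or-off hidden state, then an output computed from the neurons), the weights and sizes below are hypothetical, not taken from the talk:

```python
import numpy as np

def sigmoid(z):
    # Smooth version of the on/off state of a hidden unit
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    # Linear combinations of the inputs, passed through the sigmoid,
    # give the hidden-variable ("neuron") activations
    h = sigmoid(W_hidden @ x + b_hidden)
    # The prediction of the output is a function of the neuron states
    return w_out @ h + b_out

# Hypothetical example: 3 input variables, 4 hidden units, 1 output
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W_hidden = rng.normal(size=(4, 3))
b_hidden = rng.normal(size=4)
w_out = rng.normal(size=4)
b_out = 0.0
y_hat = forward(x, W_hidden, b_hidden, w_out, b_out)
```

In practice the weights here would be fitted to training data by back-propagation rather than drawn at random.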
So it's essentially linear because the support vector machine is defined by the optimal separating hyperplane, which is to say, the linear boundary which divides the space in a way that optimally predicts the class of the input. Earlier in this talk, I showed you an example of a linear method for classification, whereby the space was divided into two with a linear boundary. The idea of the support vector machine is that the optimal boundary is the one which maximizes the margin. The margin is defined to be the distance from the nearest data point to the separating boundary. An important property of this is that when you choose the optimal separating hyperplane, the one that has the maximum margin, only the data points which are closest to the boundary actually contribute to that decision directly. These data points are called the support vectors.

In situations where it's not possible to completely separate the data, you can use a process called basis expansion, which can also be used to transform the support vector machine into a non-linear approach. With a basis expansion, we increase the number of input variables by defining an extra set of variables which are non-linear functions of the input variables. Support vector machines are very popular and widely used, because they are often highly accurate, and they're computationally very efficient, because we can calculate the optimal separating hyperplane based only on inner products, as defined by a kernel.

Finally, we're going to think just a little bit about the assessment of machine-learning models. That is to say, how do we assess the performance of a model and select the best one? Initially, we might choose the machine-learning model that has the smallest prediction error: the one whereby we predict the value of the output and choose the model that minimizes the difference between our predictions and our training data. This is called the training error.
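The basis-expansion idea mentioned above can be illustrated with a small, hypothetical example: one-dimensional data that no single threshold can separate becomes linearly separable once we add a non-linear function of the input (here, its square) as an extra variable:

```python
import numpy as np

def expand(x):
    # Basis expansion: augment each input with a non-linear
    # function of it, here (x, x^2)
    return np.column_stack([x, x**2])

# Hypothetical 1-D data: class 1 sits in the middle, class 0 outside,
# so no single linear boundary in x separates the classes
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# In the expanded (x, x^2) space the classes ARE linearly separable:
# the line x^2 = 2 splits them perfectly
phi = expand(x)
pred = (phi[:, 1] < 2.0).astype(int)
```

A support vector machine run on the expanded variables would find a linear boundary in the new space, which corresponds to a non-linear boundary in the original space.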
But the real test of how good the model is, is how well it performs on data that it's never seen before, which is to say, how well does the model generalize? For this, we need to estimate what is called the test error. Now, the problem is that by making the model sufficiently complex, it is always possible to achieve zero training error. But when such a model receives data that it has not seen before, it will likely have a large error. So we need to strike a balance between making a model that is sufficiently complex to predict the variation in the data, but not so complex that we have simply fitted the noise in the data, which is called over-fitting. Finding a way to select the model that generalizes the best, and so produces the smallest test error, is the aim of machine learning.

So next, we're going to describe one approach to estimating the test error. A very simple and commonly used approach to estimating the test error, and thereby the accuracy of the model and its generalization, is K-fold cross-validation. What we do here is take all the data that we have and divide it into a number K of equally sized partitions. Then we reserve one of those partitions for later and take the rest to train our machine-learning model. Then we bring back the reserved data set and use it to estimate the error. This way, we get an estimate of the error in our model based on data that has not been used to train the model, so we get an estimate of how generalizable the model is. And we can repeat the process using each successive fold as the test fold, and average the results. [MUSIC]
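The K-fold procedure just described might be sketched as follows; the `fit` and `predict` callables here are hypothetical stand-ins for whatever model is being assessed:

```python
import numpy as np

def k_fold_error(X, y, fit, predict, k=5):
    # Divide the data into K equally sized partitions ("folds")
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        # Reserve fold i for testing, train on the rest
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        # The error is estimated on data never used to train the model
        err = np.mean((predict(model, X[test_idx]) - y[test_idx]) ** 2)
        errors.append(err)
    # Average the error over all K test folds
    return np.mean(errors)

# Toy illustration: a "model" that just predicts the mean training target
fit = lambda X, y: y.mean()
predict = lambda m, X: np.full(len(X), m)

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.ones(20)
cv_err = k_fold_error(X, y, fit, predict, k=5)
```

Because the targets here are constant, the cross-validated error comes out to zero; with real data, this averaged test-fold error is the estimate of how well the model generalizes.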