Now, I'm going to describe linear regression, which is important because it often works very well, it's simple, and it forms the basis for more advanced methods. The basic assumption of linear regression is that the predictor function f, which is a function of our input data, is linear, which means it takes the form f(X) = β0 + β1 X1 + … + βp Xp. If you want to think in terms of gene expression, with X being the gene expression profile from a sample, then our prediction of a physiological variable, say survival, is given by a constant β0 plus a weighted sum of the gene expression values, where the weights are the parameters βj. Our job is then to find the parameters β that make the predictor function optimally predict the value of the physiological parameter.

The way we do this, in terms of our general formulation of regression, is to minimize a loss function. The loss function that we'll use is the sum of squares, which means that for each data point we take the difference between the observed output value and the prediction of that value, square it, and then add up these squared differences over all of our input data.

If we arrange our input data in a matrix, with n rows, one for each value in our data set, and a column for each of the features (the genes, say), so that the entry in the ith row and jth column contains the expression value of gene j in sample i, then we can write the optimal parameters in closed form as β̂ = (XᵀX)⁻¹Xᵀy.

Now, to give you a more visual impression of what we do in linear regression, we'll consider a simple example. Let's say that we have only one feature, so that p is equal to 1, which means we have only one input variable x, which we will plot along the horizontal axis, and only one output variable y, which we'll plot along the vertical axis. Then we can plot our data set as a scatter plot. The fundamental assumption of linear regression is that the relationship between the output variable and the input variable is linear, so our job is to find the straight line which optimally fits the data. And by optimal we mean the line that minimizes the sum of the squares of the vertical distances of each data point from the line.

So that's what it looks like in two dimensions, and this generalizes to three dimensions. Let's say we have two input variables, which we plot along two horizontal axes, and an output variable, which we plot along the vertical axis. Then, in an analogous fashion, we can plot our data as a scatter plot, and linear regression tries to find the plane which optimally fits the data.

Now I'm going to describe linear methods for classification. The classification problem is the one in which we are given some input data, and each input datum is associated with one of a finite set of classes, which we'll call G. We can think of our input data as occupying some kind of space. Let's say our input data has two variables, x1 and x2; you might like to think of the expression of a pair of genes. Then each datum corresponds to a point in the space. Each datum is also associated with one of two classes, the red and the blue; you might like to think of these as a diseased and a normal state, for example. So in this way, we can represent our data in a two-dimensional space.
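To make the least-squares fit concrete, here is a minimal sketch in Python. It is my own illustration, not part of the lecture: the toy data in X and y, and the names X1 and beta_hat, are made up for the example. It solves the same least-squares problem as the closed form β̂ = (XᵀX)⁻¹Xᵀy, but uses NumPy's lstsq, which is the numerically stable way to do it.

```python
import numpy as np

# Toy data: n = 5 samples, p = 2 features (e.g. expression of two genes).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8])  # the physiological variable

# Prepend a column of ones so the intercept beta_0 is fitted too.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem min_beta ||y - X1 beta||^2,
# equivalent to the normal equations beta_hat = (X^T X)^{-1} X^T y.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("intercept beta_0:", beta_hat[0])
print("weights beta_1..beta_p:", beta_hat[1:])

# Prediction for a new sample: f(x) = beta_0 + sum_j beta_j * x_j.
x_new = np.array([1.0, 2.5, 2.5])  # leading 1 multiplies the intercept
print("predicted y:", x_new @ beta_hat)
```

The prediction at the end is just the linear form from above: the constant plus the weighted sum of the feature values.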
Then the aim of classification, in this example, is to find the best way of dividing this space such that each region of the space is optimally associated with a class. A linear method for classification draws linear boundaries. With a linear classifier, we find a linear division, which is a straight line in this two-dimensional space, such that all points of the space on one side of the line are predicted to be from one class, and all points on the other side of the line are predicted to be in the other class. So we say that any point in the blue region is predicted to be from the normal state, and every point in the red region is predicted to be from the diseased state. Linear methods of classification are ones which calculate the optimal linear division of the space in order to predict the class.

A very simple, non-linear alternative for the classification problem is the nearest neighbor classification method. In this approach, given a new data value, we make a prediction based on its nearest neighbors in our training data set. So let's say we're given a pair of input values, a point somewhere in this space. Then we would take its nearest K neighbors, let's say K is three, among the training data. In this example, those three neighbors all belong to the blue class, so for this point we would predict that the class should be blue. You can do this for every single point in the space, and when you color the points in this example, you get a division which has a very irregular shape.
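As a companion sketch, here is a from-scratch K-nearest-neighbor classifier in Python. Again this is my own illustration, not the lecture's code: the function knn_predict and the toy blue/red training points are made up for the example. It predicts the class of a new point by majority vote among its K closest training points under Euclidean distance, exactly the rule described above.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, labels, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest
    training points, using Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to each point
    nearest = np.argsort(dists)[:k]                  # indices of k closest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: two features (e.g. expression of two genes),
# two classes ("blue" = normal, "red" = diseased).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # blue cluster
                    [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])  # red cluster
labels = ["blue", "blue", "blue", "red", "red", "red"]

print(knn_predict(X_train, labels, np.array([1.8, 1.6])))  # -> "blue"
print(knn_predict(X_train, labels, np.array([5.2, 5.1])))  # -> "red"
```

Evaluating knn_predict over a grid of points in the space and coloring each point by its predicted class is what produces the irregular decision boundary mentioned above, in contrast to the straight line of a linear classifier.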