One of the basic problems in machine learning is the prediction problem. We have some information, we know something, and we want to use this information to predict some quantity that we don't know. Let us consider an example. Let us assume that we run a cafe and we sell pancakes. What we are interested in is how many pancakes we should prepare. It is determined by the demand: how many pancakes could we sell on a particular day? Of course this is a random value, because on different days we may have different clients, some of them can be hungry and some not, some of them like pancakes, and so on. So we have to use random variables to model this demand, and what we want is to predict this demand in some way. What is the best way to predict the value of this random variable?

So let us assume that X is our random variable, which is the demand for pancakes: how many pancakes can we sell on a particular day? We want to predict the value of this random variable. Let us denote our prediction by x hat, and let us assume that we choose this prediction once and forever. How can we measure the quality of our prediction? Of course, we have to consider the difference between the prediction and the actual demand on a particular day. This difference can be either positive or negative, but both are bad for us. For example, if we underestimate the demand, it means that we run out of pancakes while we still have clients who want them, but we cannot serve them, and this is bad for us because we lose the money we could have earned on these clients. On the other hand, if we prepared more pancakes than we can sell, then it means that we again lose money, because we have these pancakes, nobody wants to buy them, and we have to throw them away. Moreover, in some cases it is a good idea to consider not this difference itself as our loss, but the square of this difference.
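The penalty described above can be sketched as a small Python function; the pancake numbers here are made up for illustration:

```python
def squared_error(x, x_hat):
    """Penalty for predicting x_hat when the actual demand turns out to be x."""
    return (x - x_hat) ** 2

# Underestimating by 3 pancakes costs the same as overestimating by 3:
print(squared_error(13, 10))  # 9
print(squared_error(7, 10))   # 9
```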
For example, if we didn't prepare enough pancakes, we lose some money because we cannot sell the pancakes to the people who want them, and also these people can become angry and give us bad reviews on some websites, and so on. So it is sometimes a good idea to take the square of our error as the penalty. This is actually a popular loss function in machine learning, which is called squared error: (X - x hat)^2. Here X is a random variable, so the whole formula defines a new random variable. On different days we will have different values of this loss. For example, on one day our prediction can be good, if we are lucky enough, and on another day the demand can be much larger than we predicted, just due to chance, and we get a large value of this loss. What we usually want is to make our predictions good on average. It means that we have to find the expected value of this loss. This value is not a random variable, because we take the expectation; it is just a number that depends on x hat. Let me denote this number as L of x hat, where L stands for loss. Now our goal is to minimize this function L by choosing an appropriate x hat.

How can we find such an optimal x hat? Let us use calculus to do it. First of all, let us consider the distribution of the random variable X. We assume that X takes values x_1, x_2, and so on, up to x_n, with probabilities p_1, p_2, and so on, up to p_n. However, we are interested not in the random variable X itself, but in our squared error loss. For each value of X, the squared error loss is calculated according to this formula. So let us find the distribution of the squared error loss. It means that we just have to substitute the corresponding value instead of X and get the following. For example, when X takes the value x_1, which happens with probability p_1, our squared error loss takes the value (x_1 - x hat)^2.
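The expected loss L(x hat) for a discrete distribution can be computed directly by the sum described above. A minimal sketch; the values and probabilities below are hypothetical, not from the lecture:

```python
# Hypothetical demand distribution, made up for illustration.
values = [10, 20, 30]
probs = [0.2, 0.5, 0.3]

def expected_loss(x_hat):
    """L(x_hat) = sum over i of p_i * (x_i - x_hat)^2."""
    return sum(p * (x - x_hat) ** 2 for x, p in zip(values, probs))

# Different predictions give different average losses:
print(expected_loss(20.0))  # 50.0
print(expected_loss(21.0))  # 49.0
```

For this distribution the prediction 21 (which happens to be the expected value of X) gives a smaller average loss than 20.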
The same thing happens for all other possible values of X. Now we can find the expected value of this loss, which is our L of x hat. To do so, we take the values (x_i - x hat)^2, multiply them by the probabilities p_i, and sum them up. Now we want to minimize this function with respect to the variable x hat. As we know from calculus, to find a minimum or maximum of some function, we can use the derivative. So let us find the derivative of this expression. To find the derivative of a sum, we find the derivative of each term of the sum, so we have the following. Each p_i is a constant, so we can move it out of the derivative sign. To find the derivative of (x_i - x hat)^2, we use the well-known rules and get 2 times (x_i - x hat); we also have to put a negative sign here according to the chain rule, but it is not important now. Then we multiply this by p_i. So our derivative is the sum over i of -2 p_i (x_i - x hat). We see that L is actually a quadratic function of x hat, the corresponding graph is a parabola, and at the only point where the derivative is equal to 0, this function attains its minimum value. So we have to find the value of x hat where this expression is 0. Let us do it. First of all, let us get rid of the 2 and the negative sign, because they are just constants, and we can divide the equation by a constant. So we have the sum over i of p_i (x_i - x hat) equal to 0. Now we can expand this sum. In this sum, x hat is a constant, and we can move it out of the summation sign; we also move this term to the right-hand side of the equation. On one side we see the summation of all the probabilities p_i, and we know that this sum has to be equal to 1. On the other side we see the summation of the products of the values that our random variable X can take by the corresponding probabilities, so it is the expected value of X. Now we see that the optimal x hat equals the expected value of X. This is a really important result.
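The derivation above can be checked numerically. A minimal sketch, using the same kind of hypothetical discrete distribution: the derivative, the sum of -2 p_i (x_i - x hat), vanishes exactly at the expected value of X, is negative to the left of it, and positive to the right of it, as expected at the bottom of a parabola.

```python
# Hypothetical distribution, made up for illustration.
values = [10, 20, 30]
probs = [0.2, 0.5, 0.3]

# E[X] = sum over i of p_i * x_i
mean = sum(p * x for x, p in zip(values, probs))

def loss_derivative(x_hat):
    """d/d(x_hat) of sum_i p_i (x_i - x_hat)^2 = sum_i -2 p_i (x_i - x_hat)."""
    return sum(-2 * p * (x - x_hat) for x, p in zip(values, probs))

print(mean)                   # 21.0
print(loss_derivative(mean))  # essentially 0: the minimum is at E[X]
```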
If you want to predict the value of a random variable, and you have squared error loss, then you can choose the expected value of this variable as your prediction. We will see in the future that if we choose a different loss, we will get a different optimal value. But squared error loss is very popular due to its good mathematical properties. So this is the first and simplest example of solving a prediction problem. Unfortunately, this whole story is purely imaginary, because to find the expected value of a random variable, you have to know its distribution. In reality, we usually don't have the distribution of the random variable that we are interested in; we have only some data sampled from this random variable. However, this information about the importance of the expected value gives us some hints on what to do in the general case.
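As a sketch of that hint: when the distribution is unknown but we can observe samples from it, the sample mean serves as an estimate of the expected value. The distribution below is hypothetical and, in the spirit of the last paragraph, we pretend we cannot see it, only the data drawn from it:

```python
import random

random.seed(0)

# Hypothetical "true" distribution, hidden from the predictor.
true_values = [10, 20, 30]
true_probs = [0.2, 0.5, 0.3]
data = random.choices(true_values, weights=true_probs, k=10_000)

# Without knowing the distribution, we estimate E[X] by the sample mean:
prediction = sum(data) / len(data)
print(prediction)  # close to the true expected value, which is 21
```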