Okay, so far we have talked about the relation between machine learning and AI, the broad landscape of machine learning methods, and the differences between machine learning in finance and machine learning in tech. Now it's time to go a bit deeper and see how machine learning works in more detail. Let's start with the most important goal of machine learning, namely the goal of generalization. Let's come back to the diagram that shows the process of learning by an agent, which we introduced in the last lecture. Here we have a machine learning agent on the left that interacts with the environment on the right. The agent perceives its environment and learns about it in order to perform its actions better and achieve its goals. For example, your agent might be an algorithm that computes the probability of default on a credit card for a given customer. The goal is then to make the prediction as accurate as possible when averaged over many cardholders. And this is achieved through learning: the algorithm learns useful information from available data in order to perform well on new, unseen data. This ability to perform well on new data is called generalization, and this ability is the ultimate goal of learning in the context of machine learning. Now let's try to quantify the notion of generalization ability. Let's consider a classical regression problem where we have a vector of predictors, or features, x, and a scalar y, which we believe is driven by these predictors. For example, y can be the return of a particular stock, and x can be a vector of market indices, such as the S&P 500, the Dow Jones Industrial Average, the NASDAQ Composite index, market sentiment indices, and so on. We assume a relation y = f(x) + epsilon that describes how y is generated from x, and we want to predict the value of y given the value of x. Here epsilon is a random error term with zero mean and variance sigma squared. We want to find an approximation to the function f(x) that generalizes well. What does it mean? 
It means that this approximation, call it f_hat(x), should minimize the squared loss (y - f_hat(x))^2 for all data, both seen and unseen. A natural mathematical objective for this minimization is therefore the expectation of the squared loss, E[(y - f_hat(x))^2]. The symbol E here means that we take the expectation with respect to all possible combinations of x and epsilon that might be encountered in our new, yet to be seen, data. So once we have defined an objective function, let's see what happens if we try to minimize it. First, we expand the square and replace the expectation of the sum by the sum of expectations. Second, we express the expectation of a square as the sum of the variance and the square of the expectation, and use the original equation y = f(x) + epsilon to evaluate the last term. Next, we combine all terms to get the final formula: E[(y - f_hat(x))^2] = (E[f_hat(x)] - f(x))^2 + Var[f_hat(x)] + sigma^2. What we got has a special name: it's called the bias-variance decomposition. This is the previous formula written in prose, so let's read it again. It says that the generalization error for a regression problem is equal to the sum of the bias squared, the variance, and the noise. Let's go over these terms one by one. The bias squared is the square of the expected difference between the approximate predictor f_hat and the true predictor f. The variance measures the sensitivity of the estimator to the choice of data; note that this is a function of f_hat only, but not of f. Finally, the noise term sigma squared doesn't depend on f_hat or f at all. It's a property of the data which is beyond our control. Now, it's important to note that the bias-variance decomposition is mostly of theoretical value. Indeed, we don't know the true predictor f(x), and we don't know how to compute the expectation values appearing in the bias and variance terms. However, it shows two distinct patterns by which the generalization error of your model can deteriorate: your model can have a large bias, or it can have a large variance. 
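Although the decomposition is mostly of theoretical value, it can be verified numerically on synthetic data where the true f(x) is known. The following is a minimal Python sketch, assuming a toy setup of my own (a sine function as the true predictor, a cubic polynomial fit as f_hat, and Monte Carlo averaging over many simulated training sets); none of these choices come from the lecture.

```python
import numpy as np

# Toy illustration of the bias-variance decomposition (all settings here
# are illustrative assumptions, not from the lecture).
rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(2 * np.pi * x)          # the (normally unknown) true predictor

sigma = 0.3                               # noise std; the noise term is sigma**2
x0 = 0.25                                 # point at which we measure the error
n_train, n_trials, degree = 30, 2000, 3   # sample size, repetitions, model capacity

# Refit f_hat on many independent training sets and record its prediction at x0.
preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0.0, 1.0, n_train)
    y = f_true(x) + rng.normal(0.0, sigma, n_train)
    coefs = np.polyfit(x, y, degree)      # fit f_hat on this training set
    preds[t] = np.polyval(coefs, x0)

bias_sq = (preds.mean() - f_true(x0)) ** 2   # (E[f_hat(x0)] - f(x0))^2
variance = preds.var()                       # Var[f_hat(x0)]
noise = sigma ** 2                           # sigma^2

# Independent Monte Carlo estimate of the full generalization error at x0.
y0 = f_true(x0) + rng.normal(0.0, sigma, n_trials)
gen_error = np.mean((y0 - preds) ** 2)

print(bias_sq + variance + noise, gen_error)  # the two should be close
```

Up to Monte Carlo error, the sum bias_sq + variance + noise reproduces the directly estimated generalization error, which is exactly what the decomposition asserts.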
Moreover, in most practical situations, there is a general trend for machine learning algorithms to show a strong negative correlation between the bias and the variance. This is called the bias-variance tradeoff. To reduce bias, you might be willing to consider more complex models that incorporate more features. These tend to reduce the bias because your model is more flexible: it has more capacity to adjust to the data. However, the flip side of this is that adding more features would generally increase the variance. Vice versa, you can use a simple linear model for your regression with a low number of features. Then your model will have a low variance, but it will have a large bias. Therefore, building the right model for your problem requires finding the right level of model complexity that matches the data complexity. But maybe the right level of model complexity and the model architecture could be established beforehand, guided by some characteristics of the data, for example its dimension, but not the data itself. Imagine, for example, how simple life would be if we could have simple rules like: use a random forest algorithm for classification if the dimension of your data is below 100, and use neural networks if it's above. In the next video, we will check whether these ideas can fly.
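The tradeoff itself can also be seen numerically. Here is a hedged sketch on the same kind of synthetic data, with hypothetical choices of my own (polynomial degree as the complexity knob, a sine function as the truth): for each degree, the bias squared and the variance of the fitted predictor at a fixed point are estimated by refitting on many simulated training sets.

```python
import numpy as np

# Toy illustration of the bias-variance tradeoff as model capacity grows
# (all settings are illustrative assumptions, not from the lecture).
rng = np.random.default_rng(1)

def f_true(x):
    return np.sin(2 * np.pi * x)          # the (normally unknown) true predictor

sigma, n_train, n_trials, x0 = 0.3, 30, 1000, 0.4

def bias_var(degree):
    """Monte Carlo estimate of bias^2 and variance of a degree-d fit at x0."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = f_true(x) + rng.normal(0.0, sigma, n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x0)
    return (preds.mean() - f_true(x0)) ** 2, preds.var()

results = {}
for d in (1, 3, 9):                       # low, medium, high model complexity
    results[d] = bias_var(d)
    b2, v = results[d]
    print(f"degree {d}: bias^2={b2:.4f}  variance={v:.4f}")
# Typically bias^2 falls and variance rises as the degree (capacity) grows.
```

A degree-1 fit is the simple, rigid model from the lecture (low variance, large bias), while the degree-9 fit is the flexible one (small bias, larger variance); the intermediate degree is where the two errors are balanced.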