In this session, we're going to look at several different machine learning methods. Implementations of these techniques are now very easy to find in Python, R, and other languages, so they are straightforward to apply. Three methods that have proved promising for the area of crash regimes are K-nearest neighbors, decision trees, and boosting, and we're going to discuss those today. Then we're going to look at ensemble forecasts, where we take these methods together and try to pick the best of the best.

Let's go back to an example that we presented earlier in the online course. Here we ask: who are the people who default on their loans during crash periods? In particular, we're going to look at two vulnerability indices, one on the vertical axis and one on the horizontal axis. These are things such as the amount of loans people have relative to their income. The red dots are the people who defaulted, and the yellow dots are the people who did not. As people become more vulnerable, they are more likely to default during a crash. So we want to find a way to separate the data into groups. We have training and test data, and we want to evaluate how well we form the boundaries between these groups.

The first approach we're going to look at is called K-nearest neighbors. Go back to our original picture and imagine a new point on that graph. Our prediction for that point is going to depend on what people did in that same neighborhood. In particular, K is the number of neighbors we look at around that person or firm, and we want to know who among them defaulted. Then we do a simple vote. Let's say K is 15: we look at the 15 nearest neighbors, and if eight or more of them defaulted, we predict that this person will default.
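The majority-vote idea just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two vulnerability indices and the default labels below are simulated, and scikit-learn's `KNeighborsClassifier` stands in for whatever implementation you use.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Simulated borrowers: two vulnerability indices each (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Assume borrowers with a high combined vulnerability tend to default.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# K = 15 neighbors vote; a majority (8 or more defaults) predicts default.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

new_borrower = np.array([[1.5, 1.5]])  # highly vulnerable on both indices
print(knn.predict(new_borrower))       # predicts default: [1]
```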
If seven or fewer defaulted, we predict that the person will not default. K is our tuning parameter, and we want to know which value works best given our train-and-test method: we go back to our usual approach of training and testing, and set K to give us the best out-of-sample performance. What's the best K? Here we look at some examples.

One of the important inputs to K-nearest neighbors is how you define distance. We see two approaches: on the left, Euclidean distance, the L2 norm, and on the right, the L1 norm, the absolute deviation, sometimes called the Manhattan distance. Using this approach to find the best K, we see that K equals 15 turned out to give one of the best out-of-sample results. We see a fairly clean boundary between the people who default and those who don't: when a new person comes in, if they fall on the pink side we say they will end up defaulting, and if they fall on the blue side we say they will not.

If we set K equal to one, then we identify only the single nearest neighbor, and on the training data the fit will be perfect: each training point's nearest neighbor is itself, so if that person defaulted, we predict a default. K equals one is therefore the best on the training data, but not necessarily on the out-of-sample data. Typically K equals one is not the best choice; higher values of K tend to do better, in this case K equals 15.

The second approach we're going to look at is called decision trees. In this context, we divide the population we're looking at into rectangles, and we do this one step at a time.
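Before the tree details, the K-nearest-neighbors tuning discussed above can be sketched as follows: try both distance metrics and several values of K, keeping the out-of-sample accuracy. The data are simulated, and the candidate K values and metric names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Simulated vulnerability indices and default labels (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Try L2 (Euclidean) and L1 (Manhattan) distance with several values of K,
# recording the out-of-sample (test) accuracy for each combination.
for metric in ["euclidean", "manhattan"]:
    scores = {}
    for k in [1, 5, 15, 31]:
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        scores[k] = knn.fit(X_train, y_train).score(X_test, y_test)
    print(metric, scores)

# K = 1 memorizes the training data: perfect in sample, usually worse out of sample.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(knn1.score(X_train, y_train))  # 1.0
```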
We find the feature, and the location along that feature, that separates the population best in terms of our loss function: how many people end up on one side versus the other. We look for the best split, the one that gives us the best estimate across the resulting groups. There are many different techniques for pruning the tree, that is, for deciding how deep the tree should go, but we're not going to go into that now. The advantage of decision trees is that we can interpret the results.

Here's a simple example. Let's first look at the right side of the figure and then come back to the left side. The far right shows the splitting. The first, most significant split divides the population vertically into left and right: X_1 is above or below a threshold. If it's below the threshold, we drop down to the next node in the tree and split on a horizontal basis, so on the left panel we see a horizontal split. At the next step on the right, we split again, this time on the x-axis a second time, so we have two vertical lines on the right side. Then, as we go deeper into the tree, we split again, and now we have another horizontal split. So decision trees do this vertical and horizontal splitting one step at a time. In other words, they cannot produce boundaries like the ones we see on the left here, with overlapping nonlinear relationships. We assume that if you end up in a leaf of the tree, at the very bottom, then you either default or you don't: Y equals one or Y equals zero. This is one of the earliest methods in machine learning, and it's a very simple approach.

The third method is called boosting. This method is a variation on the others.
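The one-split-at-a-time, axis-aligned partitioning just described can be sketched with a shallow tree. The data are simulated so that default depends on a single threshold; scikit-learn's `export_text` prints the fitted rules, which is what makes trees easy to interpret.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated data where default depends on a threshold in the first feature.
rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 2))
y = (X[:, 0] > 0.5).astype(int)

# A shallow tree: every node is one axis-aligned (vertical or horizontal) split.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the tree as human-readable if/else rules.
print(export_text(tree, feature_names=["x_1", "x_2"]))
```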
Take a decision tree, or indeed many other methods. We fit it once, splitting and solving as before. Then we do it again, but this time we change the loss function a bit: we look at the errors that occurred, the people who ended up on the wrong side of the divide, and we penalize more heavily the cases we got wrong. We repeat this multiple times, going through the whole process again each time. Staying with decision trees, at each step we refit with the adjusted loss function. Do this, say, 10 times, and then take a vote: we combine those 10 solutions into a voting or ensemble-type forecast.

Gradient boosting often achieves some of the best out-of-sample performance, the lowest loss. But it can be very difficult to understand. Here in the graph we see a very non-smooth boundary: the pink regions tend to have variations, and it's harder to interpret in this context. It may not make intuitive sense to some people.

So let's now summarize these three methods. We've talked about classification with discrete variables, estimating whether there is going to be a crash or not. We looked at three methods that are pretty intuitive, I think: K-nearest neighbors, which votes using historical data; decision trees, which are easy to understand; and boosting, which is a variation on these other themes. We know that if we want to apply machine learning effectively, we need tuning parameters, so we're going to have to find the best tuning parameters for each method.
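A boosted ensemble of shallow trees, in the spirit described above, might look like the following sketch. The data are simulated with an interaction that no single split can capture; scikit-learn's `GradientBoostingClassifier` fits each new tree to the errors left by the previous ones (via gradients of the loss rather than explicit re-weighting, but the idea is the same).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Simulated data with an interaction: default when both indices have the same sign.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Each new shallow tree focuses on the mistakes of the trees before it;
# the final forecast combines all of them.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
gb.fit(X, y)
print(gb.score(X, y))  # in-sample accuracy of the combined ensemble
```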
So we're going to need enough data to do this tuning, so that we can train and test over and over again. In particular, we're going to save some data, which we call validation data, for our final analysis. We will do that next time.
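The three-way use of the data described here, training and testing repeatedly to tune a parameter and then scoring once on held-out validation data, can be sketched as follows. The split fractions and candidate K values are illustrative, and the data are simulated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Simulated vulnerability indices and default labels (hypothetical).
rng = np.random.default_rng(4)
X = rng.normal(size=(600, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out validation data for the final analysis; tune K on the rest.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=4)

# Pick the K with the best out-of-sample (test) accuracy, then refit on
# all non-validation data and score once on the validation set.
best_k = max([1, 5, 15, 31],
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
             .fit(X_train, y_train).score(X_test, y_test))
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
print(best_k, final.score(X_val, y_val))
```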