So in this section, we're going to go back to our machine learning techniques, and we're going to evaluate them in the context of our zero-one discrete variables. One of the key issues in machine learning is to identify tuning parameters and, in particular, to set those tuning parameters to the values that give us the best out-of-sample estimates. In particular, we're going to look at loss functions, and we're going to ask how well a particular tuning parameter does on a particular set of out-of-sample data. This leads us to be careful about not overfitting. In traditional regression, if we use all the data, we can typically overfit and thereby do poorly in out-of-sample testing. Let's go back to a topic that we looked at in previous modules, which is regularized regression; we can do this in traditional regression or in logistic regression. In this context, we add a penalty term to the loss function that we minimize, or to the likelihood that we maximize, when we fit a regression. So we see here our tuning parameter Lambda times a sum of how large the Betas are. We're going to penalize Betas that are very large, whether positive or negative, and push them down. In this case we use the absolute value; that's called the lasso, and as Lambda gets larger and larger, it will shrink the Betas down. This tuning parameter Lambda is typically set through cross-validation. But the basic idea is that we take this new objective function, the traditional regression loss plus the penalty, minimize it, and do this in a form where we save some data that we set aside.
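As a minimal sketch of this shrinkage effect, assuming scikit-learn is available (its `Lasso` estimator calls the tuning parameter Lambda `alpha`), we can fit the lasso at a few penalty levels on synthetic data and count how many Betas remain nonzero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two of ten features actually drive y in this synthetic example.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Count how many Betas survive as the penalty Lambda (sklearn's `alpha`) grows.
nonzero = {}
for alpha in [0.01, 0.1, 1.0]:
    betas = Lasso(alpha=alpha).fit(X, y).coef_
    nonzero[alpha] = int(np.sum(betas != 0))
print(nonzero)
```

With a larger penalty, fewer coefficients survive, which is exactly the shrinkage behavior described above.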
We're going to take part of the data out, and that's going to be the testing data; the part that is not taken out will be the training data. We'll then do our analysis on the training data and look at how well our model performs with different Lambdas over that test set of data. This is a figure that we showed in previous MOOCs. You can see that as you train harder and harder, the in-sample loss function gets lower and lower. However, when you go out-of-sample, which is the red line here, you see that if you overfit, then when we try to estimate out-of-sample we do much worse. We want to find that middle ground, where we set that tuning parameter, often called the complexity parameter, such that it is somewhere in between: we want to minimize the out-of-sample loss. If you recall, we looked at cross-validation previously, where we take the set of data and divide it up into groups, in this case k groups, or what are called k-folds. Say we have five folds. We would take out 20 percent of the data, train on the 80 percent remaining, and look at how the model does on the held-out 20 percent. Then we take out another group; on the second line there, we would take out the middle 20 percent and train on the part left behind. We do that five times, and we do that for every candidate Lambda. So we want to find the Lambda that gives us the best out-of-sample performance on the test data. So remember, we separate the training and the testing; we've called this train and test. We train on one set, look to see how the model does on the other part, and find the Lambda, the tuning parameter, that gives us the best estimate. So you can think of this as practice, practice, practice: we're going to train, then we're going to test, and we're going to do this over and over again.
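The five-fold procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score` (an assumed setup on synthetic data, not the course's own code): for each candidate Lambda we average the out-of-sample error across the five held-out folds and pick the Lambda with the lowest average.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
alphas = [0.001, 0.01, 0.1, 1.0]

# Mean out-of-sample MSE for each candidate Lambda (sklearn returns
# negated scores, so we flip the sign back).
cv_mse = {
    a: -cross_val_score(Lasso(alpha=a), X, y,
                        cv=kfold, scoring="neg_mean_squared_error").mean()
    for a in alphas
}
best_alpha = min(cv_mse, key=cv_mse.get)
print(best_alpha, cv_mse[best_alpha])
```

Each candidate Lambda is trained five times, once per fold, and the winner is the one that does best on the data it never saw during training.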
In this case, five times, or k times, and we're going to find the tuning parameter that gives us the best estimate; that's the essence of cross-validation. Now, there are some other approaches for addressing these questions. If you recall, by shrinking the Betas down, many of the Betas will drop to zero. So in some cases, if we have a large set of features and we want to identify the most important ones, we can do that by studying different Lambdas. As Lambda gets bigger and bigger, we'll have shrunk the Betas down to a smaller number of non-zeros, and the ones that remain are the strong ones; those are going to be the most important. Now, there are some other methods for performing this same type of analysis. One is to add a different type of regularization term. Here we have two: an L1 norm and an L2 norm. We can then have two parameters, Lambda 1 and Lambda 2, which gives us more parameters to tune in search of the best out-of-sample performance. This has the name elastic net, and it has some benefit when the explanatory variables are correlated. So we have that approach. Another approach to finding features is a sequential process called stepwise regression. This goes back many decades: you find the feature that's most important, you put that in, then you look for the next most important feature, and you continue in that sequential manner. That's called forward stepwise regression. Now, with very powerful computers, we can look at what's called best subset selection. Instead of proceeding sequentially, we can find the best set of l features out of the larger set, the best group, and we can do that for l equal to one, two, three, four, and so on.
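As a hedged sketch of the elastic net on correlated explanatory variables: scikit-learn's `ElasticNet` folds the two penalties Lambda 1 and Lambda 2 into an overall strength `alpha` and a mixing weight `l1_ratio` between the L1 and L2 terms, so the pair `(alpha, l1_ratio)` plays the role of the two Lambdas described above. The data here is synthetic and chosen to show the correlated-features case.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
z = rng.normal(size=100)
# Two highly correlated explanatory variables, the case where the
# elastic net is said to help, plus one irrelevant feature.
X = np.column_stack([z + 0.01 * rng.normal(size=100),
                     z + 0.01 * rng.normal(size=100),
                     rng.normal(size=100)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha sets the total penalty strength; l1_ratio mixes the L1 and L2 norms.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```

With a pure lasso, one of the two correlated features often takes all the weight and the other is dropped; the L2 component of the elastic net tends to keep both in with similar coefficients.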
That requires a lot of computation, and we can't say too much in general about what's best, but we do know that empirical testing is going to drive this in a big way. In particular, in our case, we're going to add a third step, which is to save some data that we call the validation data; this is data that we've not looked at during train and test. So we've taken some other data out, we do our train and test with the data remaining, and then we look at how the model does on the validation data. Remember that when we get to the next sections, we want to look at how we do on the validation errors. Another important issue in machine learning is to take these techniques and form an ensemble, which is the best of the best. We can take a group of methods and combine the ones that give us the best result overall. You can think of it as taking weak classifiers, putting them together, and getting a strong classifier, a much better estimate as we go along. Now, in many cases we have large numbers of features. There can be hundreds of features in econometrics, and with many lags the numbers grow; in the case of health care, we can get millions of possible features, since we might have millions of people. This requires quite a lot of computation, and it's going to require us to deal with how we find the best solutions for these problems when the numbers are large. So the take-home points today are as follows. First, machine learning is a work in progress. It works very well in some cases, especially when we have a large enough database and high-performance computers. Second, we use this train-and-test procedure over and over again to see how well we do on out-of-sample data. There is some theory we could use to guide our approach; however, in many cases it depends on domain-specific, very empirical results.
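The three-way split described above, validation data set aside first, then train and test on what remains, can be sketched with two calls to scikit-learn's `train_test_split` (the split fractions here are illustrative assumptions, not the course's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(size=200)

# First hold out validation data that we never touch during train and test.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=3)

# Then split what remains into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=3)

print(len(X_train), len(X_test), len(X_val))  # 120 40 40
```

All tuning, including the cross-validation over Lambda, happens on the train and test portions; the validation set is consulted only once, at the end, to report honest out-of-sample errors.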
Third, in the case of investment management, this is complicated, first of all, by a lack of data: we don't have millions of quarters or millions of months to look at. Conditions can change, so we have to worry about the fact that something that causes one recession may not cause a different recession, and that's partially because of behavioral issues; people tend to act on emotion at times, and that becomes reality when people see something and start to act in a way that changes their behavior in certain environments. Next, there's this idea of micro data, which we're going to talk about a bit later; micro-level data is just becoming mainstream. In other words, looking at individuals, how they speak, how they feel, how they invest, is going to have an impact on the overall economy. That's going to be part of how we look at this problem as we go along: we're going to take the traditional econometric approaches and add them to machine learning, which is aimed at individuals. Also, another complication is the fact that regulations can change. Looking back at those old charts, we saw that in 1979 Paul Volcker, who became the head of the Federal Reserve, changed the way interest rates were managed. Up to that point, we saw interest rates rising sharply; after that, we saw a very different approach, and ever since then, interest rates have dropped dramatically. So regulations can have an impact on this as well.