We're moving toward model training and testing, so let's continue preparing for the modeling process. Just to be clear, let's reset the stage. The goal here is, in some sense, to estimate a function f using a set of input variables, our predictors or explanatory variables, that will help classify our outcome variable Y as either investment grade or speculative grade. Y is our outcome variable: it's 1 if investment grade and 0 otherwise. (It could be 1 and -1; that doesn't matter.) All of the X's are our model inputs, our predictors or explanatory variables. From the last video, our starting point for this exercise will be all the credit risk KPIs we discussed at the outset to build an economic understanding of what corporate credit risk is and how we measure it.

Now, that said, let's take a look at the correlation matrix for all of these predictors, which I've presented in a heat-map format. Let me explain what this colorful picture shows. Each row and each column corresponds to a different variable: the first row is the current ratio, the first column is the current ratio, the second row is the quick ratio, the second column is the quick ratio, and so on. Each number in the matrix is the correlation between a pair of variables. You'll notice that the entries on the diagonal are nothing but ones, because those represent the correlation of a variable with itself: the correlation between the current ratio and the current ratio is 1, as is the correlation between the assets-to-equity ratio and the assets-to-equity ratio. The coloring is just there to help identify stronger or weaker correlations. On the scale, stronger positive correlations get darker colors, with 1 being the strongest, and stronger negative correlations get lighter colors. We don't have anything perfectly negatively correlated, but we do have some correlations below -0.4, even below -0.5, as we see for the debt-to-assets ratio and the interest coverage ratio.

Now, why am I showing you this? Well, I labeled this slide "redundancy" with a question mark because, when we first introduced these measures, we bucketed them by what they were trying to measure. That first group of liquidity measures (the current ratio, quick ratio, and cash ratio) are all trying to capture liquidity. Not surprisingly, they're also positively correlated, and judging from the numbers as well as the darkness of the shading, very strongly so. Likewise, we have some measures that are very strongly negatively correlated. Think about the debt-to-capitalization ratio relative to the interest coverage ratio, or debt-to-assets relative to interest coverage: we've got debt in the numerator of one ratio and interest expense, which scales with debt, in the denominator of the other, so there's a very strong negative correlation.

The point of this exercise is to question the idea that we can just throw all of these variables into the model and let it sort out what works and what doesn't. That's not always the best strategy. There's a notion of model parsimony, or simplicity, that is very important and quite useful when it comes to forecasting out of sample. We may want to pare down the number of variables we look at, and that's something we'll investigate a little later on, but it's something to be aware of. Of course, anything that's perfectly correlated is just going to create all sorts of problems for the model. (A short code sketch below shows one way to build this kind of heat map and flag highly correlated pairs.)
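Here's a minimal sketch of that workflow in Python, assuming the KPIs live in a hypothetical file `credit_kpis.csv` with one column per ratio; the file name and the 0.8 redundancy threshold are illustrative choices, not part of the course materials:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file: one column per credit risk KPI
# (current_ratio, quick_ratio, cash_ratio, ...).
ratios = pd.read_csv("credit_kpis.csv")

# Pairwise correlations; the diagonal is 1 by construction.
corr = ratios.corr()

# Heat map: darker cells flag stronger positive correlations,
# lighter cells stronger negative ones.
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="Blues")
plt.tight_layout()
plt.show()

# Comb through the matrix programmatically: keep the lower triangle
# (each pair appears once) and flag correlations above a threshold.
lower = corr.mask(np.triu(np.ones(corr.shape, dtype=bool)))
pairs = lower.stack()
print(pairs[pairs.abs() > 0.8].sort_values())
```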
Perfect collinearity is something else we could identify from this correlation matrix. Now, with thousands of variables, visualizing the matrix is completely infeasible, but we can of course comb through it programmatically, or we can use data compression techniques like principal components analysis, which is a little beyond the scope of what we're discussing today.

The next thing I want to talk about is train-test splits, which should really be done at the very start of this process. So I'm doing a little bit of a no-no here, but it eases the flow of the discussion. What we need to do is take our sample of data and split it into pieces: a training piece, on which we'll be training our model, estimating it, and trying to find the best specification; and a test piece, or holdout sample, that we only look at once we've finalized the model. What we want to avoid is overfitting. We don't want to build a model that describes the data we have really well but then functions very poorly when new data comes in; the classification accuracy may be very poor. So we really want to train our model on the data that we have, and only once we've settled on a model determine whether it's worthwhile on the back end with the test data.

So let's train-test split our sample. Here's our full data set: 10,540 observations. My training data has 8,432 observations and my test data has 2,108, roughly an 80/20 split. What I'm showing you here are the variable averages for all of our predictors as well as the outcome variable; investment grade is an indicator equal to 1 when the observation is investment grade and 0 otherwise. The averages for each of these variables across the full, training, and test sub-samples are very similar. In fact, what I'm not showing you are the test statistics for the differences between these numbers. They're all very small, so the differences are economically and statistically indistinguishable, which is exactly what you should expect if you randomly allocate observations to training and test. (A sketch of this split and sanity check follows below.)

So let me just summarize. We've done all of our data acquisition and verification, and our data preparation and model preparation. What we're ready to transition to now is the actual modeling, which is coming up next.
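For reference, here's a minimal sketch of the split and sanity check just described, using scikit-learn; `credit_sample.csv` is a hypothetical file name, and the 80/20 proportions match the 8,432 / 2,108 split above:

```python
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

# Hypothetical file: predictors plus the 0/1 investment_grade outcome,
# 10,540 rows in the lecture's sample.
df = pd.read_csv("credit_sample.csv")

# Randomly allocate 80% of observations to training and 20% to test.
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Variable averages across the full sample and both sub-samples:
# after a random split these should be very close.
print(pd.DataFrame({"full": df.mean(), "train": train.mean(), "test": test.mean()}))

# Two-sample t-tests on each column: with random allocation, the
# train/test differences should be statistically indistinguishable.
for col in df.columns:
    t_stat, p_val = stats.ttest_ind(train[col], test[col])
    print(f"{col}: t = {t_stat:.2f}, p = {p_val:.2f}")
```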