In our last video we dealt with the concept of limited dependent variables, using Lending Club data. The dependent variable in our previous examples was the loan status, which took on a value of 0 if the loan was fully paid or 1 if the loan was charged off. Our independent variable, DTI, was the ratio of debt payments to income. We were testing whether we could use debt to income to forecast the loan status, that is, whether a loan was charged off or not. What we showed was that because loan status can only take on the value of 0 or 1, linear regression is not well suited to this type of analysis. In order to think about this problem, we first want to discuss the concept of an odds ratio. This is a familiar concept to anyone who has been involved in gambling before. For example, in horse racing a horse might go off at 3 to 1 odds. What this means is that for every one time the horse wins, it will lose three times. Alternatively, we could say that out of every four races the horse will win once, so this is a 1/4 or 25% probability of winning. The odds ratio is essentially just the inverse of the odds that you typically see in gambling. So here we would say that the odds ratio is 1:3; again, for every one time that the horse wins, we expect three times that it doesn't. Looking at Lending Club's summary statistics for the data set that we looked at, there were 34,116 cases in which the loan was fully paid and 5,670 cases in which the loan was charged off. This means that without any other information we would say that the odds ratio is 5,670:34,116, or about 0.1662 to 1. Put as a probability, there's about a 14.25% chance, which we get by dividing 5,670 by the sum of 5,670 and 34,116, that the loan will be charged off. What we want to do is use additional information to sharpen this odds ratio: relative to just the basic chance that a loan will be charged off, do we have information that can predict whether that odds ratio will be lower or higher?
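As a quick check on the arithmetic, here is a minimal Python sketch that turns counts into odds and probabilities. The loan counts are the ones quoted above; the function and variable names are made up for illustration. Dividing charged-off loans by fully paid loans gives the odds (about 0.1662), while dividing by the total number of loans gives the probability (about 0.1425).

```python
def probability_from_odds(for_count, against_count):
    """Probability of an event that happens for_count times
    for every against_count times it does not happen."""
    return for_count / (for_count + against_count)

# Horse racing example: an odds ratio of 1:3 (one win for every three losses).
horse_win_probability = probability_from_odds(1, 3)

# Loan counts quoted in the summary statistics.
fully_paid = 34116
charged_off = 5670

# Odds of a charge-off: charged-off loans per fully paid loan.
charge_off_odds = charged_off / fully_paid

# Probability of a charge-off: charged-off loans as a share of all loans.
charge_off_probability = probability_from_odds(charged_off, fully_paid)

print(f"horse win probability:  {horse_win_probability:.2%}")
print(f"charge-off odds:        {charge_off_odds:.4f} to 1")
print(f"charge-off probability: {charge_off_probability:.2%}")
```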
As we discussed before, since our dependent variable is 0/1, a linear regression doesn't make sense in this context. Instead, what we are going to do is assume that the natural log of the odds ratio can be used in a regression. Specifically, what we'll say is that the natural log of the ratio of the probability of a charge-off given a level of debt to income, divided by one minus the probability of that charge-off given a level of debt to income, can be expressed as a line, a plus b times x, where again x is the ratio of debt to income. This function is called a logistic function, and regression using this function is called a logistic regression. The regression tells us how likely the outcome y equals 1 is, given a value of x. So in our particular context we are asking how likely it is that the loan will be charged off given a value for the debt-to-income ratio. If I perform a logistic regression with the data that I have available to me, I will wind up with the following results. I will get an estimated coefficient for the intercept of -2.04 and a slope coefficient for the debt-to-income ratio of 0.018. As you can see from the regression, both of these t-statistics are quite a bit larger than 2 in absolute value, and so we would say that the debt-to-income ratio has significant forecasting power for whether a loan is charged off or not. But these coefficients by themselves don't mean a lot, so let's unpack what their actual meaning is. Just like in a linear regression, the intercept term is telling us something about outcomes when the debt-to-income ratio is 0. Putting this into the logistic function, the regression is going to say the following: the log of the ratio of the probability of a charge-off given that the debt-to-income ratio is equal to 0, divided by 1 minus the probability of a charge-off given that the debt-to-income ratio is equal to 0, is equal to -2.0401. So again, what we are saying here is that the log odds ratio is equal to -2.04.
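Written as an equation, the model described above looks like this. The symbols a, b, x and y are the ones used in the discussion; the hats marking the estimated values are notation added here for illustration.

```latex
\ln\!\left(\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}\right) = a + b\,x,
\qquad \hat{a} \approx -2.04, \quad \hat{b} \approx 0.018
```

At x = 0 the right-hand side reduces to the intercept alone, which is why the log odds ratio at a debt-to-income ratio of 0 is about -2.04.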
Now, I don't have a lot of intuition for thinking about what things mean in logs, so what we want to do is convert this so we can think about it in sort of normal numerical terms. We make it more meaningful by rearranging this particular equation. The first thing we're going to do is exponentiate both sides of the equation. So we'll look at the exponential of the log of the ratio of the probabilities being equal to the exponential of -2.0401. What this is going to tell us is that the probability of a charge-off given that debt to income is equal to 0, divided by 1 minus that probability, which is the probability that the loan is not charged off, is equal to 0.13. What we'll do now is just a little bit of rearranging in order to solve for the probability of a charge-off given a debt-to-income ratio equal to 0. Multiplying both sides by the denominator of the left-hand side, we get that the probability of a charge-off is equal to 0.13 times 1 minus the probability of a charge-off. Bringing all of the terms related to the probability of a charge-off to one side of the equation, we can see that this results in 1.13 times the probability of a charge-off given a debt to income of 0 being equal to 0.13. Finally, dividing through by 1.13, we'll see that the probability of a charge-off given a debt-to-income ratio of 0 is 0.13 divided by 1.13, or 11.5%. So now we know that if the debt-to-income ratio is equal to 0, the probability of a charge-off is 11.5%. What happens if the debt-to-income ratio increases? The slope coefficient is going to tell us how much higher that probability gets with each one-percentage-point increase in debt to income. So if we assume that the debt-to-income ratio is equal to 1, we can now say that the exponential of the log odds ratio, given that debt to income is 1, is equal to the exponential of -2.0401 plus 1 times 0.018047. That is, the exponential of the intercept plus 1 times the slope coefficient.
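The rearrangement for a debt-to-income ratio of 0 can be written out step by step. Here p_0 is shorthand, introduced only for this illustration, for the probability of a charge-off given a debt-to-income ratio of 0.

```latex
\begin{aligned}
\ln\frac{p_0}{1 - p_0} &= -2.0401 \\
\frac{p_0}{1 - p_0} &= e^{-2.0401} \approx 0.13 \\
p_0 &= 0.13\,(1 - p_0) \\
1.13\,p_0 &= 0.13 \\
p_0 &= \frac{0.13}{1.13} \approx 0.115
\end{aligned}
```

The same steps apply at a debt-to-income ratio of 1, starting from odds of e^{-2.0401 + 1 \times 0.018047} instead of e^{-2.0401}.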
Again, rearranging the expression for a debt-to-income ratio of 1, what we'll see is that the probability of a charge-off given a debt-to-income level of 1%, divided by 1 minus the probability of a charge-off given a debt to income of 1%, is equal to 0.1324. Rearranging, we come down to the probability of a charge-off given a debt-to-income ratio of 1 as 0.1324 divided by 1.1324, or 11.69%. So what this is telling us is that if our debt-to-income ratio is 1% instead of 0%, there is a slightly higher chance, 11.69% versus 11.5%, that the loan will be charged off. In general, what we want to know is what the probability of a charge-off is given a debt-to-income level of x. What we will find is that the probability of a charge-off given a debt-to-income level of x is equal to the exponential of -2.04 plus 0.018 times x, divided by 1 plus that exponential. So for example, if a borrower has a debt-to-income ratio of 20%, the probability of a charge-off will be the exponential of -2.04 plus 0.018 times 20, divided by 1 plus that exponential, which comes out to 15.72%. So here we see that an increase from a debt to income of 0% to 20% increases the probability of a charge-off from 11.5% to 15.7%, an increase of about 4.2 percentage points. If we were to plot this function, we would see a nice, closer-to-linear relationship between debt to income and the probability of a charge-off. So unlike what we had before, where we simply had the 0 and 1 variables, we now have something that's much more familiar to us as a regression line equation. We could refine this prediction further by adding more independent, or explanatory, variables. Credit scoring algorithms do exactly this, following an iterative process. They first perform a logistic regression on a subset of the data to determine which variables are statistically significant. They then test the logistic regression on another subset of the data to see how accurate it is out-of-sample.
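The general formula can be checked with a short Python sketch. The coefficients are the estimates quoted in the discussion (-2.0401 and 0.018047); the function name and the choice of debt-to-income values to print are made up for illustration.

```python
import math

INTERCEPT = -2.0401   # estimated intercept quoted above
SLOPE = 0.018047      # estimated slope on the debt-to-income ratio

def prob_charge_off(dti, intercept=INTERCEPT, slope=SLOPE):
    """Probability of a charge-off at a given debt-to-income ratio (in %).

    odds = exp(intercept + slope * dti); probability = odds / (1 + odds).
    """
    odds = math.exp(intercept + slope * dti)
    return odds / (1.0 + odds)

p0 = prob_charge_off(0)    # about 11.5%
p1 = prob_charge_off(1)    # about 11.69%
p20 = prob_charge_off(20)  # about 15.72%

for dti in (0, 1, 5, 10, 20, 30):
    print(f"DTI = {dti:>2}%  ->  P(charge-off) = {prob_charge_off(dti):.2%}")
```

Printing a few values in a row like this is a rough stand-in for the plot: over this range the probability rises smoothly and almost linearly with the debt-to-income ratio.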
They then use the model in real decisions and gauge its accuracy. Finally, the credit scoring agency refines the algorithm and repeats. Much of credit tech is done through machine learning, which is essentially just using the machine to perform these iterative tasks: looking at a large data set and trying to determine which variables are important for determining creditworthiness. The machine essentially performs steps 1 and 2, learns by gauging how well it did in step 3, and then refines its algorithm using the previous information.
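The first two steps of that process, fitting on one subset and checking on another, can be sketched in plain Python. Everything here is an assumption made for illustration: the data are synthetic (generated with made-up coefficients of -3.0 and 0.10, not the Lending Club data), the fit uses Newton's method for a single predictor, and out-of-sample accuracy is measured by log loss against a baseline that ignores debt to income. It is not how any particular credit scoring agency works.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit log-odds = a + b*x by Newton's method (one predictor only)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # 2x2 information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def log_loss(xs, ys, a, b):
    """Average negative log-likelihood; lower means better predictions."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(xs)

# Synthetic data: hypothetical coefficients, NOT the estimates from the lecture.
random.seed(0)
TRUE_A, TRUE_B = -3.0, 0.10
dti = [random.uniform(0.0, 40.0) for _ in range(4000)]
status = [1 if random.random() < sigmoid(TRUE_A + TRUE_B * x) else 0 for x in dti]

# Step 1: fit the logistic regression on one subset of the data.
train_x, train_y = dti[:3000], status[:3000]
a_hat, b_hat = fit_logistic(train_x, train_y)

# Step 2: test it on another subset, against a baseline that ignores DTI.
test_x, test_y = dti[3000:], status[3000:]
base_rate = sum(train_y) / len(train_y)
baseline_a = math.log(base_rate / (1.0 - base_rate))
model_loss = log_loss(test_x, test_y, a_hat, b_hat)
baseline_loss = log_loss(test_x, test_y, baseline_a, 0.0)

print(f"fitted intercept {a_hat:.3f}, fitted slope {b_hat:.4f}")
print(f"holdout log loss: model {model_loss:.4f}, baseline {baseline_loss:.4f}")
```

Step 3, using the model in real decisions, cannot be simulated here. In practice the outcomes of those decisions become new data, and the fit-and-test loop is run again.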