In this video, I want to start talking about the data that we're going to use for our classification model. Let's see. Here we go. So, remember, let's take a step back first, right? Everything starts with the scientific process: clearly articulating the question, starting with some hypotheses, developing some guesses, some answers to the question, understanding the empirical implications of those hypotheses, and then testing them out in the data. Where we are now is really in testing it out with the data. That is the start of the data science workflow, and that begins with acquisition and verification. So I'm going to grab Standard & Poor's Compustat database, which contains credit rating and financial information for all publicly traded firms in the United States, as well as some privately held firms. That will be the base of our data. I'm going to impose some screens and do a little bit of cleaning and verification, as we always do, but I won't bore you with that at this point in time. That isn't to de-emphasize its importance, but simply to gloss over a bunch of tedious code that had to be written to prepare this data. The result of that is a sample consisting of 10,540 observations for 1,400 firms, ranging from 1995 to 2016. So remember from the last video, in which I spoke about firm-year observations: we have data for 1,400 firms at multiple points in time, right? So let's take a look at our data. In particular, let's do a little exploratory data analysis, EDA, okay? What I'm going to do here is just plot the distribution of ratings in the sample. And what I've done to sort of ease the presentation is I've collapsed the ratings, these are S&P ratings, into whole-letter buckets. So what I mean by that is the AA bucket consists of AA minus, AA, and AA plus. The BBB bucket consists of BBB minus, BBB, and BBB plus. And similarly for most of the other ratings where it's relevant.
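To make the bucketing concrete, here is a minimal sketch of collapsing notched ratings into whole-letter buckets. It assumes ratings arrive as plain strings like "AA-" or "BBB+"; the function name `to_letter_bucket` and the handful of sample ratings are made up for illustration and are not the actual Compustat preparation code.

```python
from collections import Counter


def to_letter_bucket(rating: str) -> str:
    """Collapse a notched rating (e.g. 'AA-', 'AA+') into its whole-letter bucket ('AA')."""
    # Strip surrounding whitespace, normalize case, then drop trailing notch signs.
    return rating.strip().upper().rstrip("+-")


# Hypothetical ratings, only to show the mapping; not drawn from the real sample.
sample_ratings = ["AA-", "AA", "AA+", "BBB-", "BBB", "BBB+", "B+", "AAA", "CC"]

# Count observations per whole-letter bucket, which is what the histogram plots.
bucket_counts = Counter(to_letter_bucket(r) for r in sample_ratings)
print(bucket_counts)
```

Ratings without a notch, such as AAA or CC, pass through unchanged, which is why the collapsing only matters "where it's relevant."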
And so what we can see here is something that's roughly symmetric, almost bell shaped in some sense, not to suggest that it's normally distributed, for a number of reasons, not the least of which is that it's discrete. But you can see that most of the data, most of the observations, are clustered here around this dashed line, which is meant to distinguish between investment grade to the right and speculative grade to the left. So this is where most of the data is, between the B and A rated firms. In fact, we see relatively few, 93 firm-year observations out of the 10,000, that are AAA. And even fewer that are rated CC, which is on the verge of default. Now, this of course is not our outcome variable. We're not trying in this exercise, which is stylized but hopefully illustrative, to predict the specific rating notch or even the letter rating bucket in this case. What we're trying to do is distinguish between investment grade and speculative grade. And if I look at that distribution here, where 1 corresponds to investment grade and 0 corresponds to speculative grade, you can see we get really almost a 50-50 split in the ratings. 51.1% of the 10,540 observations are investment grade, and 48.9% are speculative grade. That's actually going to make our job a little bit easier in some sense, because we've got nice representation of both categories, both classes. Contrast that with some datasets, such as bank fraud, where the vast majority, 99% or sometimes even more, of the data is non-fraudulent transactions, and you have a very small fraction, a very small number of fraudulent transactions that you're trying to identify. Not the case here. For this last table, what I'm going to do is take my sample and bifurcate it into two groups, okay, based on whether you're speculative grade or investment grade. And then I'm going to compute the average of each credit risk KPI that we discussed previously for each of those groups.
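As a rough illustration of that 1/0 outcome variable, here is a small sketch. It assumes the standard S&P convention that BBB minus and above is investment grade; the names `investment_grade_flag` and `class_shares` and the six toy ratings are hypothetical, not the code behind the slide.

```python
# Whole-letter buckets treated as investment grade (assumes the usual BBB- cutoff).
INVESTMENT_GRADE_BUCKETS = {"AAA", "AA", "A", "BBB"}


def investment_grade_flag(rating: str) -> int:
    """Return 1 for investment grade, 0 for speculative grade."""
    bucket = rating.strip().upper().rstrip("+-")
    return 1 if bucket in INVESTMENT_GRADE_BUCKETS else 0


def class_shares(ratings):
    """Share of observations in each class, keyed by the 1/0 flag."""
    flags = [investment_grade_flag(r) for r in ratings]
    share_investment = sum(flags) / len(flags)
    return {1: share_investment, 0: 1 - share_investment}


# Toy ratings, made up for illustration only.
toy_ratings = ["AAA", "A-", "BBB-", "BB+", "B", "CCC+"]
print(class_shares(toy_ratings))
```

Checking the class shares first is a quick way to see whether you're in the balanced situation described here or in something closer to the fraud example.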
And I'm going to run a little paired t-test and show you the t-statistic over here in the third column. So if we look at the first row, I see that speculative grade firms have an average current ratio (remember, the current ratio is a measure of liquidity, current assets to current liabilities) of about 1.78. Investment grade firms actually have a slightly lower liquidity ratio of 1.64, which at first glance might seem a little bit odd. But remember, the investment grade firms don't need the liquidity to support their finances. They can run sort of a leaner, meaner operation on the liquidity side because they've got a lot more money coming in through operations relative to any sort of financial obligations they might have. We see that, in fact, the current ratio is statistically significantly larger among speculative grade firms than among investment grade firms, as indicated by this t-statistic of -3.29. So looking across the current ratio, quick ratio, and cash ratio, we see that on average, speculative grade firms actually tend to have more liquidity. Now, when we turn to coverage ratios, most notably interest coverage, debt service coverage, and cash coverage, we see the exact opposite. Investment grade firms have a much stronger coverage profile than speculative grade firms. And let's focus on the interest coverage ratio to make the discussion a little bit more specific and precise. Remember, interest coverage is the ratio of EBITDA, a proxy for operating income, to interest expense. So the average investment grade firm has $13 of operating income per dollar of interest expense, whereas the average speculative grade firm has only $5.38. If we throw in principal, those numbers will both move down, but you can see how we're getting very close to almost one for one among speculative grade. Speculative grade firms have $3.55 of operating income per dollar of interest and principal owed over the next year, compared to the $6.27 that investment grade firms have.
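Here is one way the group comparison could be sketched. The video calls it a paired t-test; since the two groups are different firms with different numbers of observations, this sketch assumes a two-sample (Welch, unequal variance) t-statistic instead, and that is a swap on my part rather than a statement of how the table was actually built. The sign convention, investment grade mean minus speculative grade mean, is also an assumption, chosen because it matches the negative t-statistic on the current ratio. The function name and the numbers are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(speculative, investment):
    """Two-sample t-statistic with unequal variances.

    Computed as (investment mean - speculative mean) / standard error,
    so a negative value means the speculative grade average is larger.
    """
    # statistics.variance is the sample variance (divides by n - 1).
    standard_error = sqrt(
        variance(speculative) / len(speculative)
        + variance(investment) / len(investment)
    )
    return (mean(investment) - mean(speculative)) / standard_error


# Made-up current ratios, purely illustrative; not the Compustat sample.
speculative_current_ratio = [1.0, 2.0, 3.0]
investment_current_ratio = [1.0, 1.5, 2.0]

t_stat = welch_t_statistic(speculative_current_ratio, investment_current_ratio)
print(round(t_stat, 4))
```

On the toy numbers the statistic comes out negative because the speculative grade group has the higher average, the same direction as the current ratio row in the table.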
And these differences across the speculative grade, investment grade divide aren't just economically significant. They're statistically significant as well, as suggested by the large t-statistics. Finally, the last category of credit risk metrics, the leverage ratios, shows something that's relatively consistent with our coverage ratios, albeit in reverse. Let's think about debt-to-EBITDA. The debt-to-EBITDA ratio, or leverage ratio, of a speculative grade firm is 4.25, suggesting that they have on average $4.25 of debt outstanding for each dollar of operating income that they're pulling in, compared to the $1.97 of debt outstanding for each dollar of operating income that an investment grade firm has. So clearly, right, speculative grade firms are quite a bit more heavily levered, and we see that across the board in terms of all of the leverage ratios, right? If we look at debt-to-assets, $0.45 of each dollar of assets is funded with debt, as opposed to equity, for speculative grade firms, in contrast to the $0.26 of debt financing for each dollar of assets among investment grade firms. And these are all highly significant as well. So we can see that there are stark differences in the credit risk characteristics, the credit risk KPIs, between speculative grade and investment grade firms. And that's important because that's going to prove useful when we want to go predict or classify firms as speculative grade or investment grade, because we'll get a lot of spread, big differences in things like coverage ratios, leverage ratios, and to a lesser extent, liquidity ratios. Put differently, these appear as though they will likely be useful predictors for our classification problem that we'll turn to next.
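To tie the ratio definitions together, here is a minimal sketch of how the coverage and leverage KPIs discussed above could be computed for a single firm-year. The `FirmYear` class, its field names, and the input numbers are all hypothetical; they are not Compustat variable names or values from the sample.

```python
from dataclasses import dataclass


@dataclass
class FirmYear:
    """One firm-year observation with hypothetical field names."""
    ebitda: float            # proxy for operating income
    interest_expense: float
    total_debt: float
    total_assets: float


def interest_coverage(f: FirmYear) -> float:
    """Dollars of operating income per dollar of interest expense."""
    return f.ebitda / f.interest_expense


def debt_to_ebitda(f: FirmYear) -> float:
    """Dollars of debt outstanding per dollar of operating income."""
    return f.total_debt / f.ebitda


def debt_to_assets(f: FirmYear) -> float:
    """Share of each dollar of assets that is funded with debt."""
    return f.total_debt / f.total_assets


# Made-up inputs, for illustration only.
example = FirmYear(ebitda=100.0, interest_expense=20.0,
                   total_debt=400.0, total_assets=1000.0)
print(interest_coverage(example), debt_to_ebitda(example), debt_to_assets(example))
```

Note that a real implementation would need to decide what to do when interest expense or EBITDA is zero or negative; this sketch simply assumes positive denominators.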