Now, let's see how machine learning can address the problem of predicting bank failures. The model we want to try here is logistic regression, which we introduced in this lesson. To remind you of the formula of logistic regression, here it is: if X is a vector of all features of a bank, such as financial ratios or macroeconomic variables, and W are adjustable weights, then the probability of a bank failure is a sigmoid function of their weighted sum. So let's discuss our dataset and the features that we want to use for logistic regression. First, our dataset includes 471 banks that failed between 2001 and 2015. In addition, it has 9,375 non-defaulted banks for the same time period. So our data is very unbalanced: we have far fewer positive examples than negative examples. I mean positive according to our terminology, not positive for the banks or the FDIC, as these are failure events. We can balance the data by using what is called downsampling. In doing this, we keep all records for failed banks one year prior to the failure and, in addition, keep about an equal number of records for non-failed banks. So let's keep 500 random records for non-failed banks. As for the dates of these records, they can be sampled randomly among the call report dates corresponding to one year prior to failure for the failed banks. As a result, we have a balanced, downsized dataset of about 1,000 records for the failed and non-failed banks. Now, let's talk about the features we use for this problem. The dataset that we have contains a number of financial ratios, such as net income to total assets, non-performing loans to total loans, the logarithm of total assets and so on, as well as some macroeconomic factors, such as GDP growth, stock market growth and so on. All these predictors can be used in the present problem, though it turns out that some of them are very important while others have low predictive power and can therefore be skipped altogether. Finally, we have to make a test dataset.
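As a minimal sketch of the two ideas just described, the sigmoid probability and the downsampling step, here is what they might look like in Python. This is illustrative only: the actual bank dataset is not available here, and the column name "failed" and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def failure_probability(x, w, b):
    """Logistic regression: P(failure) is the sigmoid of the weighted sum W.X + b."""
    z = np.dot(x, w) + b
    return 1.0 / (1.0 + np.exp(-z))

def downsample(df, label_col="failed", n_negative=500, seed=0):
    """Keep all failed-bank records and a random sample of non-failed ones,
    producing a roughly balanced dataset (hypothetical column name "failed")."""
    failed = df[df[label_col] == 1]
    non_failed = df[df[label_col] == 0].sample(n=n_negative, random_state=seed)
    # Concatenate and shuffle the balanced sample
    return pd.concat([failed, non_failed]).sample(frac=1.0, random_state=seed)
```

With all weights zero in the sum, the sigmoid returns 0.5, which matches the intuition of "no information, fifty-fifty odds".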
This can be done by randomly splitting our dataset into train and test datasets. In the experiments that I will show you next, I had 310 failed banks in the train dataset and 161 failed banks in the test dataset. Now, before looking at the results of such a logistic regression model for bank failures, let's just take a look at the data itself. In these graphs, I show you scatter plots of various financial ratios for different failed and non-failed banks. Each point on the graph has two coordinates: the x coordinate is the logarithm of total assets for the bank, and the y coordinate is a particular financial ratio. Failed banks are painted red, while non-failed banks are shown in green. As you can see here, the red points are nearly linearly separable from the green points, except for a couple of outliers. These pictures on their own should make us quite optimistic about the results that we expect from logistic regression for this problem. This appears to be very clean data, a relatively rare case in finance. In accordance with these expectations based purely on visualization of the data, we find that logistic regression works very well for this problem. The graph on the left-hand side shows you the so-called ROC curve for this problem. We have not talked about metrics such as the ROC curve and the related measure called the area under the curve, or AUC, as we will cover them in more detail in our course on supervised learning. But qualitatively, the steeper the curve goes on this graph, the better. The accuracy score, which we did explain in this course, is 96 percent, which is an excellent result for models of this sort. The graph on the right shows you the decrease of the test error obtained with the TensorFlow implementation of logistic regression for this problem. Finally, the graph on the bottom has to do with the problem of feature selection. There are multiple ways to select the most predictive features for a given machine learning problem.
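To make the evaluation pipeline concrete, here is a rough sketch of the split-fit-score workflow using scikit-learn. The data below is synthetic, generated to be nearly linearly separable like the bank data in the scatter plots; it is not the actual bank dataset, and the feature weights are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
# Synthetic labels: a noisy linear rule, mimicking nearly separable data
y = (X @ np.array([2.0, -1.5, 1.0, 0.5]) + 0.3 * rng.normal(size=n) > 0).astype(int)

# Random train/test split, as in the bank-failure experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print("test AUC:", roc_auc_score(y_te, proba))
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

On clean, nearly separable data like this, both the AUC and the accuracy come out close to one, consistent with the 96 percent accuracy reported for the bank-failure problem.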
One of the simplest ways is to look at the p-values of different predictors in logistic regression. The graph on the bottom illustrates another approach to feature selection that is based on the use of an algorithm called random forest. This algorithm, which we will discuss in our course on supervised learning, provides an alternative model for predicting bank failures. It turns out that it works as well as logistic regression for this particular problem. But in addition to providing an alternative predictive model for the same problem, random forests can also be used to find the most important features in our problem. Each feature is represented by a bar on this diagram, and the height of each bar indicates the importance of that feature for the problem. As this diagram suggests, there are only a few features among all features present in our dataset that are really important. I'm not showing you which features are the most important ones. Finding this will be part of your homework for this week, where you will analyze the problem of bank failures among other assignments. So bank failures was our first use case for classification methods in finance. There are also many other financial applications for probabilistic classification models. For example, predicting consumer defaults on credit cards or mortgages can be done using the same methods. In trading, some tasks are commonly formulated as classification problems as well. For example, for value investing, which we discussed earlier, all stocks can be classified into undervalued and not undervalued. When such a classification is done, you can use it to come up with an investment portfolio by buying the most undervalued stocks and selling the most overpriced stocks. In your homework for this week, you will develop your practical skills in TensorFlow by working with neural network regression and classification models using equity fundamentals data and bank report data.
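The random forest feature-importance idea mentioned above can be sketched in a few lines with scikit-learn. Again, the data here is synthetic rather than the actual bank data, and the feature names ("ratio_1", "noise_1", and so on) are invented for illustration: two features carry the signal and two are pure noise, so the importance bars for the first two should dominate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 2000
# Hypothetical features: two informative "ratios" plus two pure-noise columns
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 2))
X = np.hstack([informative, noise])
y = (informative @ np.array([2.0, -2.0]) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The bar heights in the lecture's diagram correspond to these importances
for name, imp in zip(["ratio_1", "ratio_2", "noise_1", "noise_2"],
                     forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Sorting the features by importance and keeping only the top few is exactly the kind of feature selection the diagram on the bottom illustrates.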
The Jupyter Notebooks that you will be working on in these assignments will be based on the notebooks that we used in our demos. So this was our very busy Week 2, which was devoted to supervised learning and its uses in finance. In the next week, we will talk about unsupervised learning. Good luck with your homework, and see you next week.