In Module 1, you completed the first two steps in the CRISP-DM framework: business understanding and data understanding. In Module 2, you will work on the next three steps: data preparation, modeling, and model evaluation. Please open your notebook now and be prepared to pause this video and practice in the notebook while watching the lecture. First, we will prepare data for our model. If the notebook has just been opened, you need to run all of the code cells above first. Just select the cell you want to start with, then click "Cell" and "Run All Above." When we invest in a loan, the biggest concern is whether the loan will be paid off or not. The loans in this data set were all initiated before 2011. All the loans in the data set are either fully paid or charged off. We have this piece of information in the loan_status column. We can see that the majority of the loans are fully paid. Since machine learning models can only deal with numerical data, we will need to encode loan status. Here, we map charged off to zero and fully paid to one and create a new column, repaid. This repaid column will be the label in the classification model. That means we will later build a classification model to predict whether a loan will be repaid or not. With the newly created repaid column, we can easily calculate the repay rate of the whole data set with the mean of the repaid column. We can see that about 85 percent of all loans in the data set are fully paid. You will complete data preparation and data preprocessing by completing the tasks in Module 2. In Task 2.1, you will create a new column, loan term year, out of the term column by mapping 36 months to three and 60 months to five. In Task 2.2, you will explore the relationship between several categorical features and the repay rate, then label encode these categorical features. In Task 2.3, you will handle missing values in two columns: revol_util, which is a continuous feature, and pub_rec_bankruptcies, which is a categorical feature.
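The label encoding and the repay rate calculation described above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the notebook's actual code. The DataFrame name loans, the column name loan_status, and the exact label strings "Fully Paid" and "Charged Off" are assumptions about how the data is stored.

```python
import pandas as pd

# Toy stand-in for the loan data (assumption: the real notebook loads its own DataFrame).
loans = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"]}
)

# Map the two outcomes to numbers: charged off -> 0, fully paid -> 1.
status_to_label = {"Charged Off": 0, "Fully Paid": 1}
loans["repaid"] = loans["loan_status"].map(status_to_label)

# Because the label is 0/1, its mean is the share of loans that were repaid.
repay_rate = loans["repaid"].mean()
print(f"Repay rate: {repay_rate:.2%}")
```

Using a 0/1 label is what makes the mean a shortcut for the repay rate: summing the column counts the repaid loans, and dividing by the number of rows gives their share.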
Most machine learning models require training and testing data without missing values. We have to handle missing values in all features we're going to use in the model. For a continuous feature, the simplest way is to fill missing values with the mean value of the column. For a categorical feature, the simplest way is to fill with the mode of the column. That's what you need to do for this task. Before we move on, please pause the video and complete the first three tasks in Module 2. After data preparation and data preprocessing are done, we can construct and train a classification model to make predictions. We will demonstrate this step with the random forest classifier. First, we define the columns that we're going to use. Feature selection is the most important part of modeling. We've learned some techniques for feature selection, such as filter methods, wrapper methods, and embedded methods. Please refer to Machine Learning for Accounting with Python for more details about feature selection. But the most effective way to select features is through business understanding and data understanding. For the purpose of the demonstration, we have provided a list of columns that we're going to use. However, I highly recommend that you explore all of the original features and come up with your own selection. The key columns list contains the columns we're going to use in the next steps. Among the columns, repaid is the label of the classification model. Total payment will be used to calculate portfolio return in Module 3. The others can be used as training features for the classification model. When you choose training features, make sure you only choose features that are available at loan initiation. Features that become available only after a loan is paid off or charged off, such as total payment or recoveries, cannot be included. We then create a clean DataFrame df with only the key columns.
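As a side note on the missing value step from Task 2.3, the mean and mode filling described earlier could look roughly like this. It is only a sketch on made-up values, and your own solution in the notebook may differ. The column names revol_util and pub_rec_bankruptcies follow the lecture, while the toy numbers are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up for illustration).
df = pd.DataFrame(
    {
        "revol_util": [10.0, np.nan, 30.0, 50.0],
        "pub_rec_bankruptcies": [0.0, 0.0, np.nan, 1.0],
    }
)

# Continuous feature: fill missing values with the column mean.
df["revol_util"] = df["revol_util"].fillna(df["revol_util"].mean())

# Categorical feature: fill missing values with the most frequent value (the mode).
# mode() returns a Series because ties are possible; [0] takes the first one.
mode_value = df["pub_rec_bankruptcies"].mode()[0]
df["pub_rec_bankruptcies"] = df["pub_rec_bankruptcies"].fillna(mode_value)

print(df)
```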
Then we define a list of model columns, which contains all features used to train our classification model. You can see that there is no repaid column or total payment column in this list. Now we split the data set into a training set and a test set. Normally, we would split features and labels separately. But here we split the DataFrame df into df_train and df_test. This is because eventually we will need df_test to construct a portfolio in Module 3, and we need the total payment column to calculate portfolio return. We cannot use df_train and df_test directly in the classification. We define d_train and d_test out of df_train and df_test by keeping only the columns in the model columns list. Then we define l_train and l_test from the repaid column in df_train and df_test. Now that our training and test data are ready, we can construct a classification model. Here we will demonstrate with a random forest classifier. First, we create a default random forest classifier, setting only the random state to ensure repeatability. We name the model rfc1, then we train rfc1 with d_train and l_train. After that, we can predict with the trained model on d_test. Lastly, we compare the predictions with the true outcomes for d_test, which are in l_test, to evaluate the model. In our demonstration, the model achieves an accuracy score of 84.9 percent. It means 84.9 percent of all predictions are correct. It looks okay, but we need to compare it with the performance of the zero model. A zero model always predicts the majority class in a data set. In this data set, the majority class is 1, or fully paid. We can check the repay rate of the test set, which is 85.1 percent. It means a zero model actually achieves 85.1 percent accuracy, which is better than our random forest classifier. Does this mean our model is useless? Think again about the purpose of our classification: we want to select a loan portfolio with the help of the classification model.
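The split, train, and evaluate steps just described can be sketched as below. This is an illustration on synthetic data, so the numbers it prints will not match the lecture. The feature names feature_a and feature_b, the column name total_payment, the test size, and the random state value are all assumptions; only the names df, df_train, df_test, d_train, d_test, l_train, l_test, and rfc1 come from the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned DataFrame (feature names are hypothetical).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame(
    {
        "feature_a": rng.normal(size=n),
        "feature_b": rng.normal(size=n),
        "total_payment": rng.uniform(0, 10000, size=n),
        "repaid": (rng.random(n) < 0.85).astype(int),
    }
)
model_columns = ["feature_a", "feature_b"]

# Split the whole DataFrame so that df_test keeps total_payment for later use.
df_train, df_test = train_test_split(df, test_size=0.25, random_state=23)

# Features: only the model columns. Labels: the repaid column.
d_train, d_test = df_train[model_columns], df_test[model_columns]
l_train, l_test = df_train["repaid"], df_test["repaid"]

# Default random forest; random_state is set only for repeatability.
rfc1 = RandomForestClassifier(random_state=23)
rfc1.fit(d_train, l_train)
pred = rfc1.predict(d_test)

model_accuracy = accuracy_score(l_test, pred)
# A zero model always predicts class 1 here, so its accuracy is the test repay rate.
zero_model_accuracy = l_test.mean()
print(f"model: {model_accuracy:.3f}, zero model: {zero_model_accuracy:.3f}")
```

Comparing against the zero model is a quick sanity check on any imbalanced data set: an accuracy score only means something relative to what always guessing the majority class would achieve.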
We will select loans that are predicted to be repaid by the classification model to form a loan portfolio. Therefore, we don't need to care about the overall accuracy rate. The only thing we care about is the repay rate of loans that are predicted to be repaid, or class 1. This is the precision of class 1 in the classification report, which is 0.85. It means that among all the loans that are predicted as class 1, 85 percent of them are actually repaid. If we form a portfolio with these loans, we achieve an 85 percent repay rate, which is similar to that of the whole test data set. Now, let's take a look at class 0 recall, which is 0.01. It means that among all loans that are charged off, our classification model is able to predict one percent of them correctly. This is pretty miserable, but still a little better than the zero model. Since the zero model predicts the majority class, which is 1, for all loans, it never predicts class 0 and has zero recall on class 0. The confusion matrix shows the detailed numbers of the predictions. The rows in the confusion matrix represent the true outcomes. The columns in the confusion matrix represent the predicted outcomes. The second column is the number of class 1 predictions. If we choose a loan portfolio with this prediction, there will be 1,177 plus 6,758 loans in the portfolio. Among them, 6,758 loans are actually fully paid, and 1,177 loans are charged off. The repay rate of this portfolio is about 85 percent. Before we move on, please make sure you fully understand what precision and recall mean in the classification report. You can refer to the model evaluation section in Machine Learning for Accounting with Python for more details. At this point, the portfolio is not better than the whole test set. Let's see how we can improve it. Since the only thing we care about is class 1 precision, we can improve class 1 precision by adjusting the model hyperparameter class weight.
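To make the link between the confusion matrix and the portfolio repay rate concrete, here is the arithmetic for the second column quoted above. It is a small sketch; the function name class1_precision is made up for this illustration, and only the two counts come from the lecture.

```python
def class1_precision(false_positives, true_positives):
    """Share of loans predicted as class 1 that were actually repaid."""
    return true_positives / (true_positives + false_positives)


# Second column of the confusion matrix: loans predicted as class 1.
# 1,177 were actually charged off, 6,758 were actually fully paid.
portfolio_size = 1177 + 6758
precision_rfc1 = class1_precision(false_positives=1177, true_positives=6758)

print(portfolio_size)            # 7935
print(round(precision_rfc1, 3))  # 0.852
```

This is why class 1 precision and the portfolio repay rate are the same number: both divide the repaid loans in the portfolio by all loans selected into it.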
In the next code cell, we construct a random forest classifier and set class weight to balanced. Balanced class weight sets the weights inversely proportional to the class frequencies in the input data. In the data set, about 85 percent of loans are paid, or class 1. By setting balanced class weight, we set the relative weight on class 1 to about 15 percent and on class 0 to about 85 percent. This makes it more difficult for the classification model to predict a loan as class 1. But if a loan is classified as class 1, it's more likely that the loan is fully paid. The default random forest classifier is not very sensitive to class weight changes, so we limit the max depth of the model to seven. Now, we train and predict with this new classifier. The accuracy score of this new model drops from 85 percent to 67 percent. But it has higher class 1 precision, which is about 88 percent. If we form a portfolio with the predictions of this new model, the portfolio will have 649 plus 4,860 loans. Among them, 4,860 loans are actually paid off, and the portfolio achieves an 88 percent repay rate. For Module 2, you will need to train and evaluate a logistic regression model with different class weight settings. You may use the model columns used in the demonstration. You can also define your own model columns. If you choose other columns for your model, please make sure you encode categorical features properly and handle all missing values in the columns. Once you finish all the tasks in Module 2, you can answer the Module 2 quiz questions to wrap up the Module 2 activities.
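To see where the 15 percent and 85 percent figures for balanced class weight come from, the calculation can be reproduced by hand. The sketch below uses the formula that scikit-learn documents for the balanced setting, n_samples / (n_classes * class count). The helper name balanced_class_weights and the toy label list are assumptions made for this illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels with the same imbalance as the lecture: 85 percent class 1.
labels = [1] * 85 + [0] * 15
weights = balanced_class_weights(labels)

# Normalize so the two weights sum to one, to compare them as shares.
total = sum(weights.values())
relative = {cls: w / total for cls, w in weights.items()}

print(weights)   # class 1 gets a small weight, class 0 a large one
print(relative)  # class 1 is about 0.15, class 0 is about 0.85
```

In scikit-learn this setting is passed as class_weight="balanced" when constructing the classifier, and the depth limit mentioned in the lecture corresponds to the max_depth parameter.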