Model evaluation and selection is an integral part of the development process in machine learning. In this video, we're going to explore three main aspects of model validation. We're going to see how we can sub-sample our training and testing data and what the underlying hypotheses and constraints are, we're going to explore performance measures, and finally, we're going to explore how we can use significance testing to compare the performance of two or more algorithms. In an ideal scenario, we would like to have data from the entire representative population. In this case, we could train and test the machine learning algorithm on the same data. In theory, the error we would obtain in this case would be similar to the real error rate when the number of samples becomes very large. In reality, though, the error we obtain when we train and test the algorithm on the same dataset is positively biased. This is the reason why, in real applications, we split our data into training data and testing data. Usually the percentage is about 70 percent training and 30 percent testing. However, if we have millions of samples in some applications, we can afford to have 90 percent training data compared to 10 percent for testing. But what is the reason for these numbers? In real applications, we estimate the empirical risk based on a limited number of testing samples, which measure the loss of our classifier. Variations in the empirical risk estimation can result from a number of factors. For example, we can have random variations in the testing set or in the training set, random variation within the learning algorithm, or even random variation with respect to the noise in the classes we consider. One big advantage of the hold-out method is the independence between the training set and the testing set. This provides some guarantees for the performance of the algorithm on data it has not previously been trained on.
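The hold-out split described above can be sketched in a few lines of Python. This is a minimal illustration; the toy dataset of 100 samples and the 70/30 ratio are just the example figures from the lecture:

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=0):
    """Randomly split samples into a training set and a testing set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # shuffle to avoid ordering bias
    cut = int(len(samples) * train_fraction)
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test

# Example: a toy dataset of 100 samples, split 70/30.
data = list(range(100))
train, test = holdout_split(data, train_fraction=0.7)
print(len(train), len(test))  # 70 30
```

Shuffling before cutting matters: if the data are ordered (for example by class or by collection date), an unshuffled split would violate the assumption that training and testing sets are drawn from the same distribution.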
However, we should also consider the confidence intervals around the empirical risk estimation. When we're evaluating a learning algorithm, we shouldn't make a Gaussian assumption, because the loss can be close to zero. In the case of a binary classifier, we can model the error with the Bernoulli distribution. The Bernoulli distribution is a discrete probability distribution with a random variable that takes the value one or zero with probability p or 1 minus p, respectively. In this case, the difference between the true risk and the empirical risk estimation is bounded in terms of the number of samples and the confidence parameter delta. Here delta takes values from zero to one. If, for example, delta is 0.05, then we obtain a 95 percent confidence interval for the estimation of the empirical risk. If we rearrange this equation with respect to the sample size required, we can express it in terms of the parameter epsilon and the parameter delta. We observe that the minimum number of samples required depends on delta only on a logarithmic scale, but on epsilon quadratically. This is a problem because it means that small decreases of epsilon require a huge number of samples. In real applications, we cannot really increase the number of testing samples, because this would result in a smaller training dataset and worse performance. Another way is to sub-sample the data and estimate an average prediction based on sub-sampling strategies. The way we resample our data affects the bias-variance characteristics of the error estimation of the machine learning model. For example, if we have too few examples to test on, this could result in high variance, whereas if we have a small training dataset, this can affect the bias behavior of the model. K-fold cross-validation is one of the most popular error estimation approaches in machine learning. The process here is simple.
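The relation between epsilon, delta, and the number of test samples can be made concrete with a concentration bound. A common choice (assumed here, since the lecture does not show the exact formula) is Hoeffding's inequality for a bounded loss, which gives n greater than or equal to ln(2/delta) / (2 epsilon squared):

```python
import math

def min_test_samples(epsilon, delta):
    """Minimum n so that |true risk - empirical risk| <= epsilon holds
    with probability at least 1 - delta, under a Hoeffding-type bound."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# Halving epsilon quadruples the required sample size,
# while shrinking delta tenfold only adds samples logarithmically.
print(min_test_samples(0.05, 0.05))    # 738
print(min_test_samples(0.025, 0.05))   # 2952
print(min_test_samples(0.05, 0.005))   # 1199
```

The three calls illustrate the asymmetry noted above: tightening epsilon is far more expensive in samples than tightening delta.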
We just divide our dataset into K parts and we use one of these parts for testing and the rest, the K minus one parts, for training. We repeat this process, picking each time a different fold to test on. In this way, we have K estimates of the classifier error. These estimates can be averaged in order to obtain the mean performance of the algorithm. We can also examine the variability of the algorithm across these iterations. A key advantage of K-fold cross-validation is that the testing samples are independent between folds; as we see, there is no overlap. K is normally chosen empirically, and it is typical to have K equal to 10. In this way, we have an acceptable computational complexity, and it results in a relatively less biased estimate. A problem that can arise in K-fold cross-validation is that the data may not be distributed evenly across classes. This problem becomes worse when we already have imbalanced class data, as for example in healthcare applications. In this case, we can use stratified K-fold cross-validation in order to control the distribution of samples across classes, so that each fold represents a distribution similar to that of our original dataset. A special case of K-fold cross-validation is leave-one-out cross-validation. In this case, K takes the value of the number of samples in the dataset. Leave-one-out cross-validation uses almost the full dataset for training, and this can help because it results in a relatively unbiased classifier. We should point out, though, that if we have a very small dataset, this is not a guarantee of an unbiased classifier. The biggest problem with leave-one-out cross-validation is that it becomes computationally too expensive to apply when the dataset starts growing. In a reasonably sized dataset, leave-one-out cross-validation can provide better estimates in the case where the data contain extreme values.
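The K-fold procedure above can be sketched directly on sample indices. This is a minimal pure-Python version for illustration; in practice one would typically use library implementations such as scikit-learn's KFold or StratifiedKFold:

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.
    Each sample appears in exactly one test fold, so test sets never overlap."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

# Example: 10 samples, 5 folds -> 5 disjoint test sets of 2 samples each.
folds = list(kfold_indices(10, 5))
for train, test in folds:
    print(test)
```

With k equal to the number of samples, the same function reduces to leave-one-out cross-validation, which makes the computational cost obvious: one model fit per sample.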
One aspect of validation techniques like K-fold cross-validation or leave-one-out cross-validation is the fact that the estimate is not based on a fixed classifier. Each time, the model is trained again and a new classifier is produced. This has advantages and disadvantages. The advantage is that we can test the stability of the machine learning model across different partitions of the data. The disadvantage is that when we compare the performance between different algorithms, we should bear in mind that we compare the average of the performance estimates across different classifiers, instead of a fixed classifier, which we would have in the hold-out approach. Let's not confuse model selection with model evaluation. Here I show you again the hold-out method for evaluating a machine learning model. In this case, we have split the dataset into three parts. One part is for training, one part is for validation, and one part is for testing. The validation part, in fact, helps us decide which parameters to select. One example is to use the validation set during training to decide when to stop training a deep learning model in order to avoid over-fitting. Once we have used the validation set to select our model, we can then use the test dataset to test the final performance of our model. So far, we saw the advantages and disadvantages of the hold-out method and the simple resampling methods, such as K-fold cross-validation, stratified K-fold cross-validation, and leave-one-out cross-validation. We also stressed the importance of keeping the test partition separate from the tuning of the learning algorithm. In the hold-out method, this was achieved by splitting our samples into three datasets. In this way, we can use the validation dataset to tune and select the appropriate model. In K-fold cross-validation, we can do that by using nested K-fold cross-validation.
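The three-way split just described can be sketched as follows. The 60/20/20 proportions here are an illustrative assumption, not figures from the lecture:

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Split data into training (for fitting), validation (for model
    selection / early stopping), and testing (for final evaluation only)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(samples) * test_fraction)
    n_val = int(len(samples) * val_fraction)
    test = [samples[i] for i in indices[:n_test]]
    val = [samples[i] for i in indices[n_test:n_test + n_val]]
    train = [samples[i] for i in indices[n_test + n_val:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

The key discipline is that the test partition is touched exactly once, after all tuning decisions have been made on the validation partition.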
In this case, the inner loop is used in order to estimate the best parameters and perform model selection, and the outer loop is used for model evaluation. Other methods, like the random subsampling method, are also considered multiple resampling methods. This can be of interest because it's the simplest method to resample the data iteratively, so it can be seen as an extension of the hold-out method. However, what we lose here is the independence of the testing dataset across iterations. The difference between bootstrap sampling and random subsampling is that in bootstrap sampling we draw samples with replacement. This method can be very useful in the case where we don't have enough data to apply cross-validation or even leave-one-out cross-validation approaches. A variant of bootstrap sampling is the balanced bootstrap sampling method, which conceptually is similar to stratified K-fold cross-validation. The goal is to create balanced bootstrap samples across all the available classes. Finally, in permutation testing approaches, we introduce randomness by reordering the data. We're interested in estimating the effect these different reorderings have on the algorithm's performance. Permutation testing provides an estimate of the robustness and stability of the performance. However, it's not appropriate for comparing different algorithms. An important aspect of subsampling in healthcare applications is its relation to the patient. For example, are we interested in evaluating the algorithm within subjects or across subjects? If we look into ECG classification, and if we segment the ECG recordings of each patient into beats, then in the intra-subject scenario we can mix all the beats together and decide how to split them into the testing, validation, and training sets. One problem that arises in this scenario is that ECG beats from the same patient are not independent.
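Bootstrap sampling with replacement, as described above, can be sketched like this. As an illustrative convention (assumed here, not stated in the lecture), the samples never drawn form an "out-of-bag" set that can serve for testing:

```python
import random

def bootstrap_sample(samples, seed=0):
    """Draw a bootstrap sample (with replacement) of the same size as the
    dataset; items never drawn form the out-of-bag set."""
    rng = random.Random(seed)
    n = len(samples)
    drawn = [rng.randrange(n) for _ in range(n)]
    drawn_set = set(drawn)
    in_bag = [samples[i] for i in drawn]
    out_of_bag = [samples[i] for i in range(n) if i not in drawn_set]
    return in_bag, out_of_bag

in_bag, out_of_bag = bootstrap_sample(list(range(1000)))
# With replacement, roughly 63.2% of distinct samples land in the bag
# at least once, leaving about 36.8% out-of-bag.
print(len(in_bag), len(out_of_bag))
```

Because drawing is with replacement, the in-bag set contains duplicates, which is exactly what distinguishes the bootstrap from random subsampling.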
Therefore, there is a positive bias in the algorithm's performance estimation. If we would like to know how our algorithm generalizes to new, unseen patients, then it is better to adopt the inter-subject evaluation protocol, where we exclude whole patients from the testing set instead of excluding percentages of ECG beats. One way to describe the performance of a classification algorithm is with a confusion matrix. A confusion matrix is a square matrix with a number of rows and a number of columns equal to the number of classes. Here, we observe a confusion matrix for the binary classification case. We see that the diagonal elements represent the true positive and true negative numbers of predictions, assuming that positive is the name of one of the classes and negative is the name of the other class. The off-diagonal elements show the false positive and the false negative predictions. Based on the confusion matrix, we can estimate a number of performance metrics. For example, accuracy is the ratio of correctly predicted observations to the total observations. Specificity is the true negative rate of the classifier, and it shows how well the classifier identifies negative cases, whereas sensitivity is the true positive rate of the classifier, which shows exactly the opposite: how well the classifier identifies positive cases. Precision for a class reflects the positive predictive value. Recall is what we call the true positive rate, or sensitivity. The F1 score combines recall and precision in one metric, and it weights the recall and precision of the classifier evenly. Another way to examine the performance of a machine learning algorithm is with a receiver operating characteristic curve. This is plotted with a horizontal axis that denotes the false positive rate and a vertical axis that denotes the true positive rate.
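All the metrics above follow directly from the four confusion-matrix counts; a small sketch with made-up counts for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common performance metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

# Hypothetical counts: 40 true positives, 5 false positives,
# 45 true negatives, 10 false negatives.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

With these counts, accuracy is 0.85 but sensitivity is only 0.80, which illustrates why a single metric can hide how a classifier treats each class, especially under class imbalance.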
As we discussed earlier, the true positive rate is the sensitivity of the classifier, whereas the false positive rate can also be expressed as 1 minus the true negative rate, or equivalently, 1 minus the specificity of the classifier. In this way, the ROC curve shows the trade-off between the sensitivity and the specificity of the classifier. The ROC curve is generated by varying the decision threshold of the classifier, which changes the true positive and false positive rates. It is considered a more rounded measure of performance for a classification problem exactly because it examines different threshold settings. In fact, it has been used not only to study the behavior of a machine learning algorithm but also to perform model selection by identifying the optimal region of behavior. For a random classifier, the ROC curve will collapse to a straight diagonal line. The area under the curve is denoted AUC and is a summary statistic of how well a classification algorithm performs. The area under the curve has been used to compare different classifiers, and there are a number of different variations in order to extend the AUC and the ROC to multi-class scenarios. When we would like to compare the performance of an algorithm against one or more other algorithms, on one or more datasets, it's common to use null hypothesis statistical testing. There are several statistical tests that can be used to compare two algorithms. However, we should pay attention to the underlying assumptions of these tests. For example, the well-known t-test assumes that the distribution is normal. It also assumes that the measurements are independent and that we have an adequate sample size. If the normality assumption is violated, it is common to use non-parametric tests. One example is the Wilcoxon signed-rank test, which is an alternative to the paired t-test. It is based upon the ranks of the absolute differences, and in this way it's more robust to outliers.
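The AUC can also be computed directly from classifier scores, as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal sketch on toy labels and scores (in practice one would use, for example, scikit-learn's roc_auc_score):

```python
def auc_score(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: one positive (score 0.35) is ranked below a negative (0.4),
# so 3 of the 4 positive/negative pairs are ordered correctly.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A random classifier scores about 0.5 under this pairwise view, matching the diagonal ROC line mentioned above, while perfect ranking gives 1.0.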
We should highlight that both for the parametric and the non-parametric tests, it is possible to manipulate the results of the null hypothesis statistical testing approach by increasing the number of samples. This is an overview of the hypothesis statistical testing that can be used to evaluate learning algorithms. It considers three scenarios. One scenario is when we want to compare two algorithms on one domain or dataset. Another scenario is when we have multiple domains and we have to compare two algorithms. The third scenario is when we have multiple algorithms and multiple domains. In the case of multiple algorithms and multiple domains, the choice of statistical methods reflects the fact that we need to correct our measurements for multiple comparisons. In this video, we examined how we can evaluate the performance of an algorithm. We first discussed how to sub-sample our data, from the hold-out method to single subsampling and subsequently to multiple subsampling. We saw their advantages and disadvantages, and why these methods have been developed to improve confidence in the classification performance. We examined a number of performance metrics and how we can use them to compare across different algorithms. Finally, we highlighted the limitations of null hypothesis statistical testing.