Hello. We're standing outside the National Center for Supercomputing Applications, or NCSA, building. NCSA, which was founded in 1986, is a hub of transdisciplinary research and digital scholarship at the University of Illinois. Originally founded as one of the four initial National Science Foundation-supported centers for supercomputing research, NCSA has expanded its research into advanced cyberinfrastructure, including data access, analysis, and archiving.

In this module, we explore the concept of model overfitting and advanced computational techniques to avoid or minimize its effects in machine learning. Overfitting occurs when a machine learning algorithm learns to reproduce the provided training data too well. This might seem like a good thing, since the model predicts extremely well on the data used to train it. However, the model will fail to generalize, or to accurately predict, on new, unseen data.

To introduce why this is a problem, we will explore the bias-variance tradeoff, in which we strive to minimize the bias in our model predictions by using more complex models to completely capture the signal. At the same time, we strive to minimize the variance in our model predictions by reducing the impact of small fluctuations in the training data, which might simply be the result of noise.

Next, we explore the technique of cross-validation as a means to minimize the likelihood of overfitting. Cross-validation builds on an important concept where we split the data into training, testing, and validation sets. In this approach, we train the model on the training data, determine the optimal model hyperparameters on the testing data, and validate the best model on the unseen validation data. Cross-validation generally repeats this process multiple times: the data are divided into multiple samples, and different parts of the data are used for training, testing, and validation in each pass.
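The splitting and cross-validation steps just described can be sketched with scikit-learn. This is a minimal illustration, not code from the module itself; the Iris dataset, the k-nearest-neighbors model, and the split sizes are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset and model (not specified in the module).
X, y = load_iris(return_X_y=True)

# Hold out a final validation set that stays unseen during tuning.
X_trainval, X_val, y_trainval, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Five-fold cross-validation on the remaining data: each fold serves
# once as the test set while the other folds train the model.
model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(model, X_trainval, y_trainval, cv=5)
print("mean cross-validation score:", scores.mean())

# Final check of the chosen model on the untouched validation data.
model.fit(X_trainval, y_trainval)
print("validation score:", model.score(X_val, y_val))
```

Because the validation set never influences training or hyperparameter selection, its score is an honest estimate of how the model will perform on new data.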
Thus, we end up using multiple parts of the data to cross-validate the multiple model predictions. We also demonstrate this process by using learning and validation curves, which indicate how our model performs with new data sets.

One important aspect of cross-validation is the opportunity to select optimal model hyperparameters while also minimizing the likelihood of overfitting. We will explore different techniques for accomplishing this goal by using the scikit-learn library and different forms of cross-validation, such as leave-one-out, k-fold, stratified k-fold, and shuffle split, as well as the technique of grid search for hyperparameter optimization.

Finally, we will explore formal statistical techniques to minimize overfitting that are known as regularization. These techniques work by adding an extra term to the cost function minimized in the machine learning computation that penalizes complex models. The three main forms of regularization that we will cover in this module are the lasso, ridge, and elastic net. Lasso uses an l1 norm as a penalty term, while ridge uses an l2 norm. Elastic net uses both, along with a hyperparameter that regulates the mixture of the two penalty terms.

By the end of this module, you will be equipped to complete real-world machine learning tasks to classify or predict new insights from complex data. You should feel proud; this is a real accomplishment. Good luck.
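As a preview of the techniques named in this module, the sketch below shows several scikit-learn cross-validation strategies, a grid search over a regularization hyperparameter, and the three regularized regressors. The synthetic dataset and the parameter grid are illustrative assumptions, not values from the module; stratified k-fold is omitted because it applies to classification labels rather than this regression example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import (GridSearchCV, KFold, LeaveOneOut,
                                     ShuffleSplit)

# Illustrative regression data (choices here are assumptions).
X, y = make_regression(n_samples=100, n_features=20, noise=10.0,
                       random_state=42)

# Interchangeable cross-validation strategies from the module.
cv_strategies = {
    "k-fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "leave-one-out": LeaveOneOut(),
    "shuffle split": ShuffleSplit(n_splits=10, test_size=0.2,
                                  random_state=42),
}

# Grid search over the regularization strength alpha, with each
# candidate scored by k-fold cross-validation.
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=cv_strategies["k-fold"])
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_)

# Lasso (l1 penalty), ridge (l2 penalty), and elastic net, whose
# l1_ratio hyperparameter regulates the mixture of the two penalties.
for model in (Lasso(alpha=1.0), Ridge(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, "training R^2:", model.score(X, y))
```

Swapping a different entry from `cv_strategies` into the `cv` argument of `GridSearchCV` changes only how the data are resampled; the search over hyperparameters proceeds the same way.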