[MUSIC] You've now learned how to organize your data and combine multiple data files. It's time to face one of the biggest challenges with almost every real data set: missing values. The best way to prepare for dealing with missing values is to understand the data you have. Understand how missing values are represented, how the data was collected, where missing values are not supposed to be, and where they're used specifically to represent the absence of data. Identifying missing data is the first step in determining what to do with it. In this video, you'll learn about the different types of missing values and how to identify them using MATLAB. A variety of factors can lead to missing values, including failure to load the information, corrupt data, and incomplete recording of the information, among others. It's important to understand why values are missing in the data because they can make your task of learning from the data very difficult, if not impossible. In fact, incomplete data can introduce bias into your analysis, which in turn can produce incorrect predictions. Also, identifying and understanding missing values is important for handling the remaining data correctly. Let's start by understanding how missing data is classified. Typically, data scientists classify missing values into three main mechanisms: Missing At Random, Missing Completely At Random, and Missing Not At Random. These mechanisms will help you later on to decide which approach to use when handling these missing values. Missing At Random means that the reason why the variable is missing is not related to its underlying value, but is instead conditional on other variables. For example, imagine you have data containing information about Blood Pressure in the population. Missing values in this data set can be conditional on Age. In some countries, older people are more likely to have their Blood Pressure checked during a regular checkup than younger people, so readings are missing more often for younger people.
This has nothing to do with the value of their Blood Pressure. So in this case, the pattern of missing values is conditional not on the Blood Pressure reading itself, but on another variable, namely the Age of the patients. Missing Completely At Random means that the missing data mechanism is unrelated to the values of any variables, whether missing or observed. For example, imagine a data set created out of survey responses for customer satisfaction. These types of questions are typically optional. If you were to separate the results into complete responses and missing responses, you might notice that these two groups have no correlation with the customers' ages. Each customer simply chose to skip one or more questions at random. Finally, Missing Not At Random means that the reason the variable is missing is related to the value of the variable itself. For example, if your survey now asked people about their incomes, there may be a higher probability that people with higher salaries do not want to reveal their income, meaning they might just skip the question. In this case, the data is missing because of the very same values you're trying to collect. This is the most difficult category, as incorrect handling of this missing data could lead to strong bias in your results. So what about the flights data set? Well, imagine you listed only the Scheduled Departure Time, Taxi In, and Cancelled columns. The missing values for Taxi In appear to have no relationship with the Scheduled Departure Time, and there's no indication that the values are missing because of the values themselves. However, looking at the Cancelled column, you understand that a missing value for Taxi In is conditional on the cancelled status of each flight. So this is assumed to be a case of Missing At Random. The Scheduled Departure Time does not predict when the value is missing, but whether the value is missing is conditional on another variable, the Cancelled status.
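One way to check this kind of conditional pattern is to compare how often Taxi In is missing for cancelled flights versus flights that flew. The following is a minimal sketch, not the course's own code: the five rows are invented for illustration, and the column name CANCELLED (with 1 meaning cancelled) is an assumption about how the flights table is laid out.

```matlab
% Hypothetical miniature of the flights table; all values are invented.
CANCELLED = [0; 1; 0; 1; 0];        % assumed encoding: 1 = cancelled
TAXI_IN   = [7; NaN; 12; NaN; 5];   % NaN marks a missing taxi-in time
flights   = table(CANCELLED, TAXI_IN);

% Logical vector that is true wherever TAXI_IN is missing.
taxiMissing = ismissing(flights.TAXI_IN);

% Fraction of missing TAXI_IN values within each cancellation group.
isCancelled          = flights.CANCELLED == 1;
fracMissingCancelled = mean(taxiMissing(isCancelled));
fracMissingFlown     = mean(taxiMissing(~isCancelled));

fprintf('Missing TAXI_IN, cancelled flights: %.0f%%\n', 100 * fracMissingCancelled);
fprintf('Missing TAXI_IN, flights that flew: %.0f%%\n', 100 * fracMissingFlown);
```

If the missing fraction is high in one group and near zero in the other, that supports treating the missing values as conditional on the Cancelled column rather than on the Taxi In values themselves.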
This understanding will help you later to determine the best way to handle these missing values. Now, regardless of the mechanism for missing data, you need to identify the missing values. How do you do that? In MATLAB, you use the function ismissing to identify the missing values in your data. ismissing(A) returns a logical array that indicates which elements of the input contain missing values. The size of this resulting array is the same as the size of A. However, note that standard missing values depend on the data type. For numeric and duration values, NaN, or not a number, is interpreted as a missing value. For example, when using ismissing with the TAXI_IN column, note how the entries with NaN return a logical value of 1. But for datetime, string, or categorical variables, missing values are identified differently, as shown in this table. For example, consider the actual departure time variable. Notice how the missing values are identified as NaT, or not a time. Notice also how the missing values in the tail number column are identified as undefined. Note that missing data can also correspond to data that has been incorrectly recorded. For example, imagine the value for Air_Time was recorded incorrectly as 0, or the Taxi_In value was recorded as a negative number. You would want these values to be identified as missing. What do you do in cases like these? The standardizeMissing function replaces values that you specify as indicators with standard missing values. So for Air_Time recorded incorrectly as 0, or Taxi_In recorded as a negative number, you would use standardizeMissing to replace those incorrect values with standard missing values. To recap, you learned that the reasons for missing data generally fall into three different categories, which are helpful in determining how to deal with this data. You also learned how to use the function ismissing to identify missing values, as well as standardizeMissing to replace any value with standard missing values.
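The two functions from this recap can be sketched together as follows. This is an illustrative example under stated assumptions: the four rows are invented, the tail numbers are made up, and the column names AIR_TIME, DEPARTURE_TIME, and TAIL_NUMBER are guesses at how the flights table names those variables. Because standardizeMissing matches specific indicator values rather than a condition, the negative Taxi In values are collected first and then passed in as indicators.

```matlab
% Invented sample rows standing in for part of the flights table.
TAXI_IN        = [7; NaN; -3; 12];
AIR_TIME       = [95; 0; 140; NaN];
DEPARTURE_TIME = [datetime(2015,1,1,8,30,0); NaT; ...
                  datetime(2015,1,1,9,15,0); datetime(2015,1,1,10,0,0)];
TAIL_NUMBER    = categorical(["N001AA"; missing; "N002BB"; "N003CC"]);
T = table(TAXI_IN, AIR_TIME, DEPARTURE_TIME, TAIL_NUMBER);

% ismissing returns a logical array with one entry per table element.
% NaN, NaT, and <undefined> are all flagged; 0 and -3 are not (yet).
mask = ismissing(T);

% Treat AIR_TIME == 0 as an indicator of missing data.
cleaned = standardizeMissing(T, 0, 'DataVariables', 'AIR_TIME');

% Collect the negative TAXI_IN values, then pass them as indicators.
negativeValues = unique(T.TAXI_IN(T.TAXI_IN < 0));
cleaned = standardizeMissing(cleaned, negativeValues, ...
                             'DataVariables', 'TAXI_IN');

% Count missing entries per variable before and after standardizing.
missingBefore = sum(mask, 1);
missingAfter  = sum(ismissing(cleaned), 1);
disp(missingBefore);
disp(missingAfter);
```

Restricting each call with 'DataVariables' keeps an indicator such as 0 from being applied to columns where 0 may be a legitimate value.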
Next, you'll learn how to handle missing values using this information.