Welcome to Dealing with Missing Values in R. In this video, you will learn about the pervasive problem of missing values, as well as strategies you can use when you encounter them in your data. When no data value is stored for a variable in a particular observation, that variable has a “missing value”. Usually, a missing value in a dataset appears as a question mark, “N/A”, zero, or just a blank cell. This example shows the first twelve rows of the airline dataset. There are a few missing values in the delay-cause variables like Carrier Delay shown here, as well as in other variables that we will explore later in the lesson.
There are many ways to deal with missing values, and this holds true whether you’re using Python, R, or any other tool. There’s no one-size-fits-all approach; different approaches fit different contexts. Fortunately, there are some typical options you can consider. The first is to check whether the person or group that collected the data knows something more about the missing data and can determine what the missing values should be. Another possibility is to drop the data where the missing value is found. When you drop data, you can either drop the whole column or just the single data entry with the missing value. If you don’t have a lot of missing data, dropping the entry is usually the best option. If you’re dropping data, you want to take the approach that has the least impact.
To avoid wasting data, replacing data entries is often better than simply dropping them. However, it is less accurate because you need to replace missing data with a guess of what the data should be. One standard replacement technique is to replace missing values with the average value of the entire column. Or, in the airline dataset, you can replace them with zero, based on the assumption that most flights do not arrive late. What you replace missing data with depends heavily on the nature and context of the data.
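The two replacement options just described, the column average or zero, can be sketched in a few lines of base R. This is only an illustration: the `flights` data frame and its values are invented for the sketch and are not taken from the airline dataset.

```r
# Toy data: hypothetical values, not taken from the airline dataset.
flights <- data.frame(
  FlightID = 1:5,
  ArrDelay = c(10, NA, 30, NA, 20)
)

# Option 1: replace each missing ArrDelay with the mean of the observed values.
# na.rm = TRUE makes mean() ignore the NAs; without it the result would be NA.
col_mean <- mean(flights$ArrDelay, na.rm = TRUE)   # (10 + 30 + 20) / 3 = 20
mean_filled <- flights
mean_filled$ArrDelay[is.na(mean_filled$ArrDelay)] <- col_mean

# Option 2: replace each missing ArrDelay with zero,
# under the assumption that a missing value means "no delay".
zero_filled <- flights
zero_filled$ArrDelay[is.na(zero_filled$ArrDelay)] <- 0

print(mean_filled)
print(zero_filled)
```

Both options keep all five rows. Which one is reasonable depends on what a missing value actually means in the data at hand.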
For example, suppose some entries have missing values in the “ArrDelay” column, and the column average for the entries with data is 20. While there is no way to know exactly what the missing values in the “ArrDelay” column should have been, you can approximate them with the average value of the column, 20. But what if the values cannot be averaged, as with categorical variables? For a variable like “Reporting_Flight”, there isn’t an “average” reporting flight type, since the values are not numbers. In this case, one possibility is to use the mode, which is the most common value, or to drop the data point. And, of course, in some cases you may simply want to leave the missing data as missing data. For one reason or another, it may be useful to keep that observation even if some features are missing.
You can identify missing data in the dataset with the is.na() function, which returns TRUE wherever a value is missing. The first example uses is.na() to count the number of missing values in a specific column, “CarrierDelay” in this case. The second example uses is.na() to count the missing values in all columns. All columns are specified by using the dot. From the output, you can see which variables have missing values, like CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and so on.
Now that you know you have missing values, let’s see how to drop or replace them. The easiest way to drop missing values is to manually drop specific rows or columns. In R, a minus sign “-” in front of an index excludes those positions, so you get the complement of the selected set. This example drops the 2nd, 4th, and 6th rows from the dataset. Similarly, you can drop the 2nd, 4th, and 6th columns. To remove data that contains missing values, the tidyverse library has a built-in function called drop_na(). Let’s look at an example.
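Before the drop_na() example, here is a minimal sketch of the mode lookup, the is.na() counts, and the manual drops described above. The on-screen code is not part of this transcript, so the following is an assumption about what it could look like: the `sub_airline` name and all of its values are invented, and the all-columns count uses purrr's formula shorthand, where the dot stands for the current column.

```r
library(dplyr)   # tibble(), summarize(), %>%
library(purrr)   # map_int()

# Hypothetical stand-in for the airline data; every value here is invented.
sub_airline <- tibble(
  Reporting_Flight = c("AA", "DL", NA, "AA", "UA", "AA"),
  ArrDelay         = c(5, NA, 12, 40, NA, 3),
  CarrierDelay     = c(NA, 10, NA, 25, NA, 0),
  WeatherDelay     = c(NA, 0, NA, 0, NA, 0),
  NASDelay         = c(NA, 5, NA, 15, NA, 3),
  SecurityDelay    = c(NA, 0, NA, 0, NA, 0)
)

# Mode of a categorical column: table() ignores NA by default,
# and which.max() picks the most frequent value.
flight_mode <- names(which.max(table(sub_airline$Reporting_Flight)))

# Count the missing values in one specific column.
carrier_na <- sub_airline %>%
  summarize(count = sum(is.na(CarrierDelay)))

# Count the missing values in every column; the dot is each column in turn.
na_counts <- map_int(sub_airline, ~ sum(is.na(.)))

# Negative indices exclude positions instead of selecting them.
without_rows <- sub_airline[-c(2, 4, 6), ]   # keeps rows 1, 3, 5
without_cols <- sub_airline[, -c(2, 4, 6)]   # keeps columns 1, 3, 5

print(flight_mode)
print(carrier_na)
print(na_counts)
```

Note that negative indexing drops rows or columns by position only. It does not look at whether those positions actually contain missing values, which is why a function like drop_na() is more convenient.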
Notice from the summary that “CarrierDelay”, “WeatherDelay”, “NASDelay”, “SecurityDelay”, and “LateAircraftDelay” all have the same number of missing values. By inspecting the data, you can see that dropping the missing values in one column will also solve the missing value issues in the others, because they appear in the same rows. You’re trying to predict the delay of flight arrival in your upcoming analysis, so you should remove the data points that are missing these delay variables. You can do this with one line of code using drop_na(). In the drop_na() function, you specify the names of the columns whose missing values you want to drop; if you name no columns, rows with a missing value in any column are dropped. Either way, entire rows are dropped, not just individual cells. This example drops the rows with missing values in the “CarrierDelay” column. Don’t forget that this line of code does not change the original dataframe. The modified data is not saved unless you assign it to a variable. In this example, the new data is saved as “carrier_delays”. Now, if you compare the original dataset dimensions with the new one, the new one has fewer rows. This is because every row that had CarrierDelay equal to NA was dropped entirely.
To replace missing values, such as NA, with specified values, the tidyverse library has a built-in function called replace_na(). For example, assume that you want to replace the missing values in multiple columns with 0. If you assume that “CarrierDelay = NA” means there was no carrier delay, then the delay in minutes is zero, or “CarrierDelay = 0”. Let’s replace the missing values in the “CarrierDelay”, “WeatherDelay”, “NASDelay”, “SecurityDelay”, and “LateAircraftDelay” columns with zeros. You should always consult the documentation if you are not familiar with a function or method. The tidyverse web page has lots of useful resources. There are, of course, other techniques you can use to deal with missing data, such as replacing missing values with the average of a group instead of the average of the entire dataset.
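The drop_na() and replace_na() steps described above can be sketched as follows. The transcript does not include the on-screen code, so treat this as one plausible version rather than the lesson's exact code: the `sub_airline` and `replaced` names and all values are invented for illustration, while `carrier_delays` is the name mentioned in the narration.

```r
library(dplyr)   # tibble(), %>%
library(tidyr)   # drop_na(), replace_na()

# Hypothetical stand-in for the airline data; every value here is invented.
# The five delay-cause columns are missing in the same rows (1, 3, and 5).
sub_airline <- tibble(
  ArrDelay          = c(5, 22, 12, 40, 7),
  CarrierDelay      = c(NA, 10, NA, 25, NA),
  WeatherDelay      = c(NA, 0, NA, 0, NA),
  NASDelay          = c(NA, 12, NA, 15, NA),
  SecurityDelay     = c(NA, 0, NA, 0, NA),
  LateAircraftDelay = c(NA, 0, NA, 0, NA)
)

# Drop every row whose CarrierDelay is NA.
# drop_na() returns a new data frame, so assign the result to keep it.
carrier_delays <- sub_airline %>% drop_na(CarrierDelay)

dim(sub_airline)      # original: 5 rows, 6 columns
dim(carrier_delays)   # after dropping: 2 rows, 6 columns

# Replace NA with 0 in the five delay-cause columns,
# assuming a missing value means there was no delay of that kind.
replaced <- sub_airline %>%
  replace_na(list(
    CarrierDelay      = 0,
    WeatherDelay      = 0,
    NASDelay          = 0,
    SecurityDelay     = 0,
    LateAircraftDelay = 0
  ))

print(carrier_delays)
print(replaced)
```

Because the delay-cause columns are missing in the same rows, dropping on “CarrierDelay” alone leaves no missing values in the other four. The replacement route keeps all five rows instead.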
You learned how to drop problematic rows or columns containing missing values, and then how to replace missing values with other values. But don’t forget the other ways to deal with missing data. First, you can try to find a dataset or source with higher quality data. Here, you don’t need to worry about this, because the data comes from the Data Asset eXchange, which is a reliable source for acquiring datasets. Second, in some cases, as a last resort, you may want to leave the missing data as missing data.
In this video, you have learned about the importance of identifying missing data and developing a strategy for handling it. This may involve dropping or replacing the missing values, or working with the source of the data either to determine what the missing values should be or to provide a better source of data. And sometimes the best strategy is to leave the missing values in place.