There are two methods for dealing with dirty data. First, you can clean it through SQL; second, as I've alluded to quite a lot, there's a UI tool called Cloud Dataprep that's going to be the basis of two labs for this module.

Let's start with SQL. If your data needs to conform to business rules, you can do a lot of validation directly within SQL in BigQuery. You can say a field is allowed to be NULL, meaning it's allowed to have an empty value, or that it's a required field: if I want people to provide a first name and a last name, both of those fields have to be populated. You can proactively check whether those values are NULL after the fact, and you can apply conditional logic, like CASE WHEN or IFNULL, to test for data validation. You can also push those requirements into upstream systems. Since BigQuery is an analytics warehouse, you can require primary keys and other constraints in the source systems before the data ever gets into BigQuery. If you're the BigQuery admin and guru of your data analysis group and someone is giving you bad transactional data, don't clean up their mess; tell them to take this class and to validate their data before it even makes it into your systems.

Next, confirm your data is accurate. Maybe it matches the data types and follows all the rules about what's allowed to be NULL or required, but you still need to make sure it actually makes sense. Take item price, for example: does item price times quantity ordered always equal the subtotal? You can set up a test or a check for that inside of SQL. You can also look values up against an objective reference dataset. If you know there are only 50 valid US states, you could keep a lookup table and ask, using the IN operator, whether the state in this column appears in that list. Using a subquery or a JOIN, which you're going to cover a lot more in the second part of this course, you can nest that lookup directly inside the IN operator. We'll cover that.

Completeness is a hard one. Completeness means exploring the data to see whether anything is missing: checking for skew, as mentioned before, looking for NULL values, and handling them. If a value is NULL, do this; if that value isn't present, fall back on this other value; and if that one isn't present either, fall back again. That's the COALESCE function. You can also enrich your dataset with other datasets, mashing them up to form the complete picture. Say you have IRS tax filings for 2014 and 2015 and you want to bring both of those together into one consolidated table; you can do that with a UNION. Or say you already had the 2014 and 2015 data UNIONed together and you wanted to bring in organizational details like the name of the charity or its address; you can enrich your dataset with a JOIN. We'll cover more of that in the Cloud Dataprep lab.

Cleaning it also means making it consistent. As we mentioned: one fact in one place. Depending on what dirty data you're dealing with, there's a whole host of functions available, as you've seen before. Parsing and cleaning up dates is something data analysts spend a lot of their time on, myself included. Substring parsing, the old left/right style of string handling, is covered by SUBSTR in standard SQL, and you can replace values as well. Sketches of these techniques follow below.
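Here's a minimal sketch of the validity and accuracy checks described above. The table and column names (orders, first_name, item_price, quantity_ordered, subtotal, a us_states reference table) are assumptions for illustration, not part of the course dataset.

```sql
-- Hypothetical orders table; all names below are assumptions for illustration.
SELECT
  order_id,
  -- Validity: flag rows that are missing required name fields.
  (first_name IS NULL OR last_name IS NULL) AS missing_name,
  -- Conditional logic: substitute a default when a value is NULL.
  IFNULL(first_name, 'UNKNOWN') AS first_name_clean,
  -- Accuracy: does item price times quantity actually equal the subtotal?
  CASE
    WHEN item_price * quantity_ordered != subtotal THEN 'SUBTOTAL_MISMATCH'
    ELSE 'OK'
  END AS accuracy_check,
  -- Accuracy against a reference dataset: is this a real US state?
  -- The lookup is nested directly inside the IN operator as a subquery.
  state IN (SELECT state_code FROM `your-project.reference.us_states`) AS valid_state
FROM
  `your-project.sales.orders`;
```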
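For completeness and enrichment, here's a sketch of COALESCE, UNION ALL, and JOIN using the IRS filings example. The table and column names are placeholders, not the actual IRS 990 public dataset schema.

```sql
-- Completeness: fall back through alternative values with COALESCE.
-- Column names (phone_primary, phone_secondary) are made up for illustration.
SELECT
  ein,
  COALESCE(phone_primary, phone_secondary, 'no phone on file') AS contact_phone
FROM
  `your-project.irs.filings_2015`;

-- Enrichment: stack the 2014 and 2015 filings with UNION ALL,
-- then JOIN in organizational details like the charity name and address.
WITH all_filings AS (
  SELECT ein, tax_year, total_revenue FROM `your-project.irs.filings_2014`
  UNION ALL
  SELECT ein, tax_year, total_revenue FROM `your-project.irs.filings_2015`
)
SELECT
  f.ein,
  f.tax_year,
  f.total_revenue,
  o.charity_name,
  o.address
FROM all_filings AS f
JOIN `your-project.irs.org_details` AS o
  USING (ein);
```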
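And for consistency, a sketch of date parsing, substring parsing, and value replacement in standard SQL. Again, the raw table and its columns are assumptions for illustration.

```sql
-- Consistency: standardize formats so one fact lives in one shape.
SELECT
  -- Parse a date stored as a string like '03/15/2015' into a real DATE.
  PARSE_DATE('%m/%d/%Y', order_date_string) AS order_date,
  -- Substring parsing: take the first 5 characters (e.g. a ZIP code prefix).
  SUBSTR(postal_code, 1, 5) AS zip5,
  -- Replace values: normalize inconsistent spellings.
  REPLACE(country, 'U.S.A.', 'USA') AS country_clean
FROM
  `your-project.sales.orders_raw`;
```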
Again, the earlier you do this, before you start any really complex analysis, the cleaner you make your dataset and the better off you're going to be. That doesn't mean you have to replace your raw data table. All of these steps become part of a recipe that runs downstream, so the logic is repeatable when new dirty data comes into your system.

Last but not least, uniformity. You want to make sure all of your data types are similar and that you're comparing apples to apples, especially when it comes to visualizing your results. This is where, in Data Studio, you'll correctly label the units for the charts and graphs you're creating. And throughout the rest of your code, particularly for that Mars rover burning up example, if you're using the English system or the metric system, document that heavily with comments in your code.

Here's an interesting example. We've got some weather data that we're pulling from NOAA, and we're basically saying: give us all the weather data where the US state is not null, meaning it's present, and limit it to the top 10 results. This is me throwing another trick question at you. The state has to be present, so why does the query still show blank state values when it's clearly filtered on IS NOT NULL? The answer is that those values aren't NULL. They're blank: a valid empty string, or maybe someone hit the spacebar once, so you can't see anything there. A true NULL value in BigQuery looks like what you see in the latitude and longitude columns; it literally displays N-U-L-L. If someone in their wrong mind actually stored the string value "NULL", you have my permission to yell at them, because that would drive a data analyst absolutely nuts. With enough SQL analysis you could still probably tell a true NULL apart from the string "NULL" or a blank, but if you ever run into that one, just let me know. A sketch of the trick query and one way to fix it follows below.
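The exact query from the slide isn't reproduced here; this sketch assumes the NOAA GSOD stations table in the BigQuery public datasets, which has a state column, and the fix shown (filtering out blank strings with TRIM) is one possible approach, not necessarily the one used in the lecture.

```sql
-- The trick query: IS NOT NULL alone still lets empty-string states through,
-- because a blank is a perfectly valid (non-NULL) string value.
SELECT
  usaf, name, state, lat, lon
FROM
  `bigquery-public-data.noaa_gsod.stations`
WHERE
  state IS NOT NULL
LIMIT 10;

-- One way to exclude both true NULLs and blank or whitespace-only strings:
SELECT
  usaf, name, state, lat, lon
FROM
  `bigquery-public-data.noaa_gsod.stations`
WHERE
  state IS NOT NULL
  AND TRIM(state) != ''
LIMIT 10;
```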