Now here's one of the modules that everybody loves: how to make your queries run faster and save on data processing costs. In this performance optimization module, we'll cover the key types of work that BigQuery does on the back end and how you can avoid falling into performance traps. Then we'll cover one of the most common data traps, which is having too much data skewed toward too few values, and how filtering can be your best friend here. Last up is your best tool for analyzing and debugging performance issues, the Query Explanation map.

Now, there are four key areas that govern a lot of the characteristics of BigQuery performance. Let's take a look at each of them in turn and see some examples. First and foremost is how much data you're reading and writing out of BigQuery: your input and output. Second is the communication between your workers, or slots, which is done via the shuffle mechanism that we talked a little bit about before in the architecture lecture. Third is highly computational work, like large functions, UDFs, and mathematical operations; that's how much of the CPU you're using. The fourth, which is my favorite, is the SQL syntax itself. Are there more efficient ways to write your queries, and what are some of the pitfalls we can avoid when we're writing that SQL? Let's look at each of those.

First up is input and output: how much data we're reading and how much data we're writing. As you might have expected, SELECT * makes the list here. We only want to select the columns that we're actually going to use. If you want a quick preview of a dataset, use the preview mechanic that's built into the table metadata in the BigQuery UI instead. Second, BigQuery performs best when your data is denormalized. That means instead of having a lot of smaller tables and doing joins to bring the data back together, have all those data tables combined into one massive, denormalized table, as we saw in the architecture lecture, and use things like nested and repeated fields. Instead of having orders in one table and customers in another, consider actually bringing those together in a parent-child relationship through that nested and repeated structure we looked at a little bit earlier. Lastly, as you saw in the merging and unioning of datasets, try to use really granular suffixes in your table wildcards to get the most specificity out of your data. As an example, if you remember those GSOD weather tables, something like gsod* is a little bit less specific than gsod20*, which gets only the years from 2000 onward.

Next, let's talk a little bit about BigQuery native storage versus the external data connections that we discussed when ingesting those datasets, and cover some of the downsides of external data connections. First, they'll never be able to be cached, because caching is a function of BigQuery managed storage itself. Second, live edits to underlying data sources can bite you. Say there's a spreadsheet that you're relying on for data and you're quickly pulling it into BigQuery; if somebody is editing that spreadsheet live at the same time, you could create potential race conditions, and BigQuery is not going to know which version of that spreadsheet is the latest to bring in. Lastly, BigQuery has a lot of performance tweaks under the hood that you might not even realize are there.
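Before we get to those under-the-hood tweaks, here's a minimal sketch of that wildcard suffix tip. The dataset path and column names here are my assumptions based on the public `bigquery-public-data.noaa_gsod` weather tables, so adjust them to whatever data you're working with: select only the columns you need, and keep the suffix as granular as you can.

```sql
-- Sketch: read only the columns you need, and use a granular wildcard
-- suffix (plus _TABLE_SUFFIX) so BigQuery scans only the year tables you care about.
SELECT
  stn,    -- station id
  year,
  mo,
  temp    -- mean daily temperature
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`   -- only tables from the year 2000 onward
WHERE
  _TABLE_SUFFIX BETWEEN '10' AND '15'        -- narrow further to 2010 through 2015
LIMIT 10;
```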
One of those under-the-hood tweaks: when you're writing a long query and you filter your dataset all the way at the end, BigQuery might recognize that the filter can actually be performed earlier and do something called predicate push-down. Without you even knowing it, it'll effectively bump your WHERE clause much earlier in the query and filter that data before the rest of the processing happens. Now, keep in mind there are particular use cases where you do want to use an external data connection, like grabbing ad hoc data and performing a one-time analysis on it, much like in an extract, transform, and load step, and then ultimately storing it in BigQuery so you don't have to continuously pull it in. But keep in mind the performance hits and the trade-offs that you're making there.

Next up is the shuffle mechanic. One of the most interesting things under the hood is how those workers communicate with each other to break apart your query and assign that work to a lot of different workers, or slots, behind the scenes. Now, one thing we can do to make their lives easier is pre-filter your data before doing massive joins. If, say, 80 percent of your records are null, go ahead and lop off those rows even before passing that work to the workers. That saves a lot of reads, and it saves a lot of communication time passing that data between each of the different slots. One of the things I'll highlight here is that you can identify potential data skew if one of your workers is getting particularly bottlenecked. We'll take a look at how you can identify and recognize that a little bit down the road.

Here are a few fun querying examples. When we talked about the WITH clause, we said to use it liberally to help break apart those complex queries. But one of the caveats, or drawbacks, is that you don't want to use a bunch of WITH clauses in place of materializing the results of those queries into permanent storage. Here you see in the query on the left, we're using a WITH clause to define a join of the IRS 2015 filings data against the organizational details table. If you ran this query more than once, you'd be continuously doing that join over and over and over again. It's much more performant to just store lines 6-13 as a permanent table and then refer to that later (there's a quick sketch of this a little further below). Now, the main takeaway with WITH clauses is that the results of those named subqueries, like line 6 and line 15 here, are not materialized; they're actually re-queried every single time you reference them more than once in one of your long queries. It's a trade-off between readability of your code and performance, or speed, when you store those results as permanent tables. Here's another example that demonstrates the same concept: if you're only interested in looking at the charities that are schools in the IRS 2015 dataset, and you're commonly applying that filter over and over and over again, it's better to actually store those results as a permanent table and then query that directly.

Now let's revisit our friend, the GROUP BY clause. If you remember, GROUP BY aggregates values over some subset of your data. In this particular case, we're looking at Wikipedia contributors and counting the number of edits that each one made, grouped by contributor ID, for that Wikipedia dataset.
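As a rough sketch of that materialization idea, assuming the public IRS 990 tables at `bigquery-public-data.irs_990` and a destination table of your own (all of these names are assumptions, not the exact query from the slide): store the join once as a permanent table, then point later queries at that table instead of repeating the WITH clause.

```sql
-- Sketch: materialize the join once as a permanent table...
CREATE OR REPLACE TABLE `your_project.your_dataset.irs_2015_filings_with_org` AS
SELECT
  filings.ein,
  org.name,
  filings.totrevenue
FROM `bigquery-public-data.irs_990.irs_990_2015` AS filings
JOIN `bigquery-public-data.irs_990.irs_990_ein` AS org
  ON filings.ein = org.ein;

-- ...then later queries read the stored result instead of redoing the join.
SELECT
  name,
  totrevenue
FROM `your_project.your_dataset.irs_2015_filings_with_org`
ORDER BY totrevenue DESC
LIMIT 10;
```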
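Here's roughly the shape of that query, assuming the public Wikipedia sample table and its `contributor_id` column (again, treat the exact names as assumptions):

```sql
-- Sketch: count edits per Wikipedia contributor. With millions of distinct
-- contributor ids, this GROUP BY forces a lot of shuffling between workers.
SELECT
  contributor_id,
  COUNT(*) AS edit_count
FROM `bigquery-public-data.samples.wikipedia`
WHERE contributor_id IS NOT NULL
GROUP BY contributor_id
ORDER BY edit_count DESC
LIMIT 10;
```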
If you can imagine, there are a ton of different contributors, and doing such a large GROUP BY potentially creates many of what we call forced shuffles, or forced communication steps, between your underlying slots or workers. One way you could get around this is by performing another, higher level of aggregation on your dimension, the contributor here, to bucket them so you have fewer buckets to group by. We'll look at this example in greater depth in just a few slides.

Moving on to the computational aspect of BigQuery: if you want to optimize CPU usage and you're getting resources-exceeded errors, take a look at any JavaScript user-defined functions you might have and see if there's any way to rewrite that logic with native SQL functions or potentially SQL UDFs. The reason is that BigQuery actually has to fire up a separate subprocess to run your JavaScript code. That doesn't mean you shouldn't use UDFs; UDFs have their place when you need functions that are just outside the normal scope of SQL capabilities. But keep that performance trade-off in mind.

Now, as we discussed a lot in the joins and unions lecture, keep in mind that you want to fundamentally master your data model and how your data tables are related to each other: the relationship between customers and orders, where one customer can have many orders but one order can only belong to one customer. Going through that mental exercise, like what we did with the weather stations and the temperature data, will help you determine whether or not the result sets that you get make sense. One of the specific examples we looked at earlier was uncovering the insight that one particular charity filed more than once for the tax year 2015. Doing something like a SUM of revenue across all those rows would bring in data that was potentially duplicated across multiple filings. Understanding that data model, or working with your subject matter experts to determine what those unique row conditions or fields should be, is critical. Next, if you're combining data historically, be sure to use that table suffix and be as granular as possible so you can avoid reading more data than you intend to use. Lastly, if you've used SQL a bunch before, you might have run into self-joins; consider using window functions instead for optimal performance (there's a quick sketch of that at the end of this segment).

Another great performance tip is to keep intensive operations, like ORDER BYs, as the very last thing that you do in your queries. For example, here, as you can see, we have a named subquery on line 12 where we're actually ordering it. But we don't actually need an ORDER BY there, because we're not doing anything with the ordered results. Now, one of the easiest ways to spot this in your queries is if you're doing more than one ORDER BY. If you have an ORDER BY like you see here in the named subquery on line 12, and we're not actually doing anything with the order of that data, we can move that ORDER BY all the way to the end if we want to do something like order by charity name. This allows you to take advantage of any filtering that happens before, so we're only ordering the results that actually matter to us at the very end.
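To show that pattern on its own, here's a minimal sketch, again assuming the public IRS 990 organization table and a `state` column (placeholders, not the query from the slide): the named subquery only filters, and the one ORDER BY happens at the very end on the already-reduced result.

```sql
-- Sketch: filter inside the named subquery, but save the single ORDER BY
-- for the very end so we only sort the rows we actually keep.
WITH ca_charities AS (
  SELECT
    ein,
    name
  FROM `bigquery-public-data.irs_990.irs_990_ein`
  WHERE state = 'CA'   -- filter early; no ORDER BY needed in here
)
SELECT
  ein,
  name
FROM ca_charities
ORDER BY name           -- sort once, on the final filtered result
LIMIT 100;
```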
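And circling back to the self-join tip from a moment ago, here's a minimal sketch of the window-function alternative, using the same assumed GSOD weather table from earlier: to compare each day's temperature against that station's hottest day, an analytic MAX over a partition avoids joining the table back onto itself.

```sql
-- Sketch: compare each day's mean temperature to that station's yearly max
-- using a window function, instead of a self-join on the station id.
SELECT
  stn,
  mo,
  da,
  temp,
  MAX(temp) OVER (PARTITION BY stn) AS station_max_temp
FROM `bigquery-public-data.noaa_gsod.gsod2015`
LIMIT 100;
```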