We mentioned it a few times, but let's talk a little bit about the main tool in your toolkit when it comes to diagnosing performance, and that is the Query Explanation map. Let's talk about what you're going to see here. It's a lot of horizontal bars, and these represent the amount of time that each of your workers is spending in each of those key stages. Because your data and your query are massively parallel and distributed, you could have multiple stages as your workers communicate with each other to get the job done. You can see how many rows are being processed in each stage, all the way on the right. The top shows you the input, and the lower right ultimately shows your output, which is the result of your query.

As we mentioned before, having a high cardinality, or a high number of distinct Wikipedia editors in this particular case, makes this not the smartest query to use for a large GROUP BY. The main takeaway here is that grouping by something that has many distinct values equates to many forced shuffles, which you can think of as communication between each of your different workers. They're also called shards or slots; there are many different names for the workers that actually perform the work in your query. That's because, and this is the key thing to remember, each individual worker, and sometimes you can have upwards of 2,000 workers working on a query at once, is selfish and likes to have a partition of your data where it has all of the IDs in its own local memory, so it can perform those aggregation operations, or any other query operations, across that subset of your data very efficiently.

Two more concepts we want to cover, one of them for performance reasons. We talked a little bit about this when we discussed performance remedies for the Wikipedia example, where we said, hey, if you only wanted to look at one particular year, like last year, you could save that data in a separate permanent table. Similar to that is the concept of table sharding. If you have a large data table but you're only using a certain section of it over and over again, you shouldn't have to pay the performance cost of scanning every single record. All the way on the left, what you see is that traditionally, in relational database systems, you'd have a database administrator, or DBA, who would take a look at your database and then manually partition, or pre-allocate, that space in very defined partitions on disk. You'd be pretty dependent on that DBA and the underlying hardware to get those individual partitions set up and performing well with your queries. Second, later on down the road, you could manually shard your own tables, and typically, as you saw with the GSOD weather examples that we covered in the joins and unions, you'll see a year appended to the end of the table name. With that one, you saw a GSOD data table for 1929 all the way up to today. It results in many different tables, but you're not scanning all of those table records if you just wanted to query last year's temperature data; you only hit the yearly shards you need, as in the sketch below. Now, one caveat I'll introduce here for that middle section, sharding the tables, is that there is what I'll call a transaction cost, or an overhead cost, that you're going to incur by creating all those different tables, because there's metadata associated with each of those tables that BigQuery itself needs to store.
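As a minimal sketch of that sharded-table pattern, here's roughly what a query over the public NOAA GSOD weather dataset could look like; that dataset is sharded into one table per year (gsod1929 through the current year), and the wildcard plus the _TABLE_SUFFIX filter means only the matching year shards get scanned. Treat the exact columns as illustrative, they follow the public dataset's schema.

```sql
-- Rough sketch: query only the 2016-2018 shards of the yearly GSOD tables,
-- instead of scanning every record back to 1929.
SELECT
  stn,                      -- weather station identifier
  AVG(temp) AS avg_temp     -- average recorded temperature
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
  _TABLE_SUFFIX BETWEEN '2016' AND '2018'
GROUP BY
  stn;
```

The trade-off mentioned above still applies: every yearly shard is a separate table with its own metadata that BigQuery has to track.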
The last and best way, if you came across this issue and you really just want to query the most recent data, like 2016, 2017, 2018, is to set up a special version of your table called a date-partitioned table. It looks just like a normal table, except that it has what's called a partition column in it, and this is just a date/time field, as you see here all the way on the right. The example that I'm going to go with is this: if you have a massive amount of customer orders going back, say, 30 years, but you're really only interested in querying things from the last three years, you want to partition the table itself so that you're not scanning over everything, as is typically the case when you're just doing a SELECT query. What you can do is predefine the partitioning when you create the table; the only caveat here is that the table has to be set up as a partitioned table at creation time. Then you can specify a date column there as an automatic partitioning field, where you can partition by day, partition by month, or partition by year.

Let's take a look at what that query would look like if you're querying from a partitioned table. Go with that third, best-case example, where you have a single table, so you don't have to worry about doing a union, wildcards, or anything like that, and you have a partitioning column set up for each of those different days. How you would query that is through one of the reserved fields, as you see there in the WHERE clause on the left, called _PARTITIONTIME. What we can do there is set up a BETWEEN with one timestamp and another endpoint, and what that'll do is the query won't scan through every single record you have in your massive table; it'll only look at those partitions. Earlier we said that your WHERE clause still scans every single record in the table; this is one of those caveats where that's actually not the case. When you have a partitioned table automatically set up by date partitions, the WHERE clause can filter out individual partitions before finding the results of your SELECT statement. That's a pretty cool concept, and there's a quick sketch of it coming up in a moment.

Last up to cover: if you're really interested in performance, all of the BigQuery logs are output to Google Cloud Storage and then read into Stackdriver. Stackdriver is the performance monitoring tool, not just for BigQuery logs, but for all the other Google Cloud products as well. You can export all of your BigQuery logs directly into Stackdriver and take a look at things like your worker or slot utilization, how many queries you have in-flight, how many bytes you've actually uploaded, and how much you're storing on persistent disk. If you take that to the next level, on this next slide you'll see that using Stackdriver logs, you can pick out individual queries that are potentially performing poorly, or queries that are costing your organization the most under that pricing model, and you can perform what I like to call the BigQuery audit. Now, one of the big tips I have for you if you're setting up your Google Cloud project for the first time: be sure to go into billing, and as the second bullet point there says, you can tick the box that exports your billing logs into BigQuery.
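Before moving on, here is that promised sketch of the partitioned-table idea, using hypothetical dataset, table, and column names; it assumes an ingestion-time partitioned table, which is why the query filters on the reserved _PARTITIONTIME pseudo-column (you could equally partition on a date column in your own data).

```sql
-- Step 1: the table has to be created as a partitioned table up front.
-- Here it is partitioned by ingestion date.
CREATE TABLE mydataset.customer_orders (
  order_id    INT64,
  customer_id STRING,
  order_total NUMERIC
)
PARTITION BY _PARTITIONDATE;

-- Step 2: filter on the reserved _PARTITIONTIME pseudo-column so BigQuery
-- prunes partitions and only scans the date range you ask for, rather than
-- all 30 years of history.
SELECT
  customer_id,
  SUM(order_total) AS total_spend
FROM
  mydataset.customer_orders
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2018-12-31')
GROUP BY
  customer_id;
```

That WHERE clause is the exception mentioned above: instead of scanning every record, it eliminates whole partitions before the SELECT ever touches the data.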
Once that billing export is enabled, all of your auditing and all of your billing history becomes directly queryable in BigQuery, just like any other table you'd run SQL analysis on, and then you'll be able to build Google Data Studio dashboards like the one that you see here on the left; a rough sketch of that kind of billing-audit query follows at the end of this section.

It's time for a recap. Performance: since you're paying for the total amount of bytes that you're processing, you first want to limit the number of columns that you're returning, and also consider filtering out your rows by using that WHERE filter whenever you can. Now, this doesn't mean you can't write queries that process your entire dataset; BigQuery, of course, was built for petabyte scale. You just want to be mindful of the resources that you're consuming. We then covered the Query Explanation map, where we get those cool visual cues about the types of work that are most demanded by your queries. Note here that a large difference between the average and the max could indicate a heavy data skew, which can be helped by filtering your data as early as possible. We've also covered some SQL bad behavior, like selecting every record and ordering it without using a LIMIT clause. Finally, if you're only accessing recent data, like the most recent events, consider using table partitioning to reduce the total amount of bytes that are scanned by your query. Now let's troubleshoot some of these queries in this next lab.
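And here is that rough sketch of a billing-audit query you might run once the export is enabled. The table name is a placeholder (the export creates a table named after your billing account), and the columns shown follow the standard GCP billing export schema; adjust them to whatever your export actually contains.

```sql
-- Rough sketch: total cost per Google Cloud service since the start of 2018,
-- queried from the billing export table (placeholder name).
SELECT
  service.description AS service_name,
  SUM(cost) AS total_cost
FROM
  `my-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE
  usage_start_time >= TIMESTAMP('2018-01-01')
GROUP BY
  service_name
ORDER BY
  total_cost DESC;
```

A query like this is the kind of thing you can wire straight into a Google Data Studio dashboard for the BigQuery audit described above.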