Welcome back, this is the third course in the data analyst specialization. Here we're going to achieve advanced insights in BigQuery. Now for the first module, we're going to continue our journey in SQL by delving into a few of the more advanced concepts, like statistical approximation functions and user-defined functions. Then we'll explore how to break apart really complex data questions into step-by-step modular pieces in SQL by using common table expressions and subqueries.

Let's start by revisiting the SQL functions that we've covered so far. All right, it's time to revisit some of those functions that we skipped over in previous courses as we move into more and more advanced SQL concepts. The three that are on the agenda are the statistical functions, which include some pretty fun approximations; analytic functions, if you've heard of things like window functions or partitioning in a query; and last but not least, UDFs, or user-defined functions, which can be written in SQL or JavaScript. Let's tackle them one at a time.

First up, some pretty generic statistical functions that you can run on your data. Imagine something like a standard deviation or a correlation analysis. This is just a simple example applying some basic statistics. Here we're looking at the IRS data set, and we're getting the standard deviation of the employee counts. And for the second function, the correlation, we're asking: how correlated is the revenue that one of these charities brings in to the amount of functional expenses it's actually spending on its charitable programs and offerings? So that's the ultimate question: how correlated do you think those two are? The closer the value is to one, the tighter that one-to-one correlation is. So just running that SQL there, you'll see the value here. Standard deviation for employees: about 1,500 employees.
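A query along those lines might look like this. This is a sketch against the public IRS 990 data set; the column names shown (including the oddly spelled noemplyeesw3cnt for the employee count) are how that table happens to name its fields, so check the table schema before relying on them:

```sql
-- Basic statistical functions over the 2015 IRS 990 filings:
-- the standard deviation of employee counts, and the correlation
-- between total revenue and total functional expenses.
SELECT
  STDDEV(noemplyeesw3cnt)             AS stdev_employees,
  CORR(totrevenue, totfuncexpns)      AS corr_revenue_expenses
FROM
  `bigquery-public-data.irs_990.irs_990_2015`;
```

STDDEV and CORR are ordinary aggregate functions, so they can be combined with GROUP BY just like SUM or COUNT if you want the statistics broken out by some category.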
And then you can see that revenue and expenses for these nonprofit charities are very tightly correlated. Again, this is across all of the records in that 2015 filing data set. Look at that 0.97, extremely highly correlated.

All right, that's enough of the basic statistical functions. Let's move into some of the other ones. Here are some approximate aggregation functions. If you want to approximate the number of distinct values in your data set, and say you have an absurd number of rows, what you can use is APPROX_COUNT_DISTINCT. Much like the COUNT(DISTINCT ...) you've seen before on that second line, you can get the exact count, but you can also get an approximation, which can be good enough to get the job done. Now you're thinking in the back of your head: all right, well, COUNT(DISTINCT ...) ran in under five seconds on the data set we were using in our labs, so why on earth would I ever want to use APPROX_COUNT_DISTINCT when I can get the exact count? And me coming from a finance background, we wanted everything down to the penny, right? But imagine you're in a scenario where you work at Google and you have to count all of the user logins over a multiyear period. You're talking about petabytes of data, you don't have all the time in the world, and you're willing to sacrifice a little bit of accuracy for the speed with which you get your query back. Especially when you're counting all the logs that are processed or all the ads that are served, it's very popular to use APPROX_COUNT_DISTINCT.

So let's look at an example. This one comes from one of our BigQuery developer advocates, Felipe. He's an amazing SQL writer, and what we're going to invoke here is that APPROX_COUNT_DISTINCT. The task at hand: the GitHub data set is public on BigQuery, and it includes the user logins for all of the public code activity on GitHub.
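To make the tradeoff concrete, here's a sketch of the exact and approximate counts side by side. The table and column names below assume the public githubarchive data set, where yearly tables record events with an actor.login field; treat them as illustrative:

```sql
-- Exact vs. approximate distinct counts of GitHub users in one year.
-- APPROX_COUNT_DISTINCT trades a small amount of accuracy for much
-- lower memory use and faster execution on very large tables.
SELECT
  COUNT(DISTINCT actor.login)        AS exact_unique_users,
  APPROX_COUNT_DISTINCT(actor.login) AS approx_unique_users
FROM
  `githubarchive.year.2016`;
```

On a small table the two numbers come back almost instantly and nearly identical; the approximation earns its keep as the row count climbs into the billions.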
That's a very fun public data set to play around with. And here what we're doing is counting the approximate number of logins, as an approximate value, broken out by year. You'll see those concepts that we covered in previous courses: the union table wildcard, that's the asterisk in the FROM statement, and then the _TABLE_SUFFIX concatenation, again one of those concepts that we covered in the merging data sets topic.

Okay, so we now have an approximate count of users, but what you might be asking in the back of your mind is this: how accurate is that compared to the actual count? And say we wanted to get a more realistic count of all of the users over the entire period, because this data right here could count the same user more than once if they logged in across multiple years. How do we parse that out? To do that, it takes a little bit more of a complex query, probably the most complex one that you've seen to date as part of the specialization. So let's take a deep breath and walk through it step by step, line by line.

The first thing you're going to notice is that we're invoking what's called a common table expression all the way at the top. We'll cover that WITH clause shortly, but in essence, you can think of it as a temporary table. It's basically saying: anything in those parentheses after WITH github_year_sketches, that whole SELECT statement, gets stored as a kind of temporary table that we can use in a subsequent query. And the reason you do that, again, is to break apart a very complex question into multiple steps. You can run a processing layer on the data and then query those results within that same query. WITH statements, or common table expressions as they're called, are extremely useful.

Okay, so let's get into the guts of the query. In that first query, what we want to do is get the approximate number of logins.
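The per-year counting pattern just described can be sketched like this, combining the table wildcard, _TABLE_SUFFIX, and a WITH clause in one place. The githubarchive table and actor.login column are the same illustrative assumptions as before; the CTE name logins_by_year is made up for this sketch:

```sql
-- A common table expression (CTE): the WITH clause stores an
-- intermediate result that the outer query can then select from,
-- breaking one complex question into modular steps.
WITH logins_by_year AS (
  SELECT
    _TABLE_SUFFIX AS year,                          -- e.g. '2015', '2016'
    APPROX_COUNT_DISTINCT(actor.login) AS approx_users
  FROM
    `githubarchive.year.*`                          -- wildcard unions yearly tables
  GROUP BY
    year
)
SELECT
  year,
  approx_users
FROM
  logins_by_year
ORDER BY
  year;
```

Here the outer SELECT is trivial, but the same shape lets you filter, join, or re-aggregate the intermediate results without repeating the inner query.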
And what we're going to do is invoke a function based on HyperLogLog for another approximation. That's going to estimate the number of logins, just using a different function that's available. There's a long white paper written on HyperLogLog; it's a publicly available algorithm, and it's used all the time in statistical estimation. Okay, so once we've done that, the great thing about this is that we can actually estimate year over year over year. We're doing this for however many years of GitHub data we have; I think on the previous slide it was 2000 to current, and we're getting an estimate for each year. In the HyperLogLog namespace, those are actually called sketches. So we're getting a collection of sketches, which is, in a sense, our collection of count estimates. And in that second block of code at the bottom, we can actually add those sketches together, which I think of as summing them up, and you do that through HLL_COUNT.MERGE. That merges all of those estimations together into an approximate number of unique users. And that, as you see in the last line there, is drawn from the common table expression, that github_year_sketches, that we created earlier.

Okay, so that was a lot to cover, but at the end of the day we ended up with approximately 11 million unique GitHub users. And in a different query, which isn't shown here, we actually computed the exact count, and we found that this approximation was 99.7% accurate. So again, here's the tradeoff: whether you want 100% accuracy and have your queries run a little bit longer, or whether you're fine with executing the approximation functions. If you see these in code along the way, or you start writing these approximations and estimations on very large data sets, just know that they're out there and available for you to use.
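Putting the two HLL steps together, the query being walked through has roughly this shape. HLL_COUNT.INIT and HLL_COUNT.MERGE are the real BigQuery sketch functions; the githubarchive table and actor.login column remain illustrative assumptions about the underlying data:

```sql
-- Step 1 (the CTE): build one HyperLogLog sketch of user logins
-- per year. A sketch is a compact summary from which a distinct
-- count can be estimated.
-- Step 2 (the outer query): merge the yearly sketches into a single
-- estimate of unique users across all years, so a user active in
-- several years is still only counted once.
WITH github_year_sketches AS (
  SELECT
    _TABLE_SUFFIX AS year,
    HLL_COUNT.INIT(actor.login) AS login_sketch
  FROM
    `githubarchive.year.*`
  GROUP BY
    year
)
SELECT
  HLL_COUNT.MERGE(login_sketch) AS approx_unique_users
FROM
  github_year_sketches;
```

The key property is that sketches are mergeable: each year can be summarized independently, and combining the summaries still yields one deduplicated estimate, which is exactly why this beats simply summing the per-year counts.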
And feel free to read a blog article for a little bit more detail on that. So approximation functions are another tool in your toolkit for very, very large data sets; feel free to get familiar with them. And we'll come back to those WITH clauses a little bit more.