Welcome to the Big Data demo using BigQuery on Google Cloud Platform. Here, we're going to show the serverless scaling features of BigQuery: how it scales up automatically behind the scenes, without your intervention, to query large data sets. We're going to be working with 10 billion rows of Wikipedia data. First things first, we're going to follow the demo scripts; all of these demos and the code that I'll be running are inside the demos folder in our public repository. So first up, we're going to copy the query to the clipboard and navigate to BigQuery. Inside Google Cloud Platform, I already have BigQuery open, but if you need to navigate to it, open the navigation menu. I have it pinned up here, which is kind of like starring it, but if you scroll all the way down, under Big Data you'll find BigQuery. To promote things so you don't have to continuously scroll and search, I just pin them: I'm often using AI Platform Notebooks for machine learning work, Composer for data engineering work, and BigQuery for data analysis work.

Once you're inside BigQuery, we're going to paste our query into the query editor window. Where you're getting data sets from, where your data actually lives, is under Resources, and one of the very popular public data sets available there is Wikipedia. It's one of many: you can also get airline data for flights, Reddit data, geographic data, and the Wikipedia Benchmark for very, very large data sets. If you were given a script, one of my favorite hotkeys is to hold down the Cmd key on a Mac (on Windows, I think it's the Ctrl or Windows key). That will highlight all of the tables inside your query and turn them into buttons, so if you click on one, you get straight back to the schema. It's a great way to iterate between what the columns are, the details and the preview of the data, and the query results. So again, that's just that cool hotkey; all the shortcuts that I mention are available if you open up the shortcuts modal.

So, 10 billion rows. Is this really 10 billion rows? The fastest way to find out is in the Details tab. Here we go: 10,600 million rows of Wikipedia data. What type of data are we talking about, what are we actually going to be querying here? The schema is not too wide: the year, month and day, the Wikimedia project and the language it's in, the title of the Wikipedia page, and how many views it has. It's just a lot, a lot of rows.

So what are we going to do, what type of operation are we going to run? Well, you can see our query. When we run it, it's going to go through 10 billion rows, which is about 415 gigabytes of data; let's see how fast it does that. It's going to return not just columns, it's going to do a calculation. It basically says: give me the language the Wikipedia page was written in, give me the title of that page, and give me the total number of views, where somewhere in the title of any of these articles the name Google is featured. It has to be a capital G, because the string matching here is case-sensitive; I'll show you how to ignore that in just a second with a function. And of course, any time you're doing aggregations, you need to group by. And we want the pages that have Google somewhere in the title.
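To make that concrete, here's a sketch of the query we'll run; the table name assumes the public bigquery-samples.wikipedia_benchmark.Wiki10B benchmark table that this kind of demo typically points at, so adjust it to whatever the demo script actually references.

```sql
-- Sketch of the demo query (assumes the public Wikipedia benchmark table).
SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title LIKE '%Google%'   -- case-sensitive match on the raw title
GROUP BY
  language, title
ORDER BY
  views DESC;
```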
The top pages by view count come first, which I'm assuming will just be a page called Google. Let's go ahead and run this: how long does it take to process 400 gigabytes? And we're running. Again, you're not a DBA, you're not managing indexes or anything like that; you just have your SQL query. It's not even in our data center; we're just using somebody else's data set. And you can see how long it took: when I recorded this video, we got 10 seconds, 415 gigabytes processed, and here's your insight. So it reached out and found 10 billion rows, 10 billion pages of Wikipedia data stored here. A LIKE is a rather expensive operation: it's not only going to look at the columns, it's going to look inside that string value to find whether Google appears anywhere within it. The wildcard character, the percent sign, means any characters before and any characters after. Then it sums up those total views. And it did that pretty quickly: in total, there are 21,400 pages with Google somewhere in the name. The most popular page is the English page for Google, with the Spanish page for Google shortly after that, and then Google Earth, Google Maps, and Chrome as well.

Now, of course, if you wanted to make this not case-sensitive, one of the things you could do is wrap the title so everything is uppercase, and then uppercase the match string as well, so you're matching like for like (sketched below). If you're doing wildcard matching with LIKE, you'll need to use UPPER or LOWER; or, if you're experienced with regex, you can do that instead.

So that is 10 billion, and the really cool thing behind the scenes is the Execution details, where you can see how it actually did this. It took you, the human, 10 seconds; you were just watching it. Behind the scenes, if you took all of the computers and stacked up all the work they did serially, linearly, it would be 2 hours and 38 minutes for one computer to do it, essentially. But that's the beauty of distributed parallel processing, and it happened behind the scenes. You don't have to care about how many virtual machines were spun up to do this work, but in aggregate they did almost three hours of work automatically, and they shared a lot of data between themselves as well. You can see the process of going from those 10 billion records, all the way down after the aggregations, to outputting the result that you see there.

All right, that's cool. 10 billion; let's see if we can do 100 billion. Let's see if we have a data set; I think it's literally just adding another zero, so why not go bigger, right? And again, if you want to get back to that data set, I'm going to hotkey it. Do we get more information here? Yeah, we do: we've got the title, and it's largely the same schema details. Okay, cool, we've got a real big data set: six terabytes, a lot of records. Same principle, an expensive operation, where you go into every single field. How long do you think it's going to take to go through 100 billion records, open up every single title, and then see whether or not, somewhere in that title, there's the string of letters Google? Once it's got that result, it has to take that and all of its friends, across the other 100 billion rows, or those that match, and sum them all together. So the virtual machines have to communicate with each other when they're doing aggregations.
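Here's a sketch that pulls together both of the variations just mentioned: the case-insensitive match and the 100-billion-row run. The table name is an assumption, following the same public benchmark data set and the "add another zero" naming the video describes.

```sql
-- Case-insensitive variant: uppercase both sides of the LIKE so you match like for like,
-- or use a regular expression with the case-insensitive (?i) flag instead.
SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki100B`  -- the 10B table with "another zero"
WHERE
  UPPER(title) LIKE '%GOOGLE%'
  -- or: REGEXP_CONTAINS(title, r'(?i)google')
GROUP BY
  language, title
ORDER BY
  views DESC;
```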
When the workers communicate like that to do the aggregation, that's where the shuffling step comes into play. It looks like this is going to take less than a minute to process. Just over 30 seconds: it went through 4.1 terabytes of data and gave us the result there. And you can see almost a full day of computing if you were doing that on just a single machine, and it doesn't even tell you how many machines were there behind the scenes. That slot time is a phenomenally interesting metric that shows you the scale (you can also pull those numbers with a query, sketched at the end). You waited 31 seconds; behind the scenes, you didn't have to manage the machines, and you used, essentially, 24 hours of compute, boom, just like that. And when you don't need it anymore, you're obviously not paying for those machines; you're just paying for the bytes of data that were processed. All right, that's the demo of BigQuery at scale.
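As a footnote, if you want to pull those same numbers, slot time and bytes processed, with a query instead of the UI, here's a rough sketch against BigQuery's INFORMATION_SCHEMA jobs view; the region qualifier is an assumption and needs to match wherever your jobs actually ran.

```sql
-- Sketch: recent query jobs with elapsed time, bytes processed, and total slot time.
SELECT
  job_id,
  creation_time,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS elapsed_seconds,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY
  creation_time DESC
LIMIT 10;
```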