What I'm going to do is go to this "Query data in Athena with Amazon S3" lab. I'm going to click on my work, and then I'm going to go through and run this lab. You can see that we're going to use Glue, which is a serverless, automated ETL system. It integrates with Athena, Redshift, and also EMR. Athena is a serverless data query tool, Redshift is a data warehouse, and EMR is basically Spark or Hadoop, a MapReduce system. So we're going to use Glue, so make sure you understand how to do that. We're going to crawl some data, and then we're going to query it using Athena. So I'm going to go ahead and say Start Lab here, and that will take just a second to start up. Yes, sure. Okay, let me fix this real quick. Good question. I think I didn't publish the labs for you yet. So I'm going to go to Vocareum, and then I'm going to go to UNC here. I'm glad you asked me this. I'm going to go to Assignments, go to Data Analytics, go to Assignments, go here, Publish. I don't know why they hide these at first. Very good question. There we go. So you should now get access to this. If you refresh, it should start appearing. There we go. Or maybe log out and log back in again, but I think it's instantaneous. Okay, perfect, some people see it, great. So great question, and I'll publish the rest of the ones we have access to later. Now that that's there, next I'm going to click on AWS here, and I'll probably just put these side by side, this one here and that one right next door, so we can look at both at the same time. Okay, so the first step here is let's get in here.
In this scenario, you are a data analyst working at an international development agency, looking at some drought relief data, or this could be coronavirus data, or whatever is currently happening in the world. You can see in the diagram that the IAM system has to make sure that the user has access to these resources: Glue, Athena, and S3. So the first thing we're going to do is crawl data inside of Glue, and that's, again, what Glue is designed for. I know many people doing data engineering at real companies, and this is actually one of their tools. It's a great tool. So first step, we're going to go to Glue. I'm going to go here, type in Glue, and you can see here it's a fully managed ETL system. Next, we'll choose Get started, and then we're going to add tables using a crawler. So Add tables using a crawler, and then we're going to type in weather right here. The next step is to specify the crawler data source, so we're going to click Data stores. What that means is that we can choose S3 as a data store, or, and I've actually used this before in consulting projects, JDBC, which lets you point it at some external database that lives on the other side of the world. You could periodically pull in that data, index it, and transform it. So it's a very flexible and powerful system, and it can also talk to DynamoDB. We're going to leave it at S3. We're going to grab this NOAA data, NOAA being the National Oceanic and Atmospheric Administration. I know somebody that works there. So we go Next here, and then we'll say No, Next here. Not sure why that's popping up. Let's paste this in here, Next, Add crawler. If you have to, make it bigger. Okay, yes, I just have to make it a little bit bigger. So then I go back here and say choose an existing role. Again, this is something they created for us just to simplify things.
But if you were using this yourself, you would need to make sure that your IAM role had the ability to talk to, for example, S3. Then we're going to go to Next, and select Run on demand. This is another really interesting component of Glue: if you are doing this on a real project, you probably want to start with Run on demand so that you can just test it out, and it will crawl the data once for you. Once you get the system working, then you would go and make it run at an hourly interval, or a monthly interval, or something like that. So we'll say Next here, and then for the database we're going to say Add database, and we'll name the database weatherdata. Essentially, what this is doing is allowing us to have a catalog. Then we're going to scroll down here and say Next, and I believe that's all set up; we don't have to do anything else. Okay, so now it says run the crawler. We're going to tell AWS Glue to run it now, and you'll see it. Basically, it's going to go inside of S3, look inside of that folder, and find all the metadata about what's inside that folder. If there are files in there that have data, it will find the types, and it'll find the names of the columns. And that's, again, the power of something like this: you could write all the code to do this yourself, for sure, but why would you do that when you could just click a button and do the work of 50 people? This is really the part of the cloud, I think, that is very disruptive: many organizations right now have tons of software developers poorly recreating what is already just a few button clicks away in the cloud. So next up, we can read more if we want to do that. While it crawls, it said that after about one minute the status will change to ready. Then we can refresh this thing; it's still starting.
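The console steps above can be sketched as a boto3-style crawler configuration. Everything here is a hypothetical stand-in for what the lab actually provisions: the role name, the S3 path, and the crawler name are placeholders, not values from the lab.

```python
# Hypothetical Glue crawler configuration mirroring the console steps above.
# Role name and S3 path are placeholders, not the lab's real values.
crawler_config = {
    "Name": "weather",                        # crawler name typed into the console
    "Role": "existing-glue-lab-role",         # the pre-created IAM role we selected
    "DatabaseName": "weatherdata",            # the catalog database we added
    "Targets": {
        # The data store to crawl; JDBC or DynamoDB targets would go here instead.
        "S3Targets": [{"Path": "s3://example-noaa-weather-data/"}]
    },
    # "Run on demand" means simply omitting Schedule. For an hourly crawl you
    # would add a cron expression instead, e.g. "Schedule": "cron(0 * * * ? *)".
}
print(crawler_config["Targets"]["S3Targets"][0]["Path"])
```

With boto3, a dict like this could be passed to the Glue client's `create_crawler` call; here it is shown only as data, since the lab does all of this through the console.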
Then we'll see that a table has been created, so we'll just let this thing go for a second. Let's see, there we go. Maybe I'll even expand it a little bit. Status: one minute elapsed. There we go, stopping. So it created that information, and you'll see the count in the Tables added column go up. Now what we're going to do is inspect the data AWS Glue captured. We go to Tables here, and you can see it figured out all this metadata, right? It figured out the name of each column, the data type, even what kind of data it is. So it's a pretty powerful system that automatically does something you would otherwise have to write some Python code to do, for example. Now we're going to go to Databases, go to weatherdata, and choose the table under it, which I guess is what I just looked at. You should see something similar to this. Now we're going to edit it, and this is another cool thing: you can edit the schema on the fly here. We want to clean up the schema a little bit, changing the columns by selecting them and entering new names. So we're going to choose Edit schema, and then we just change these. I'll scroll this over a little bit, and we'll say this first one is going to be called station. Well, I guess I have to work hard here and actually type something. So station, and then the next ones will be date, then type, then observation.
This would be very similar to if you had, say, an existing database with columns of a certain name, and you were integrating with another database: you could change the column names so that they match, making it easy to integrate the data. That might be one example. Let me see if I'm messing this up here. So we've got station, date, type, observation, m flag, then s flag, then q flag, and the last one is time. Okay, we're going to say Save. We've now changed the schema so that it's got meaningful names, and now we're going to query the data using the Glue catalog. We're going to go to the Tables icon, click on this CSV table, and view the data. From the Action menu, is there a View data here? View properties, what am I missing? Check the table's checkbox. There we go, so check this, then go to View data to preview the data. Okay, and then this will say Get started. What's important here is that you must specify a storage bucket to hold the results from the queries that we run. So we're going to go to S3 now. I'm going to open this up in another tab, and I believe there's a bucket. Yeah, right here, if I click on it. Let's see, in the bucket properties there should be an ARN. Select the bucket name to use, Copy bucket ARN, where's copy? There we go, Copy bucket ARN, okay. Now that I've got that, we're going to return to Athena and set up a query result location. We'll go here and paste that in, and I believe if I do a Ctrl+A I can swap this out a little bit: change the beginning to s3:// and add a trailing slash as well.
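The renaming we just did by hand can be captured as a simple old-name-to-new-name mapping. Note an assumption here: the auto-generated names col0 through col7 are my guess at what the crawler produces for a header-less CSV; the lab doesn't show them on screen, and the types below are illustrative too.

```python
# Hypothetical mapping from the crawler's auto-generated column names
# (assumed col0..col7 for a header-less CSV) to the names we typed in.
renames = {
    "col0": "station",
    "col1": "date",
    "col2": "type",
    "col3": "observation",
    "col4": "m_flag",
    "col5": "s_flag",
    "col6": "q_flag",
    "col7": "time",
}

def rename_columns(schema, renames):
    """Return a new schema with columns renamed; data types are untouched."""
    return [(renames.get(name, name), dtype) for name, dtype in schema]

# Illustrative crawled schema: (name, type) pairs as Glue might have inferred them.
crawled = [("col0", "string"), ("col1", "string"), ("col2", "string"),
           ("col3", "bigint"), ("col4", "string"), ("col5", "string"),
           ("col6", "string"), ("col7", "string")]
print(rename_columns(crawled, renames))
```

This is exactly what the Edit schema screen does for us: the names change, the inferred types stay.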
And I don't think I need to, actually, yes, you do have to do this: swap that at the beginning, and I believe it should look right. These aren't completed yet, so we'll go and say Save. Now, for the list of databases, we'll choose weatherdata; we got that. Then choose the table, the CSV one; we got that. Then there's a little ellipsis here, and this is a little tricky to find. It allows us to create default queries against the table. We're going to click on it, and I believe we just say Preview table, and it creates a SELECT statement for us. So again, pretty awesome: even if there were petabytes of data in here, we have already cataloged it, we have already indexed it, we know what's inside. Look, I can run this query and it shows me the data; you can see here the first 10 records of that table. Next step: we're going to create a table for data after 1950. We're going to create an S3 bucket to store the external data, and the bucket should be in the same region. Following the instructions, we copy the provided query, remembering to replace the bucket name. Okay, so let's go to S3 now and create a bucket, and we'll call this bucket whatever today is, March 1 UNCC Demo. This exact name may not work for you, because all bucket names are globally unique. Okay, so we've got this. I'm going to click on it and copy the bucket name; actually, I think I just need to copy this part of the name right here, there we go. Now I'm going to go back to Athena and put this into a query. I'm going to make a new query, paste the SQL query in here, and then swap in this bucket name. If you're doing the same thing, it would be whatever bucket you created.
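The Preview table action just generates a SELECT with a LIMIT for you. Assuming the database and table names from this lab (my assumption; yours may differ), the generated SQL looks roughly like this, built here as a Python string since an Athena query is just text:

```python
database = "weatherdata"   # catalog database created by the crawler
table = "csv"              # table name Glue derived from the S3 folder (assumed)

# Athena's "Preview table" generates a query of roughly this shape.
preview_query = f'SELECT * FROM "{database}"."{table}" LIMIT 10;'
print(preview_query)
```

Because Glue has already cataloged the column names and types, Athena can run this against the raw files in S3 without us defining any schema by hand.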
Then we're going to create a new table and put the results of this query into that bucket, which for me would be that one. So we're saying: here's the format of the data, here's our external location. And I missed an important part, [LAUGH] the WHERE clause, which is the part that actually finds the dates we're looking for. Basically, we're saying: give me the data between 1950 and 2015 and put it into this bucket that I created. So, now that we've done that, I think we hit Return, and I can say Save. Actually, this should just work here. Let's see. I don't know why the Run query button is not working. Let me just paste this in one more time, paste that in, then go back to the bucket name that I had, which is this one, and paste that in here, okay? And maybe Format query, you don't have to do that, but sure, I'll format the query. Then I believe I can just do Ctrl+Enter to run the query. There we go, Ctrl+Enter, and now it's going through all the data inside of the original S3 bucket, filtering based on our query, and then it's going to put the results, I believe, into that other bucket. So let this thing crank. Again, you can imagine the complexity: you're working with petabytes of data, and someone hires you, a graduate of the data science program, and says, let's have you query this data. I think many people would reach for Python first, right? Hey, let's just start playing around with some Python code. You could be at it for months trying to figure out how to get this done, versus using the right tool for the job: step one, index it; step two, hook it up to a serverless query system like Athena. Behind the scenes it's spinning up servers, doing all kinds of stuff, and then running the SQL query.
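The query I pasted follows Athena's CTAS (CREATE TABLE AS SELECT) pattern: a format, an external location, and the WHERE clause that does the filtering. The table name, bucket name, column list, and date literal format below are my reconstruction of what such a query looks like, not a copy of the lab's exact query:

```python
results_bucket = "march-1-uncc-demo"  # the bucket created a moment ago; yours will differ

# A CTAS query of roughly this shape: filter the crawled table to 1950-2015 and
# write the result out as compressed Parquet files in our own bucket. The exact
# date literal format depends on how the date column was crawled.
ctas_query = f"""
CREATE TABLE weatherdata.late20th
WITH (
    format = 'PARQUET',
    external_location = 's3://{results_bucket}/late20th/'
) AS
SELECT station, date, type, observation
FROM weatherdata.csv
WHERE date BETWEEN '1950-01-01' AND '2015-12-31';
"""
print(ctas_query)
```

Forgetting the WHERE clause, as I almost did, would copy the entire dataset instead of just the 1950-2015 slice, which is why that line matters.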
Then you can use the skills you already have, just write some SQL statements, and go through and filter that data. We can see here that it was successful. Now look, I can click on this and say Preview table, and we can look inside of our table and see all the data I was able to get from that time period. And if I go back to S3 now and refresh, you can see what it put there. Actually, this is a pretty big dataset. Wow, how big is this dataset? If I go back into S3 here, into the bucket we're querying, those files don't look that big, but I guess that's because that's just the metadata; in this scenario we copied the real data to here. If I went to Properties, it should tell us. Anyway, we can see these are big files. These are Parquet files, compressed columnar data files, and now we have easier ways to partition the data and then query it. Next up, we're going to run a query against the selected data. So I'm going to go through here and run another query against the table that we just created, this late-twentieth-century table. Again, Ctrl+Enter to run it, and the query was successful. Then we can also go to the view that we set up, go into the data, and say Preview, and this will show us the results. You can essentially treat this gigantic data lake as a database and use big data tools to query the data. And you can see it's a fairly straightforward process.
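A follow-up query against the new Parquet-backed table works the same way as before. This aggregate is an illustrative example of the kind of query you might run, not the lab's exact query; the table name matches the hypothetical CTAS table above:

```python
# Hypothetical aggregate over the Parquet table we created: count observations
# per station. Athena would scan the compressed Parquet files in our bucket
# here, not the original CSVs, so this scan is smaller and cheaper.
followup_query = """
SELECT station, count(*) AS n_observations
FROM weatherdata.late20th
GROUP BY station
ORDER BY n_observations DESC
LIMIT 10;
"""
print(followup_query)
```

This is the payoff of the whole pipeline: crawl once, transform once, then treat the data lake like a database with ordinary SQL.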