So far, we've looked at two of the challenges that we started out with. One was to deal with variable data volumes during ingest, and we talked about Pub/Sub as a way to handle that. The second challenge was to do continuous processing of the data as it's streaming in, and we looked at Cloud Dataflow as a way to do that, along with the concepts of a watermark, a trigger, and a window to control how to do aggregations and computations. The third aspect is that even as the data are coming in, we want to be able to do ad hoc analysis and power dashboards, and that's what we will look at next. So now, we're looking at the BigQuery part of the GCP reference diagram. In this chapter, we will look at streaming analytics and dashboards.

Now, this is one of those things that might look trivial in some sense, but it turns out to be extremely difficult to do. Before I joined Google, I used to build real-time weather processing systems. Everything that we did was real-time data, so we built a messaging system that would get data from radars all over the country and carry out computations on that radar data as the data came in. We got new data from a radar about seven times a second, that data was voluminous, and we had to deal with it. So in essence, we built something that looked like Pub/Sub: we had a messaging system. We built something that looked like Dataflow, in the sense that we had the concept of streaming computations. What we couldn't do, what was really hard, was this latter aspect of being able to do ad hoc queries and power things like dashboards without an intermediate step. This is what we did: we would take the results that we computed and save them to disk, and then all of our tools would read the data off disk. The problem with that, of course, is latency. We would tend to flush things to disk once every five minutes, and so in our quote-unquote real-time weather processing system, we dealt with data that was up to five minutes old. That was our trade-off, but you don't need to make that trade-off. That's basically what we're talking about here: being able to do ad hoc analytics and dashboards even as the data are streaming in.

So, what's BigQuery? We've looked at BigQuery several times in this course. It's essentially a fully managed data warehouse. You don't launch a cluster; you just have your data, and BigQuery separates compute and storage. Whenever you want to query the data, you just submit a SQL query and the query happens. It deals with data up to the scale of petabytes, but the querying is very convenient: it's SQL. The data, as is true with everything on GCP, are encrypted at rest and encrypted on the wire. This is true of Pub/Sub, and it's true of BigQuery as well. So, the data are encrypted, they're durable, and BigQuery is highly available. In addition, you only pay for what you use. You basically get thousands of machines that come up and carry out your SQL query for you, but you only pay for what you use. The last aspect, though, is the one we're going to be concerned with: BigQuery also provides streaming ingest for unbounded data sets. It allows you to stream data into BigQuery, into the streaming buffer, and carry out SQL even as the data are streaming in.
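To make that streaming ingest concrete, here is a minimal sketch using the BigQuery Python client library. The project, dataset, table, and column names are hypothetical, not the ones used in the lab.

```python
# Minimal sketch of BigQuery streaming ingest with the Python client library.
# Project, dataset, table, and column names here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.demo_dataset.current_conditions"

rows = [
    {"sensorId": "I-280:NB:lane1", "timestamp": "2024-05-01T10:15:00", "speed": 61.0},
    {"sensorId": "I-280:NB:lane2", "timestamp": "2024-05-01T10:15:00", "speed": 58.5},
]

# insert_rows_json calls the tabledata.insertAll REST endpoint, so the rows land
# in the streaming buffer and become queryable almost immediately. The optional
# row_ids act as insert IDs for best-effort de-duplication.
errors = client.insert_rows_json(
    table_id,
    rows,
    row_ids=[f"{r['sensorId']}-{r['timestamp']}" for r in rows],
)
if errors:
    print("Streaming insert reported errors:", errors)
```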
So, the best practice is to combine Dataflow and BigQuery. You do your routine processing, the things that you always have to do over and over again on every piece of data that comes in, in Dataflow, and then you have Dataflow write out tables in BigQuery. Those tables allow you to make data-driven decisions, do ad hoc queries, and power dashboards, so you're basically taking advantage of BigQuery for long-term storage, because Dataflow is a processing engine and Pub/Sub is an ingest messaging bus. Pub/Sub is a way to decouple publishers and subscribers; it's not permanent storage. BigQuery gives you long-term storage at about the same cost as Cloud Storage, so it's very cheap. Finally, in BigQuery, you want to create views that support the most common queries, the kinds of queries that people are going to run. You take those tables, you write SQL statements to provide views over them, and you give those views away so that people can use them to power their dashboards.

So when you stream data into BigQuery, what kind of rates can you handle? Well, you can handle streaming up to about 100,000 rows per second per table. Essentially what you're doing is calling a REST API, and that's what Dataflow is doing: Dataflow is calling the REST API. So this is not magical; you can do this even outside of Dataflow, as long as you make the call to the REST API. It works for standard tables, and it also works for partitioned tables, so if you have tables partitioned by date, you can stream into those as well. The cool thing, the absolutely cool thing, is that you can query streaming data as it arrives. Normally, BigQuery will go ahead and de-duplicate records: if for whatever reason you insert the same record twice, it de-duplicates it, but this is on a best-effort basis. If you want to be absolutely sure, you provide an insert ID for every row when you insert it, and that way the row can be de-duplicated. But normally, you don't bother; it's fine. The data are inserted by Dataflow into BigQuery.

So once you have your data in BigQuery, let's say in a table called current_conditions, what can you do with it? Well, you can query it. So, what are we doing here? We have this inner query: SELECT sensorId and MAX of timestamp AS timestamp FROM this table, GROUP BY sensorId. In essence, this inner query gives us the latest timestamp for every sensor, so we know when we last received data from every sensor. Then we say: for each of those sensors, for each of those timestamps, give me all of the data. So now we get the most current data from every sensor, and this is what you would do to power your near-real-time dashboard, because you want the latest data to power it.
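Here is what that latest-reading-per-sensor query might look like when run through the BigQuery Python client. The dataset name and the speed column are assumptions for illustration, and the join is just one way to express the pattern described above.

```python
# Sketch of the "latest reading per sensor" query described above, run with the
# BigQuery Python client. The dataset name and the speed column are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT c.sensorId, c.timestamp, c.speed
FROM `demo_dataset.current_conditions` AS c
JOIN (
  -- Inner query: the latest timestamp received from each sensor.
  SELECT sensorId, MAX(timestamp) AS timestamp
  FROM `demo_dataset.current_conditions`
  GROUP BY sensorId
) AS latest
ON c.sensorId = latest.sensorId AND c.timestamp = latest.timestamp
"""

# Results include rows that are still in the streaming buffer, which is what
# makes this query suitable for powering a near-real-time dashboard.
for row in client.query(query).result():
    print(row.sensorId, row.timestamp, row.speed)
```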
You could build the dashboard itself in a variety of dashboard tools. We'll be showing you Data Studio. Data Studio is a dashboard tool from Google; it's free, so it's an easy, convenient thing to show, but there are a variety of other dashboard tools out there that you could use. Data Studio lets you build dashboards and reports, and it works a lot like Google Docs in the sense that you can create a dashboard, share it with people, and collaborate on it. So, it's really good for those kinds of collaboration scenarios. You can build a variety of charts: they could be as simple as bar charts, or charts of different types, or maps. We will look at some of them in the lab.

Now, Data Studio can create all of these graphs, but in terms of what powers them as input, it can connect to a variety of different GCP data sources. It can connect to Cloud SQL, to Bigtable, to BigQuery, et cetera. That list keeps expanding, so I'm not going to say these are the only ones; go ahead and open up Data Studio and you will find the list of sources it can connect to. In particular, though, it offers a BigQuery connector, and in the BigQuery connector you can read from a table or run a custom query and use that to populate the dashboard, and then you can build charts, graphs, maps, a variety of things.

So, what we're going to do is get traffic data from sensors, and we want to monitor each of the lanes on a particular highway. If the traffic in any one lane is much slower than in the other lanes on that highway, it is possible that there is an accident or a blockage in that lane, so we want to issue an alert. This is something that we have to do all the time, right? We have to continuously monitor the highways, look at every lane, and if a lane is slower than the other lanes on the highway, issue an alert. Where should we do this? Should we do this processing in Dataflow, or should we do it in BigQuery? Think about it for a second. Where would you do the processing to figure out if one lane is slower than the others?

I hope you said Dataflow. The reason you want to do it in Dataflow is that you want to do this all the time. BigQuery is good for ad hoc work. If you want to ask, "Is there a lane blockage right now?", that's BigQuery. It's an ad hoc thing: somebody walks into your office and says, "Hey, can you look at the data and show me whether there's a blockage or not?" That's a dashboard, that's BigQuery, and that's fine. But if it's something that we want to do over, and over, and over again, continuously, then it's better done in Dataflow: any kind of continuous analysis on data is better done in Dataflow.

So, we're going to be monitoring, and we're going to be detecting that a particular lane, for example, has slowed down. In order to do that, we're going to read our sensor data from Pub/Sub, window it, and compute the average speed at a particular location, meaning the average over all of the sensors at that location. Then we will find the slowdowns: the FindSlowdowns step computes the average over all of the sensors at a location, and if my lane's speed is slower than everybody else's, then there is a slowdown in my lane. This is basically going to be our pipeline to detect accidents. We're going to get messages from Pub/Sub and extract the data, which is the information associated with the sensor on a lane. We're going to take that lane data and send it directly to the detect-accidents step: this is the current speed in that lane. Meanwhile, we're going to compute the average speed of all of the sensors at that highway location.
Because we are going to be computing an average, and an average doesn't make sense on an unbounded data stream just by itself, it can't be an average over the entire data set, because there is no entire data set. Instead, we're going to compute the average at this location over the past few minutes. That's why we apply a time window and compute the average within it. Now our detect-accidents step is going to take the average at that location and the actual speed in that lane, compare them, and write all the accidents out to Pub/Sub. In addition, it's going to take the average speeds and write them to BigQuery. So, this is going to be our complete pipeline: we go ahead and find our slowdowns and we write them out.
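Putting the pieces together, here is a rough sketch of that pipeline using the Apache Beam Python SDK. The topic names, message fields, window sizes, and the "much slower than the average" threshold are all assumptions for illustration, and a production pipeline would also need to think about triggers and late data; this is not the lab's actual code.

```python
# Hedged sketch of the accident-detection pipeline with the Apache Beam Python SDK.
# Topic names, message fields, window sizes, and the slowdown threshold are assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import SlidingWindows


def parse_message(msg_bytes):
    """Turn a Pub/Sub message into a (location, lane_record) pair."""
    record = json.loads(msg_bytes.decode("utf-8"))
    # e.g. {"location": "I-280:NB:mile12", "lane": 2, "speed": 23.5}
    return record["location"], record


def find_slowdowns(element, averages, threshold=0.6):
    """Emit a lane record if its speed is much slower than its location's average."""
    location, record = element
    avg = averages.get(location)
    if avg is not None and record["speed"] < threshold * avg:
        yield record


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    lane_speeds = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/traffic-sensors")
        | "Parse" >> beam.Map(parse_message)
        | "Window" >> beam.WindowInto(SlidingWindows(size=300, period=60))
    )

    # Average speed per highway location over the past few minutes.
    avg_by_location = (
        lane_speeds
        | "SpeedOnly" >> beam.MapTuple(lambda loc, rec: (loc, rec["speed"]))
        | "AvgSpeed" >> beam.combiners.Mean.PerKey()
    )

    # Compare each lane's current speed with the location average (FindSlowdowns),
    # then write the suspected accidents back out to Pub/Sub.
    _ = (
        lane_speeds
        | "FindSlowdowns" >> beam.FlatMap(
            find_slowdowns, averages=beam.pvalue.AsDict(avg_by_location))
        | "ToBytes" >> beam.Map(lambda rec: json.dumps(rec).encode("utf-8"))
        | "WriteAlerts" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/accident-alerts")
    )

    # The average speeds themselves go to BigQuery for ad hoc queries and dashboards.
    _ = (
        avg_by_location
        | "ToRow" >> beam.MapTuple(lambda loc, avg: {"location": loc, "avg_speed": avg})
        | "WriteAverages" >> beam.io.WriteToBigQuery(
            "my-project:demo_dataset.average_speeds",
            schema="location:STRING,avg_speed:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```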