This module discusses what stream processing is, how it fits into a big data architecture, when stream processing makes sense, and the challenges associated with processing streaming data. As this module is all about streaming, I'll be discussing that part of the reference architecture. Data typically comes in through Pub/Sub, then goes through aggregation and transformation in Dataflow. Then you use BigQuery or Cloud Bigtable, depending on whether the objective is to write aggregates or individual records coming in from streaming sources.

Let's look at streaming ideas first. Why do we stream? Streaming enables us to get real-time information in a dashboard, or by another means, to see the state of your organization.

In the case of the New York City Cyber Command, Noam Dorogoyer stated the following: "We have data coming from external vendors, and all this data is ingested through Pub/Sub, and Pub/Sub pushes it through to Dataflow, which can parse or enrich the data. If data comes in late, especially when it comes to cybersecurity, it's no longer valuable, especially during an emergency. From a data engineering standpoint, the way we constructed the pipeline is to minimize latency at every single step. If it's maybe a Dataflow job, we designed it so that as many elements as possible are happening in parallel, so at no point is there a step that's waiting for a previous one."

The amount of data flowing through the Cyber Command varies each day. Dorogoyer said that on weekdays, during peak times, it could be five or six terabytes. On weekends, that can drop to two or three terabytes. As the Cyber Command increases visibility across agencies, it will deal with petabytes of data. Security analysts can access the data in several ways. They run queries in BigQuery, or use other tools that provide visualizations of the data, such as Google Data Studio.

Streaming is data processing on unbounded data. Bounded data, by contrast, is data at rest.
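To make the bounded versus unbounded distinction concrete, here is a minimal Python sketch. It is only an illustration, not how any Google Cloud product works internally: `bounded_source`, `unbounded_source`, and `running_totals` are hypothetical names made up for this example. The point is that with data at rest you can compute one final answer, whereas with a source that never ends you can only emit results as you go.

```python
import itertools
from typing import Iterable, Iterator, List


def bounded_source() -> List[int]:
    """Data at rest: the whole dataset exists before processing starts."""
    return [3, 1, 4, 1, 5]


def unbounded_source() -> Iterator[int]:
    """Hypothetical endless stream: there is never a 'last' element."""
    for i in itertools.count():
        yield i % 10


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Emit a partial result after every event instead of one final answer."""
    total = 0
    for value in events:
        total += value
        yield total


# Bounded: a single, complete answer is possible.
batch_total = sum(bounded_source())

# Unbounded: sum(unbounded_source()) would never return, so we look at
# partial results instead. islice only limits how many we print here.
partial_results = list(itertools.islice(running_totals(unbounded_source()), 5))

print(batch_total)
print(partial_results)
```

Calling `sum` directly on the unbounded source would never finish, which is why the streaming side of the sketch works with partial results.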
Stream processing is how you deal with unbounded data. A stream processing engine provides low latency, speculative or partial results, the ability to flexibly reason about time, controls for correctness, and the power to perform complex analysis. You can use streaming to build a real-time data warehouse, and then create a dashboard of real-time information. For example, you could see, in real time, the positive versus negative tweets about your company's product. You can also use it to detect fraud, for gaming events, or for finance back-office apps such as stock trading, anything dealing with markets, et cetera.

When you look at the challenges associated with streaming applications, you're talking about the three V's: volume, velocity, and variety of data. Volume is a challenge because the data never stops coming and quickly grows. Velocity is a challenge because, depending on what you are doing (trading stocks, tracking financial information, opening subway gates), you can have tens of thousands of records per second being transferred. Velocity can also be highly variable. For example, if you are a retailer designing your point-of-sale system nationwide, you are probably going to carry along at a reasonably steady volume all year until you get to Black Friday, when sales and the data being transferred go through the roof. It is important to design systems that can handle that extra load. Variety of data is the third challenge. If you are using only structured data, such as data coming from a mobile app, that is easy enough to handle, but what if you have unstructured data like voice data or images? These are streaming records, and in some cases, a null value might be used to deal with that type of unstructured data.

We're going to look at how streaming in the Cloud can help us here. On the volume side, we will look at a tool to assist in autoscaling processing and analysis, so that the system can handle the volume.
On the velocity side, we will look at a tool that can handle the variability of the streaming process. On the variety side, we will look at how artificial intelligence can help us with unstructured data. The three products you are going to examine here are Pub/Sub, which will allow you to handle changing and variable volumes of data; Dataflow, which can assist in processing data without undue delays; and BigQuery, which you will use for your ad hoc reporting, even on streaming data.

Let's take a look at the steps that happen. First, some data is coming in, possibly from an app, a database, or an Internet of Things (IoT) device. These sources generate events. Then an action takes place: you ingest and distribute those events with Pub/Sub. This ensures that messages are delivered reliably, and it gives you buffering. Dataflow is then what aggregates and enriches the data and detects patterns in it. Next, you write into a database of some kind, such as BigQuery or Bigtable, or maybe run things through a machine learning model. For example, you might use this streaming data as it is coming in to train a model in Vertex AI. Then finally, Dataflow or Dataproc could be used for batch processing, backfilling, et cetera. This is a pretty common way to put things together in Google Cloud.
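The steps just described can be sketched in plain Python. To be clear, this is a toy model under stated assumptions, not the Pub/Sub, Dataflow, or BigQuery APIs: the `Topic` class, `aggregate_by_window`, `write_rows`, and the 60-second fixed window are all hypothetical stand-ins chosen for illustration. It only shows the shape of the flow: events are published into a buffer, pulled and aggregated, then written out as rows.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple


@dataclass
class Event:
    source: str      # e.g. "app", "database", or "iot"
    timestamp: int   # event time in seconds (assumed unit)
    value: float


class Topic:
    """Stand-in for the ingest step: buffers events until they are pulled."""

    def __init__(self) -> None:
        self._buffer: Deque[Event] = deque()

    def publish(self, event: Event) -> None:
        self._buffer.append(event)

    def pull(self) -> List[Event]:
        drained = list(self._buffer)
        self._buffer.clear()
        return drained


def aggregate_by_window(
    events: List[Event], window_seconds: int
) -> Dict[Tuple[str, int], float]:
    """Stand-in for the processing step: sum values per source and window."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for event in events:
        window_start = event.timestamp - (event.timestamp % window_seconds)
        totals[(event.source, window_start)] += event.value
    return dict(totals)


def write_rows(table: List[dict], aggregates: Dict[Tuple[str, int], float]) -> None:
    """Stand-in for the storage step: append one row per aggregate."""
    for (source, window_start), total in sorted(aggregates.items()):
        table.append({"source": source, "window_start": window_start, "total": total})


# 1. Sources generate events, which are ingested and buffered.
topic = Topic()
for event in [
    Event("app", 5, 1.0),
    Event("iot", 20, 2.5),
    Event("app", 50, 3.0),
    Event("app", 65, 4.0),
]:
    topic.publish(event)

# 2. Pull the buffered events and aggregate them.
aggregates = aggregate_by_window(topic.pull(), window_seconds=60)

# 3. Write the aggregates to a table (here, just a list of dicts).
table: List[dict] = []
write_rows(table, aggregates)

for row in table:
    print(row)
```

In a real deployment each stand-in would be replaced by the managed service, and the processing step would run continuously rather than once over a drained buffer.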