With windows, you decide where you put each message, but you now need to make an additional decision: when is the window going to emit its results? At first glance, you may decide to simply emit the results when the window closes. This is very intuitive if you have a fixed window, but in other situations, such as session windows, it might not be so obvious. In addition, you will also receive late data, so you need to decide how to trigger output in the case of late data. And how is late data defined? If you're windowing by event time, your messages should fall within the boundaries of the window. How do you decide whether a message is late, or whether you have waited long enough for late data? This is where the concept of a watermark becomes useful.

Let's first focus on how windows work when there is no latency and no late data. In an ideal world, if there were no latency and everything were instantaneous, fixed-time windows would simply flush at the close of the window. At the very microsecond the next minute begins, a one-minute window terminates and flushes all its data. But this is only true if there is no latency. In the real world, in a streaming pipeline, the order of the data will be altered. Even if you receive data in perfect order, once it is processed by the pipeline in a distributed system, different messages take different processing times and that order is lost. How can you decide that the window can be closed if the data is out of order? How can you be sure that no further, older messages will be received?

In a streaming pipeline there are two dimensions of time, and the relationship between the two defines what is called the watermark. The watermark is the relationship between the processing timestamp and the event timestamp. The processing timestamp is the moment the message arrives at the pipeline. Ideally both timestamps would be the same, with no delays; however, that rarely happens. There are always delays, latencies, and so on. Any message that arrives before the watermark is said to be early, and this does happen. If it arrives right at the watermark, it is said to be on time, and if it arrives later, it is late data. The watermark is what defines whether a message is early or late. The watermark cannot be calculated exactly, because it depends on messages we have not yet seen. Dataflow estimates the watermark as the oldest timestamp waiting to be processed, and this estimate is continuously updated with every new message that is received.

Now, why do you need to keep track of two dimensions of time and a watermark definition? Let's see how watermarks help decide when a window is complete and you can proceed with your calculations. In a real-world setting, data will always arrive with some lag. This lag is the difference between when the data was expected and when the data actually arrives. This deviation from the ideal expectation is what we call the watermark, so we keep track of the lag of every message and try to predict the value of the watermark. Extrapolating that lag into the future, when the timestamp of the last message is at or after the value of the watermark, the window can be considered complete. Any message received after this moment will be considered late. In this example, data 1 is late because it is arriving much later than when it was expected; that is, it is arriving much later than the watermark. Data is only late when compared to the watermark: it doesn't make sense to talk about late data unless we have a watermark.
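To make this concrete, here is a minimal sketch using the Apache Beam Python SDK (one of the SDKs used to write Dataflow pipelines). The PCollection name `events`, the window size, and the two-minute lateness value are illustrative assumptions; the AfterWatermark trigger fires once the watermark passes the end of the window.

```python
# Minimal sketch (Apache Beam Python SDK): one-minute fixed windows that emit
# when the watermark passes the end of the window, and keep the window open
# two extra minutes for late data. `events` is a hypothetical PCollection of
# timestamped (key, value) elements.
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

counts = (
    events
    | "FixedWindows" >> beam.WindowInto(
        window.FixedWindows(60),                        # 1-minute event-time windows
        trigger=AfterWatermark(),                       # fire when the watermark passes the window end
        allowed_lateness=120,                           # keep accepting late data for 2 more minutes
        accumulation_mode=AccumulationMode.DISCARDING)  # each firing emits only new elements
    | "CountPerKey" >> beam.combiners.Count.PerKey()
)
```

Messages arriving after the allowed lateness has expired are dropped, which matches the default behavior described below.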
Dataflow will wait until the watermark has passed to close the window. Actually, it waits for some additional time as a form of buffer, but after that, the window flushes and the result is emitted. Any message coming after this moment will be considered late. You will need to make a decision about what to do with late data. The default behavior is to drop late data, but as you will see in the triggers section, you can choose to wait for late data and emit results again if any late messages arrive.

When you run a streaming pipeline in Dataflow, the job metrics page contains some details about the watermark values. The data freshness metric is related to the watermark of your input data. When you are processing fresh data, the data freshness value decreases as windows close and those messages are considered fully processed. The data has not had to wait before being processed by Dataflow, so the watermark will be close to real time. Data freshness is the difference between real time and the timestamp of the oldest message waiting to be processed. The watermark is, in effect, the timestamp of the oldest message that has not yet been processed, so data freshness is a measure of how far the oldest message is from the current moment.

When you see a monotonically increasing data freshness value, it means data has to wait longer at the input before it starts being processed. There could be two reasons for the additional wait: your pipeline is busy processing messages, or the input has increased very quickly and data is accumulating at the input, or both. How can we distinguish between these situations? For that, system latency is a useful metric. System latency measures the time it takes to fully process a message, including any waiting time at the input source. If for some reason the pipeline needs more time to process a message, for instance because it is busy processing a complex message, then system latency will increase.

When system latency keeps increasing and data freshness keeps increasing too, it means the pipeline cannot take in more messages until it finishes processing the current ones. You are not necessarily receiving a lot more messages; the pipeline is just taking longer to process the current ones. But if system latency remains constant or decreases, that is, it does not monotonically increase, while data freshness is monotonically increasing, that is likely because there are many more messages at the input, for instance because there has been a peak at the input. The pipeline is processing data at the same pace, so latency doesn't increase, but unless Dataflow adds more workers to the pipeline, it will not be able to catch up with the input peak, and data freshness increases. If you are running with autoscaling, in this situation Dataflow will spin up more workers to process the additional data received at the input.

Although we don't get actual watermark values from Dataflow, with the freshness and latency metrics we can reason about the state of our pipeline and diagnose whether we are receiving more input or the pipeline is busy doing more computation. Dataflow itself uses these metrics to decide when to scale up or down, adapting the amount of resources used to the actual demand of data processing. The ideal situation for a streaming pipeline is to have both data freshness and system latency stable.
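As a rough mental model (not Dataflow's actual implementation), the relationship between the estimated watermark and data freshness can be sketched in a few lines; the function names and the `pending` list are hypothetical.

```python
# Illustrative sketch only: how the watermark estimate and data freshness
# relate. Timestamps are Unix epoch seconds; `pending` is a hypothetical list
# of event timestamps for messages not yet fully processed.
import time

def estimated_watermark(pending):
    # The watermark is estimated as the timestamp of the oldest message
    # still waiting to be processed.
    return min(pending)

def data_freshness(pending, now=None):
    # Data freshness: how far the oldest unprocessed message is from the
    # current moment. It grows when data waits longer at the input.
    now = time.time() if now is None else now
    return now - estimated_watermark(pending)
```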
If data freshness monotonically increases and latency doesn't increase, that's evidence that you are receiving more input data. Dataflow will spin up new workers because the size of the backlog to be processed is increasing. If latency increases and data freshness is stable, then messages are taking more time to be processed in the pipeline. CPU usage will probably increase, so Dataflow will spin up new workers. You may also see other situations that increase latency without increasing CPU usage; for instance, if you are accessing an external service or API, the service may be overloaded and take a long time to respond. CPU usage will not be high, but latency will increase. In that situation, autoscaling will not create more workers; they would be useless anyway for accelerating the pipeline. If both metrics monotonically increase, then your backlog is increasing and your pipeline is also taking more time to process each message. Autoscaling should create more workers to adapt to the increasing demand.
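The diagnostic reasoning above can be condensed into a small, purely illustrative helper; the trend labels and the returned messages are assumptions, not Dataflow APIs or its actual autoscaling logic.

```python
# Purely illustrative: encodes the freshness/latency reasoning from the text.
# The trends ("increasing" or "stable") would come from watching the job
# metrics page over time; nothing here calls a real Dataflow API.
def diagnose(freshness_trend: str, latency_trend: str) -> str:
    if freshness_trend == "increasing" and latency_trend == "stable":
        return "More input data: the backlog is growing, expect upscaling."
    if freshness_trend == "stable" and latency_trend == "increasing":
        return "Messages take longer to process: check CPU or slow external services."
    if freshness_trend == "increasing" and latency_trend == "increasing":
        return "Backlog growing and processing slower: more workers are needed."
    return "Healthy: both metrics are stable."
```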