Let's start by identifying some of the challenges associated with processing streaming data. Dataflow, as we already know, provides a serverless service for processing batch and streaming data. It is scalable, and for streaming it provides a low-latency processing pipeline for incoming messages. We've discussed that we can have bounded and unbounded collections. Now we are examining an unbounded collection, the kind that results from a streaming job. All of the things we've done thus far, like branching and merging, we can also do with Dataflow streaming pipelines. However, now every step of the pipeline is going to act in real time on incoming messages rather than on batches.

What are some of the challenges with processing streaming data? One challenge is scalability: being able to handle the volume of data as it gets larger and/or more frequent. The second challenge is fault tolerance: the larger you get, the more sensitive you are to going down unexpectedly. The third challenge is the model being used, continuous streaming or repeated batches. Another challenge is timing, or the latency of the data; for example, what if the network has a delay, or a sensor goes bad and messages can't be sent?

Additionally, there is a challenge around any aggregation you might be trying to do. For example, suppose you are trying to take the average of the data. In a streaming scenario, you cannot just plug values into the formula for an average, the sum of the values from 1 to n divided by n, because n is an ever-growing number. Instead, you have to divide time into windows and compute the average within each window. This can be a pain if you have ever had to write a system like this yourself; you can imagine how difficult it is to maintain the windowing, the timers, the threads, and so on. The good news is that Dataflow does this for you automatically. In Dataflow, when you are reading messages from Pub/Sub, every message carries a Pub/Sub message timestamp, and you can use this timestamp to place the data into the different time windows and aggregate each of those windows. A windowed-average sketch appears after this section.

Message ordering matters. There may be a latency between the time that a sensor is read and the time the message is sent. You may need to modify the timestamps if this latency is significant. If you want to modify a timestamp and base it on some property of your data itself, you can do that; for example, the sensor may provide its own date timestamp as part of every message that comes in. Elements arrive with a default date timestamp, or DTS, which is the time of entry into the system rather than the time the sensor data was captured. A PTransform can extract the date timestamp from the data portion of the element and modify the DTS metadata, so that the time of data capture can be used in window processing. The second sketch below shows code to modify the date timestamp by replacing the message timestamp with a timestamp from the element data.

If PubsubIO is configured to use custom message IDs, Dataflow deduplicates messages by maintaining a list of all the custom IDs it has seen in the last 10 minutes. If a new message's ID is in that list, the message is assumed to be a duplicate and is discarded. The last sketch below shows one way this can be configured.
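To make the windowed-aggregation idea concrete, here is a minimal sketch using the Apache Beam Python SDK (the SDK that Dataflow runs). The project, topic, 60-second window size, and the assumption that each message is a numeric reading encoded as a UTF-8 string are all placeholders for illustration, not something taken from the course material.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import MeanCombineFn
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # Read an unbounded stream of sensor readings from Pub/Sub.
     | "Read" >> beam.io.ReadFromPubSub(
         topic="projects/my-project/topics/readings")
     # Decode the bytes and parse the numeric reading (assumed format).
     | "Parse" >> beam.Map(lambda msg: float(msg.decode("utf-8")))
     # Divide the unbounded stream into fixed 60-second windows.
     | "Window" >> beam.WindowInto(FixedWindows(60))
     # Compute the average within each window; without_defaults() is
     # needed because the collection is no longer globally windowed.
     | "Average" >> beam.CombineGlobally(MeanCombineFn()).without_defaults()
     | "Print" >> beam.Map(print))
```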
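Here is a sketch of the timestamp-replacement step described above, again in the Beam Python SDK. The JSON message format and the "timestamp" field name are assumptions made for illustration; the pattern itself is emitting a TimestampedValue from a DoFn.

```python
import json

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


class AddTimestamp(beam.DoFn):
    """Replace the default timestamp (time of entry to the system) with
    the capture time carried inside the element itself."""

    def process(self, element):
        record = json.loads(element.decode("utf-8"))
        # Assumed: each message carries its capture time as a Unix
        # timestamp in seconds under a "timestamp" field.
        capture_time = float(record["timestamp"])
        # Emitting a TimestampedValue overrides the element's DTS metadata,
        # so windowing is based on the time of data capture.
        yield TimestampedValue(record, capture_time)


# Usage inside a pipeline (a Pub/Sub "Read" step as in the first sketch
# is assumed to come before this transform):
#   | "AddTimestamp" >> beam.ParDo(AddTimestamp())
```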
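Finally, here is one way the custom-ID deduplication might be configured with the Python SDK's ReadFromPubSub transform. The subscription path and the attribute names "message_id" and "event_time" are placeholders, and the assumption is that the publisher sets those attributes on each Pub/Sub message.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     # id_label names the message attribute that carries a unique custom ID;
     # Dataflow discards any message whose ID it has already seen within
     # the last 10 minutes. timestamp_attribute names the attribute whose
     # value becomes the element's timestamp (DTS).
     | "ReadWithIds" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/readings-sub",
         id_label="message_id",
         timestamp_attribute="event_time")
     | "Print" >> beam.Map(print))
```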