So, having tested and built our pipeline and established our environment, let's talk about deployment. We'll look at three distinct stages of the pipeline lifecycle: deployment, in-flight actions, and termination.

Let's begin with deployment. There are two ways to deploy your Dataflow job. We can use a direct launch, in which we run the pipeline directly from the development environment. For Java pipelines, this means running the pipeline from Gradle or Maven. For Python, this means running the Python script directly. This method can be used for both batch and streaming pipelines.

We can also use templates. Templates allow us to launch a pipeline without having access to a development environment. Templates are built and deployed as an artifact on Cloud Storage and can be used for both batch and streaming pipelines. Separating the development and execution environments makes it easy to automate your Dataflow deployments and to enable nontechnical users. If you're using an external scheduler like Airflow, you'll be able to use Airflow's built-in support for Cloud Dataflow, which calls a template when invoked; you would supply your pipeline options via the operator arguments. More generally, if you're operating the deployment via CLI tooling, you can provide the options via gcloud commands.

You might notice that Dataflow SQL is not on this list. Dataflow SQL is a special case of a templated deployment: it's actually a user interface built on top of a Flex Template.

When we're deploying our pipeline for the first time, we need to submit the pipeline to the Dataflow service with a unique name. This name will be used by the Cloud Monitoring tools when you look at the monitoring page in the console.

Next, we'll move on to actions that can be taken on in-flight pipelines. These actions are only available to streaming pipelines, since batch jobs can simply be relaunched. As a streaming pipeline processes data, it accumulates state.
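Before we look at in-flight actions, the template-based deployment described above can be sketched as a single gcloud invocation. The job name, template path, region, and parameters below are hypothetical placeholders; the command is collected into an array and echoed rather than executed, so the sketch stays side-effect free.

```shell
# Launch a classic template stored on Cloud Storage -- no development
# environment required. All names and paths here are placeholders.
TEMPLATE_CMD=(gcloud dataflow jobs run my-streaming-job
  --gcs-location=gs://my-bucket/templates/my-template
  --region=us-central1
  --parameters=inputTopic=projects/my-project/topics/events)

# Echo the command for inspection; run "${TEMPLATE_CMD[@]}" to submit it.
echo "${TEMPLATE_CMD[@]}"
```

A Flex Template launch follows the same pattern via `gcloud dataflow flex-template run`, pointing at the template spec file on Cloud Storage.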
It's useful to have ways to preserve that state so that we can manage changes to our pipeline without the risk of permanent data loss. We can use snapshots for this. With snapshots, we can save the state of an executing pipeline before launching the new pipeline. This way, if we need to roll back our pipeline to a previous version, we can do so without losing the data processed by the version being rolled back.

Since streaming pipelines are long-running applications, we're likely to need to modify our pipeline from time to time. Once we've saved the state of the running pipeline with a snapshot, we can safely update it. When you update a job on the Cloud Dataflow service, you replace the existing job with a new job that runs your updated pipeline code. The Cloud Dataflow service retains the job name but runs the replacement job with an updated job ID. If for any reason you're not happy with how the replacement job is running, you can roll back to the prior version by creating a job from a snapshot.

Let's explore these two actions a little more deeply. Dataflow snapshots are useful in several scenarios. As mentioned, a Dataflow snapshot provides a copy of the intermediate state of your pipeline at the moment the snapshot is taken. You can use that snapshot to validate a pipeline update, or use it as a checkpoint that lets you roll back your pipeline in the event of an unhealthy release. You can also use snapshots for backup and recovery use cases; we'll explore this in more depth in the reliability module. Lastly, snapshots create a safe path for migrating pipelines to Streaming Engine. If you want to reap the benefits of smoother autoscaling and superior performance, you can take a snapshot and create a job from that snapshot; the new job will run on Streaming Engine. The flip side of this is that jobs created from snapshots cannot be run with Streaming Engine disabled. Let's see how we use Dataflow snapshots.
We will cover two snapshot workflows: creating snapshots, and creating a job from a snapshot. We can create a snapshot in the UI or using the CLI. In the UI, we can navigate to the job details page of the pipeline of interest, where you'll see a Create Snapshot button to initiate the snapshot. After you click the button, you'll be prompted to choose whether or not your snapshot should be created with a source snapshot. If you are using Pub/Sub, creating a snapshot with source will create a coordinated snapshot of both your unread messages and your accumulated state. This makes it easier to roll back your pipeline to a known point in time. The pipeline will pause processing while the snapshot is being taken; depending on how much state is buffered, this could take a matter of minutes. We recommend planning to take snapshots during periods of the day when latency can be tolerated, such as non-business hours.

You can also create snapshots using the CLI or the API. These methods allow you to automate snapshots of your Dataflow pipelines, which lends itself nicely to scheduling snapshots on, say, a weekly cadence.

To create a job from a snapshot, we have to pass in two extra parameters, as seen in the sample command here. First, we have to enable Streaming Engine with the enable Streaming Engine flag. Second, we pass the snapshot ID to the create-from-snapshot parameter. If you're creating a job from the snapshot for a modified graph, the new graph must be compatible with the prior job; we'll discuss update compatibility shortly.

Now that we've snapshotted our pipeline, we're ready to update it. There are various reasons why you might want to update your Cloud Dataflow job. One is to enhance or otherwise improve your pipeline code. Another is to fix bugs in your pipeline code. You might also want to update your pipeline to handle changes in the data format. Finally, you may want to change your pipeline to account for version changes or other changes in your data source.
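The two snapshot workflows above can be sketched from the CLI. The job ID, snapshot ID, and pipeline file are hypothetical placeholders, and the flag spellings shown (`--enable_streaming_engine`, `--create_from_snapshot`) are the Python SDK's; the Java SDK uses camelCase equivalents (`--enableStreamingEngine`, `--createFromSnapshot`). The commands are echoed rather than executed so the sketch has no side effects.

```shell
# Placeholder job ID and region; substitute values from your project.
JOB_ID="2023-07-01_12_00_00-123456789012345678"
REGION="us-central1"

# 1) Snapshot the running job. --snapshot-sources=true also takes a
#    coordinated Pub/Sub snapshot of the job's unread messages.
SNAPSHOT_CMD=(gcloud dataflow snapshots create
  --job-id="$JOB_ID"
  --region="$REGION"
  --snapshot-sources=true)
echo "${SNAPSHOT_CMD[@]}"

# 2) Launch a new job from the snapshot ID returned above.
#    Jobs created from snapshots must run on Streaming Engine.
SNAPSHOT_ID="my-snapshot-id"   # placeholder
LAUNCH_CMD=(python my_pipeline.py
  --runner=DataflowRunner
  --region="$REGION"
  --enable_streaming_engine
  --create_from_snapshot="$SNAPSHOT_ID")
echo "${LAUNCH_CMD[@]}"
```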
To update your pipeline, you'll need to do a couple of things. First, you need to pass the update and job name options when you submit the new pipeline. You'll have to set the job name to the name of the existing pipeline, or else the old job will not be replaced; this tells Dataflow that you're updating the job rather than deploying a new pipeline. Second, if you added, removed, or changed any transform names, you'll need to tell Dataflow about these changes by providing a transform name mapping.

The replacement job will preserve any intermediate state data from the prior job. Note, however, that changing the windowing or triggering strategies will not affect data that's already buffered or already being processed by the pipeline; in-flight data will still be processed by the transforms in your new pipeline. Additional transforms that you add in your replacement pipeline code may or may not take effect, depending on where the records are buffered. Updates can also be triggered by the API, which can enable continuous deployment contingent on other tests passing.

When you update your job, the Dataflow service performs a compatibility check between your currently running job and your potential replacement job. The compatibility check ensures that things like intermediate state information and buffered data can be transferred from your prior job to your replacement job. This means that some changes are not possible with a streaming update. Let's review the most common compatibility breaks.

Modifying your pipeline without providing a transform name mapping will fail the compatibility check. When you update a job, the Dataflow service tries to match the transforms in your prior job with those in your new job so that intermediate state data for each step can be fully processed. If your changes have renamed or removed any steps, you will have to provide a transform name mapping so that Dataflow can match the state. Adding or removing side inputs will also cause the check to fail.
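Putting the update options together, here is a sketch of the submission using the Python SDK's flag spellings. The pipeline file, job name, region, and transform names are hypothetical placeholders, and the command is echoed rather than executed.

```shell
# Submit a replacement job in place of the running one. --job_name must
# match the existing job; the transform name mapping pairs old step
# names with their new names (all values here are placeholders).
UPDATE_CMD=(python my_pipeline.py
  --runner=DataflowRunner
  --region=us-central1
  --update
  --job_name=my-streaming-job
  --transform_name_mapping='{"FormatRecords":"ParseAndFormatRecords"}')
echo "${UPDATE_CMD[@]}"
```

If the replacement graph passes the compatibility check, the running job is swapped for the new one with its state intact.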
Next is changing coders: the Dataflow service isn't able to serialize or deserialize records if the updated job uses a different data encoding. Running your job in a new zone or a new region will also cause your compatibility check to fail; your replacement job must run in the same zone in which you ran your prior job. Lastly, removing stateful operations: Dataflow fuses multiple steps together for efficiency, but if you remove a state-dependent operation from within a fused step, the check will fail. If your pipeline requires any of these changes, we recommend draining your pipeline and then launching a new job with the updated code.

Now that we've discussed actions you can take on your streaming pipeline, we'll discuss two ways that you can terminate your pipeline. First, we start with drain. Selecting drain tells the Dataflow service to stop consuming messages from your source and finish processing all buffered data. After the last record is processed, the Dataflow workers are torn down. This action is only applicable to streaming pipelines. Second, we can cancel the job. Using the cancel option ceases all data processing and drops any intermediate unprocessed data. We can cancel both batch and streaming jobs.

Let's take a closer look at both of these options. We can terminate our job from the UI: when you navigate to the job details page of your job, you'll find a Stop button in the menu bar. You'll then be prompted to choose between two options for stopping your job: canceling the job or draining it. Let's explore each of these options.

When we drain the pipeline, it will stop pulling data from the source and finish processing data that has already been read into the pipeline. This provides an advantage over canceling your job outright, since no records are dropped. When you relaunch your streaming job, it will continue to process unacknowledged messages from your source.
However, when a streaming pipeline is drained, the watermark is advanced to infinity, which closes all windows. Closing all the windows in this way can result in incomplete aggregations, since draining the pipeline does not wait for open windows to close before it stops pulling from the source system. Consider the impact of incomplete aggregations on downstream systems when draining your pipelines. Beam attaches a PaneInfo object to every element, providing information about the pane the element belongs to, since every pane is implicitly associated with a window. You can use PaneInfo to identify incomplete windows and choose to write that data elsewhere, which can save you the hassle of reconciling incomplete aggregations with your production dataset.

When you cancel a job, Dataflow will immediately begin shutting down the resources associated with your job. The pipeline will stop pulling and processing data immediately, which means you may lose any data that was still being processed when the pipeline was cancelled. If your use case can tolerate data loss, then canceling your job will fit your purpose.

So, to summarize the lifecycle of a streaming pipeline, let's review our deployment options. First, if it's the first time deploying the pipeline, there's no existing state to consider, so you just deploy the pipeline. If there's an existing pipeline and you want to update it, you should take a snapshot of your pipeline. This ensures that you have a working state you can revert your pipeline to if you observe an issue with your new deployment. Once you've taken a snapshot, you're ready to update your job. You need to account for any changes to the names of the pipeline's transforms by providing a mapping from the old names to the new ones. If the updated pipeline is compatible, the update will succeed, and you'll get a new pipeline in place of the old one without losing the state of the previous version of the pipeline.
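Both termination paths described above are also available from the CLI. A sketch with a placeholder job ID and region, echoed rather than executed:

```shell
# Placeholder job ID and region; substitute values from your project.
JOB_ID="2023-07-01_12_00_00-123456789012345678"
REGION="us-central1"

# Drain: stop reading from the source, finish buffered work, tear down.
DRAIN_CMD=(gcloud dataflow jobs drain "$JOB_ID" --region="$REGION")
echo "${DRAIN_CMD[@]}"

# Cancel: stop immediately and drop any in-flight data (batch or streaming).
CANCEL_CMD=(gcloud dataflow jobs cancel "$JOB_ID" --region="$REGION")
echo "${CANCEL_CMD[@]}"
```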
If the update is not possible, then you'll need to choose between the drain and cancel options. If you can replay the source, then you can choose to cancel the pipeline, which will drop any in-flight data. You can then deploy the new pipeline and replay the data from the source. Note that we cannot use a Dataflow snapshot if the pipeline modifications are not update-compatible, but by taking a snapshot of the source with a Pub/Sub snapshot, you can minimize unnecessary reprocessing. If replay is not possible, then you can drain the pipeline. This will not lose data, but you may end up with incomplete aggregations in your output sink. Your downstream business logic should inform the appropriate approach to handling this. Once the pipeline has been drained, you can relaunch the pipeline.

This is the end of the module. You should now be able to apply testing approaches to your Dataflow pipelines, snapshot and update your Dataflow pipelines, and drain and cancel your Dataflow pipelines.