You may be thinking at this point, TFX abstracts away data processing for me, so why do I need to learn another framework? I want to convince you that Beam is an abstraction worth learning, because of its incredible flexibility and its unification of the batch and stream data processing that underpins TFX pipelines. My advice to you is to learn Beam, write your code once, and never worry about scaling your data processing again.

Apache Beam provides a unified framework for running batch and streaming data processing jobs on a variety of execution engines. Several TFX libraries use Beam for running tasks, which enables a high degree of scalability across compute clusters. Beam includes support for a variety of execution engines, or runners, including a direct runner that runs on a single compute node, which is very useful for development, testing, or small deployments. Beam provides an abstraction layer that enables TFX to run on any of its supported runners without code modifications.

Let's see how the Beam programming model translates to scalable data processing in TFX pipelines. Recall that Apache Beam scales the computation of many of the underlying libraries for authoring TFX components. TensorFlow Data Validation is built on the Beam programming model for batch computation of whole-dataset statistics. TensorFlow Transform is also built on the Beam programming model for batch feature engineering and feature transformations. TensorFlow Model Analysis is built on the Beam programming model for batch and streaming model evaluation metrics across data splits and slices. In addition, Beam also powers bulk inference via the BulkInferrer component.

The Beam software development kit can work with numerous runtimes, including a full operating system, Python, or Docker containers, and translate your code for a distributed compute cluster via a runner. Components wrap component executor code in Beam pipelines, send them off, and get the results back across multiple different runners such as Spark, Flink, and Google Cloud Dataflow. Beam abstracts away an incredible amount of complexity in distributing your pipeline's data processing, which enables you to focus on improving your model's performance and delivering business impact.

Apache Beam provides a portable API that TFX uses to build sophisticated data-parallel processing pipelines across a variety of execution engines, or runners. It brings a unified framework for batch and streaming data that balances correctness, latency, and cost on large, unbounded, out-of-order, and globally distributed datasets.

Apache Beam contains several key primitives. A Pipeline object and its accompanying set of pipeline execution options. A PCollection abstraction, which represents a potentially distributed, multi-element dataset; you can think of a PCollection as pipeline data, since Beam transforms use PCollection objects as inputs and outputs. PTransforms change, filter, group, analyze, or otherwise process the elements in a collection. A transform creates a new output PCollection without modifying the input collection, and a typical pipeline applies subsequent transforms to each new output PCollection in turn, until processing is complete. ParDo is the core parallel processing transform in the Apache Beam software development kit, invoking a user-specified function on each of the elements of an input PCollection independently and possibly in parallel. Think of PCollections as variables and PTransforms as functions applied to those variables.
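To make these primitives concrete, here is a minimal sketch of a Beam pipeline in Python on the direct runner; the input strings and step names are illustrative, not taken from TFX.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class SplitWords(beam.DoFn):
    """User-specified function that ParDo invokes on each element."""
    def process(self, element):
        yield from element.split()


# Pipeline object with its execution options; DirectRunner executes on a
# single compute node, which is handy for development and testing.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    lines = pipeline | "Create" >> beam.Create(["hello beam", "hello tfx"])  # a PCollection
    words = lines | "SplitWords" >> beam.ParDo(SplitWords())  # ParDo: per-element, possibly parallel
    counts = words | "Count" >> beam.combiners.Count.PerElement()  # a grouping/analyzing PTransform
    _ = counts | "Print" >> beam.Map(print)  # prints, e.g., ('hello', 2)
```

Each `|` applies a PTransform to a PCollection and yields a new PCollection, which is exactly the variables-and-functions mental model described above.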
The shape of a pipeline can be an arbitrarily complex processing graph. IO transforms write the final PCollections out to external storage, and Beam provides read and write transforms for several common data storage types. Lastly, runners are the actual software that accept pipelines and translate them into massively parallel big data processing systems. TFX runners inherit from a specialized TFX runner class; these include runners for local testing and debugging, and those for large-scale systems such as the KubeflowDagRunner for Google Cloud. Dataflow is the Apache Beam runner for Google Cloud and supports Beam's most advanced functionality, such as event-time windowing, watermarks, and triggering, with full management of resources and low-level details. For more information, see the Google Cloud Dataflow documentation on the large-scale processing capabilities you can bring to your TFX pipelines.

The example shown on the screen is a TFX pipeline mapped to Beam primitives. Let's step through each action (a code sketch follows at the end of this section). First, a Beam pipeline is created, and an IO transform ingests raw data into a PCollection. Second, a Beam transform maps a user-defined text decoding function over the text data PCollection. Third, TensorFlow Transform's AnalyzeAndTransformDataset function applies a user-defined preprocessing function for feature engineering. Finally, the transformed PCollection is written out as TFRecords.

Each Beam-supported component is translated into a Beam graph. When a Beam job is running on Google Cloud, you can go into the Dataflow job UI and see the execution of each component's job, as shown in these examples. On the left is an ExampleGen component expressed as a Beam pipeline that ingests data and splits it into train and evaluation sets. On the right is a small Beam pipeline for the Transform component, which is doing feature preprocessing. Each split represents different feature preprocessing, such as normalization, tokenization, and embeddings, computed in parallel. Beam abstracts away an incredible amount of complexity in distributing your TFX pipeline's data processing into a distributed graph.

The diagram on the right shows what an entire TFX Beam pipeline looks like with all the standard components added. As you can see, it barely fits on the slide. The TFX BSL library is the horizontal layer of abstraction for extending or creating new TFX Beam pipeline operations, which can interact with ML Metadata and Cloud Storage for metadata operations. Compare this with the Beam orchestrator code, which is quite different: it abstracts away the pipeline details in favor of runtime arguments. Orchestrator runners have similar abstraction levels, so you can easily switch between runners during development with minimal changes to the runtime arguments supported by that runner; the second sketch below shows this pattern.
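Returning to the four steps above, here is a hedged sketch of how they might look using TensorFlow Transform's Beam implementation. The file path raw_data.txt, the temp directory, and the "text" feature name are illustrative assumptions, not details from the slide.

```python
import apache_beam as beam
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Schema for the raw data: a single string feature (the name is an assumption).
RAW_METADATA = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {"text": tf.io.FixedLenFeature([], tf.string)}))


def preprocessing_fn(inputs):
    """User-defined feature engineering, e.g. a vocabulary lookup."""
    return {"text_ids": tft.compute_and_apply_vocabulary(inputs["text"])}


with beam.Pipeline() as pipeline, tft_beam.Context(temp_dir="/tmp/tft"):
    # 1. Create the pipeline; an IO transform ingests raw data into a PCollection.
    raw_text = pipeline | "ReadText" >> beam.io.ReadFromText("raw_data.txt")
    # 2. A Beam transform maps a text decoding function over the PCollection.
    raw_data = raw_text | "Decode" >> beam.Map(lambda line: {"text": line.strip()})
    # 3. AnalyzeAndTransformDataset applies preprocessing_fn in batch.
    (transformed, transformed_metadata), _ = (
        (raw_data, RAW_METADATA)
        | "AnalyzeAndTransform" >> tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
    # 4. The transformed PCollection is written out as TFRecords.
    coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
    _ = (transformed
         | "Encode" >> beam.Map(coder.encode)
         | "WriteTFRecords" >> beam.io.WriteToTFRecord("transformed_data"))
```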
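And here is a minimal sketch of swapping orchestrator runners while leaving the pipeline definition intact. The pipeline name, root path, and empty component list are placeholders for the standard components discussed above.

```python
from tfx.orchestration import pipeline as tfx_pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner


def create_pipeline() -> tfx_pipeline.Pipeline:
    # The standard components (ExampleGen, Transform, Trainer, ...) would
    # be constructed and listed here; values below are illustrative.
    return tfx_pipeline.Pipeline(
        pipeline_name="my_pipeline",
        pipeline_root="/tmp/pipeline_root",
        components=[],
    )


# Local development and debugging on Beam's direct runner:
BeamDagRunner().run(create_pipeline())

# Switching to a large-scale runner keeps the pipeline definition intact;
# only the runner and its runtime arguments change, e.g. on Google Cloud:
# from tfx.orchestration.kubeflow.kubeflow_dag_runner import KubeflowDagRunner
# KubeflowDagRunner().run(create_pipeline())
```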