Let's start by exploring Dataflow in more detail. The reason Dataflow is the preferred way to do data processing on Google Cloud is that Dataflow is serverless: you don't have to manage clusters at all. Unlike with Dataproc, autoscaling in Dataflow is step by step, so it's very fine-grained. Plus, as we will see in the next course, Dataflow allows you to use the same code for both batch and stream processing, which is becoming increasingly important.

When building a new data processing pipeline, we recommend that you use Dataflow. If, on the other hand, you have existing pipelines written using Hadoop technologies, it may not be worth it to rewrite everything. Migrate them over to Google Cloud using Dataproc and then modernize as necessary. As a data engineer, we recommend that you learn about both Dataflow and Dataproc and make the choice based on what's best for the specific use case. If the project has existing Hadoop or Spark dependencies, use Dataproc. Please keep in mind that there are many subjective issues when making this decision and that no simple guide will fit every use case. Sometimes the production team might be much more comfortable with a DevOps approach, where they provision machines, than with a serverless approach; in that case too, you might pick Dataproc. If you don't care about streaming and your primary goal is to move existing workloads, then Dataproc would be fine. Dataflow, however, is our recommended approach for building pipelines.

Dataflow provides a serverless way to execute pipelines on batch and streaming data. It's scalable: to process more data, Dataflow will scale out to more machines, and it will do this transparently. The stream processing capability also makes it low latency; you can process the data as it comes in. This ability to process batch and stream with the same code is rather unique. For a long time, batch programming and data processing used to be two very separate and different things. Batch programming dates to the 1940s and the early days of computing, when it was realized that you can think of two separate concepts, code and data, and use code to process data. Of course, both of these were on punch cards, so that's what you were processing: a box of punch cards called a batch. It was a job that started and ended when the data was fully processed. Stream processing, on the other hand, is more fluid. It arose in the 1970s with the idea of data processing being something that is ongoing: the data keeps coming in and you process it. The processing itself tended to be done in micro-batches.

The genius of Beam is that it provides abstractions that unify traditional batch programming concepts and traditional data processing concepts. Unifying programming and processing is a big innovation in data engineering. The four main concepts are PTransforms, PCollections, pipelines, and pipeline runners. A pipeline identifies the data to be processed and the actions to be taken on the data. The data is held in a distributed data abstraction called a PCollection. The PCollection is immutable: any change that happens in a pipeline ingests one PCollection and creates a new one; it does not change the incoming PCollection. The action, or code, is contained in an abstraction called a PTransform. A PTransform handles input, transformation, and output of the data. The data in a PCollection is passed along a graph from one PTransform to another. Pipeline runners are analogous to container hosts such as Google Kubernetes Engine.
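To make these concepts concrete, here is a minimal sketch of a Beam pipeline written with the Python SDK. It assumes the apache_beam package is installed; the element values and transform labels are purely illustrative.

    import apache_beam as beam

    # Each '|' applies a PTransform; each step produces a new, immutable PCollection.
    with beam.Pipeline() as pipeline:
        fruits = pipeline | 'Create' >> beam.Create(['apple', 'banana', 'cherry'])
        upper = fruits | 'Uppercase' >> beam.Map(str.upper)  # new PCollection; input is unchanged
        upper | 'Print' >> beam.Map(print)                   # output step

Notice that the pipeline object ties the graph together, the intermediate results are PCollections, and each labeled step is a PTransform; the runner that executes this graph is chosen separately, as described next.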
The identical pipeline can be run on a local computer, a data center VM, or on a service such as Dataflow in the cloud. The only difference is scale and access to platform-specific services. The service the runner uses to execute the code is called a backend system.

Immutable data is one of the key differences between batch programming and data processing. Immutable data, where each transform results in a new copy, means there is no need to coordinate access control or sharing of the original ingested data, so it enables, or at least simplifies, distributed processing. The shape of a pipeline is not actually just a singular linear progression, but rather a directed graph with branches and aggregations. For historical reasons we refer to it as a pipeline, but a data graph or dataflow might be a more accurate description.

A PCollection represents both streaming data and batch data; there is no size limit to a PCollection. Streaming data is an unbounded PCollection that doesn't end. Each element inside a PCollection can be individually accessed and processed, and this is how distributed processing of a PCollection is implemented: you define the pipeline and the transforms on the PCollection, and the runner handles applying the transformations to each element, distributing the work as needed for scale and with available resources. Once an element is created in a PCollection, it is immutable, so it can never be changed or deleted. Elements represent different data types. In traditional programs, a data type is stored in memory with a format that favors processing: integers in memory are different from characters, which are different from strings and compound data types. In a PCollection, all data types are stored in a serialized state as byte strings. This way, there is no need to serialize data prior to network transfer and deserialize it when it is received. Instead, the data moves through the system in a serialized state and is only deserialized when necessary for the actions of a PTransform.
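As a sketch of what "same code, different runner" looks like in practice, the options below switch the earlier pipeline between a local run and a Dataflow run. The project ID, region, and bucket path are placeholder values, not real resources.

    from apache_beam.options.pipeline_options import PipelineOptions

    # Run locally; DirectRunner is the default backend for development and testing.
    local_options = PipelineOptions(runner='DirectRunner')

    # Run the same pipeline on the Dataflow service.
    # 'my-project', 'us-central1', and the bucket are placeholders.
    dataflow_options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
    )

    # Passing either options object to beam.Pipeline(options=...) leaves the
    # transforms unchanged; only the backend that executes them differs.

The design point is that the pipeline code never references a specific machine or cluster; the runner and its backend decide how elements are distributed, serialized, and processed.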