Designing data processing systems includes designing flexible data representations, designing data pipelines, and designing data processing infrastructure. You're going to see that these three items show up in the first part of the exam with similar but not identical considerations. The same questions of interest show up in different contexts: data representation, pipelines, and processing infrastructure. For example, innovations in the technology could make the data representation of a chosen solution outdated, the data processing pipeline might have been implemented with very involved transformations that are now available as a single efficient command, and the infrastructure could be replaced by a service with more desirable qualities. However, as you'll see, there are additional concerns with each part. For example, system availability is important to pipeline processing but not to data representation, and capacity is important to processing but not to the abstract pipeline or the representation.

Think about data engineering and Google Cloud as a platform consisting of components that can be assembled into solutions. Let's review the elements of GCP that form the data engineering platform. Storage and databases: services that enable storing and retrieving data, with different storage and retrieval methods that make them more efficient for specific use cases. Server-based processing: services that enable application code and software to run, making use of stored data to perform operations, actions, and transformations and produce results. Integrated services: combined storage and scalable processing in a framework designed to process data rather than general applications, more efficient and flexible than isolated server-and-database solutions. Artificial intelligence: methods to help identify, tag, categorize, and predict, actions that are very hard or impossible to accomplish in data processing without machine learning.
Pre- and post-processing services: working with data and pipelines before processing, such as data cleanup, or after processing, such as data visualization. Pre- and post-processing are important parts of a data processing solution. Infrastructure services: all the framework services that connect and integrate data processing and IT elements into a complete solution, such as messaging systems, data import and export, security, monitoring, and so forth.

Storage and database systems are designed and optimized for storing and retrieving. They're not really built to do data transformation. It's assumed in their design that the computing power necessary to perform transformations on the data is external to the storage or database. The organization method and access method of each of these services is efficient for specific cases. For example, a Cloud SQL database is very good at storing consistent individual transactions, but it's not really optimized for storing large amounts of unstructured data like video files. Database services perform minimal operations on the data within the context of the access method; for example, SQL queries can aggregate, accumulate, count, and summarize the results of a search. Here's an exam tip: know the differences between Cloud SQL and Cloud Spanner and when to use each. Service differentiators include access methods, the cost or speed of specific actions, the sizes of data, and how data is organized and stored. Details and differences between the data technologies are discussed later in this course. Another exam tip: know how to identify technologies backwards from their properties. For example, which data technology offers the fastest data ingest? Which one might you use for ingesting streaming data? Managed services are ones where you can still see the individual instance or cluster. Exam tip: managed services still have some IT overhead.
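To make that point about aggregation within the access method concrete, here's a minimal sketch using Python's built-in sqlite3 as a local stand-in for a relational service like Cloud SQL (the table and data are invented for illustration). The database counts and summarizes inside the query itself; anything heavier would need external compute:

```python
import sqlite3

# In-memory database as a stand-in for a Cloud SQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 10.0)],
)

# The database aggregates, counts, and summarizes within the query;
# more sophisticated transformations happen outside the database.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(rows)  # [('alice', 2, 40.0), ('bob', 1, 20.0)]
```

The same idea carries over to the managed services: the query language defines the limits of what the storage layer will compute for you.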
It doesn't completely eliminate the overhead or manual procedures, but it minimizes them compared with on-premises solutions. Serverless services remove more of the IT responsibility, so managing the underlying servers is not part of your overhead and the individual instances are not visible. A more recent addition to this list is Cloud Firestore. Cloud Firestore is a NoSQL document database built for automatic scaling. It offers high performance and ease of application development, and it includes a Datastore compatibility mode.

As mentioned, storage and databases provide limited processing capabilities, and what they do offer is in the context of search and retrieval. But if you need to perform more sophisticated actions and transformations on the data, you'll need data processing software and computing power. So where do you get these resources? You could use any of these computing platforms to write your own application, or parts of an application, that uses your storage or database services. You could install open-source software, such as MySQL, an open-source database, or Hadoop, an open-source data processing platform, on Compute Engine. Build-your-own solutions are driven mostly by business requirements. They generally involve more IT overhead than using a Cloud platform service. These three data processing services feature in almost every data engineering solution. Each overlaps with the others, meaning that some work could be accomplished in any two or even all three of these services. Advanced solutions may use one, two, or all three. Data processing services combine storage and compute, and automate the storage and compute aspects of data processing through abstractions. For example, in Cloud Dataproc, the data abstraction with Spark is a resilient distributed dataset, or RDD, and the processing abstraction is a directed acyclic graph, or DAG. In BigQuery, the abstractions are table and query, and in Dataflow, the abstractions are PCollection and pipeline.
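To see what the pipeline abstraction buys you, here's a toy sketch in plain Python. This is not the actual Dataflow or Beam API; the function and data are invented. The idea is the same, though: a collection flows through a chain of transformations, the way a PCollection moves through a pipeline, and the runner is free to decide how each step actually executes:

```python
from functools import reduce

# Toy stand-in for the pipeline abstraction: each step is a pure
# transformation applied to a collection, analogous to a PCollection
# flowing through a chain of transforms.
def apply_pipeline(collection, *transforms):
    return reduce(lambda data, step: step(data), transforms, collection)

lines = ["to be", "or not", "to be"]

result = apply_pipeline(
    lines,
    lambda data: [w for line in data for w in line.split()],  # flatten to words
    lambda data: {w: data.count(w) for w in set(data)},       # count each word
)

print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1} (in some key order)
```

Because the user only declares the transformations, the underlying system can parallelize, distribute, or reorder the work, which is exactly the adaptability described next.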
Implementing storage and processing as abstractions enables the underlying systems to adapt to the workload, and the data engineer to focus on the data and business problems that they're trying to solve.

There's great potential value in product or process innovation using machine learning. Machine learning can make unstructured data, such as logs, useful by identifying or categorizing the data and thereby enabling business intelligence. Recognizing an instance of something that exists is closely related to predicting a future instance based on past experience. Machine learning is used for identifying, categorizing, and predicting. It can make unstructured data useful. Your exam tip is to understand the array of machine learning technologies offered on GCP, and when you might want to use each.

A data engineering solution involves data ingest, management during processing, analysis, and visualization. These elements can be critical to the business requirements. Here are a few services that you should be generally familiar with. Data transfer services operate online, and the Transfer Appliance is a shippable device used for synchronizing data in the Cloud with an external source. Data Studio is used for visualization of data after it has been processed. Cloud Dataprep is used to prepare or condition data and to prepare pipelines before processing the data. Cloud Datalab is a notebook, a self-contained workspace that holds code, executes the code, and displays results. Dialogflow is a service for creating chatbots. It uses AI to provide a method for direct human interaction with data. Your exam tip here is to familiarize yourself with the infrastructure services that show up commonly in data engineering solutions. Often they're employed because of key features they provide. For example, Cloud Pub/Sub can hold a message for up to seven days, providing resiliency to data engineering solutions that would otherwise be very difficult to implement.
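The decoupling that Pub/Sub provides can be sketched with Python's standard library. Here queue.Queue is only a local stand-in for a topic; it has none of Pub/Sub's durability, fan-out, or seven-day retention. What it does show is the key property: the publisher and subscriber never interact directly, so data arrival is decoupled from data ingest:

```python
import queue
import threading

# Local stand-in for a Pub/Sub topic: publisher and subscriber are
# decoupled, each running at its own pace.
topic = queue.Queue()
received = []

def publisher():
    for i in range(5):
        topic.put(f"event-{i}")  # data "arrives" whenever the source emits it
    topic.put(None)              # sentinel: no more messages

def subscriber():
    while True:
        msg = topic.get()        # ingest happens at the consumer's own pace
        if msg is None:
            break
        received.append(msg)

t1 = threading.Thread(target=publisher)
t2 = threading.Thread(target=subscriber)
t1.start(); t2.start()
t1.join(); t2.join()

print(received)  # ['event-0', 'event-1', 'event-2', 'event-3', 'event-4']
```

In a real solution, the durable, managed version of this buffer is what lets a streaming pipeline survive bursts of arriving data or a temporarily unavailable consumer.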
Every service in Google Cloud Platform could be used in a data engineering solution. However, some of the most common and important services are shown here. Cloud Pub/Sub, a messaging service, features in virtually all live or streaming data solutions because it decouples data arrival from data ingest. Cloud VPN, Partner Interconnect, or Dedicated Interconnect play a role whenever there's data on premises that must be transmitted to services in the Cloud. Cloud IAM, firewall rules, and key management are critical in some verticals, such as the healthcare and financial industries. Every solution needs to be monitored and managed, which usually involves panels displayed in the Cloud Console and data sent to Stackdriver Monitoring. It's a good idea to examine sample solutions that use data processing or data engineering technologies and pay attention to the infrastructure components of the solution. It's important to know what the services contribute to the data solutions and to be familiar with key features and options.

There are a lot of details that I wouldn't memorize. For example, the exact number of IOPS supported by a specific instance is something I would expect to look up and not know. Also, the cost of a particular instance type compared with another instance type, the actual values, is not something I would expect I'd need to know as a data engineer. I would look these details up if I needed them. However, the fact that an N4 standard instance has higher IOPS than an N1 standard instance, or that an N4 standard costs more than an N1 standard, are the kinds of concepts I would need to know as a data engineer.