Welcome to Module 2: Pipelines and Activities. In this module, we will explain what a data pipeline is and how it relates to Azure SQL Data Warehouse, give a brief introduction to Azure Storage and to processing with pipelines and activities in Azure Data Factory, and also look at analyzing logs for HDInsight clusters and using Blob storage on HDInsight.

Azure Data Factory is the platform that solves many data scenarios. It is a Cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows called pipelines that ingest data from disparate data sources. You can build complex ETL processes that transform data visually with data flows or by using Compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database. Additionally, you can publish your transformed data to data stores such as Azure SQL Data Warehouse for business intelligence applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.

Enterprises have data of various types located in disparate sources: On-Premises, in the Cloud, structured, unstructured, semi-structured, and all arriving at different intervals and speeds. The first step in building an information production system is to connect all the required sources of data and processing, such as Software-as-a-Service (SaaS) services, databases, file shares, and FTP web services. The next step is to move the data as needed to a centralized location for subsequent processing. Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. It is expensive and hard to integrate and maintain such systems. With Data Factory, you can use the copy activity in a data pipeline to move data from both On-Premises and Cloud sources to a centralized data store in the Cloud for further analysis. For example, you can collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics Compute service. You can also collect data in Azure Blob Storage and transform it later by using an Azure HDInsight Hadoop cluster.

A Data Factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The pipeline allows you to manage the activities as a set instead of each one individually. You deploy and schedule the pipeline instead of the activities independently. The activities in a pipeline define actions to perform on your data.
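To make the relationship between a pipeline and its activities concrete, here is a minimal sketch of a pipeline definition containing a single copy activity, written as a Python dictionary that mirrors the shape of the JSON Data Factory works with. The dataset names are placeholders, and a complete definition would also reference linked services, triggers, and other settings.

```python
import json

# Hedged, illustrative pipeline definition: one copy activity that moves data
# from a Blob storage dataset into an Azure SQL Data Warehouse dataset.
# "BlobInputDataset" and "SqlDwOutputDataset" are placeholder dataset names.
copy_pipeline = {
    "name": "CopyBlobToSqlDwPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSqlDw",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobInputDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlDwOutputDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlDWSink"},
                },
            }
        ]
    },
}

# You deploy and schedule the pipeline as a whole, not each activity on its own,
# which is why the activities sit together inside one pipeline object.
print(json.dumps(copy_pipeline, indent=2))
```

In practice you would publish a definition like this to a Data Factory through the portal, an SDK, or an ARM template, and attach a trigger to schedule it.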
A typical ETL process collects and refines different types of data, then delivers the data to a data warehouse such as Redshift, Azure SQL Data Warehouse, or BigQuery. ETL also makes it possible to migrate data between a variety of sources, destinations, and analysis tools. As a result, the ETL process plays a critical role in producing business intelligence and executing broader data management strategies. Three steps make up the ETL process and enable data to be integrated from source to destination. These are data extraction, data transformation, and data loading.

The first step is extraction. Few businesses rely on a single data type or system. Most manage data from a variety of sources and use a number of data analysis tools to produce business intelligence. To make a complex data strategy like this work, the data must be able to travel freely between systems and apps. Before data can be moved to a new destination, it must first be extracted from its source. In this first step of the ETL process, structured and unstructured data is imported and consolidated into a single repository. Raw data can be extracted from a wide range of sources.

The next step is transformation. During this phase of the ETL process, rules and regulations can be applied to ensure data quality and accessibility. You can also apply rules to help your company meet reporting requirements. The process of data transformation is made up of several sub-processes. The first is cleansing: inconsistencies and missing values in the data are resolved. Next is standardization: formatting rules are applied to the dataset. Next is deduplication: redundant data is excluded or discarded. After that comes verification: unusable data is removed and anomalies are flagged. Then sorting: data is organized according to type. Finally, there are any other tasks, such as additional or optional rules that can be applied to improve data quality.

The final step in the ETL process is to load the newly transformed data into a new destination. Data can be loaded all at once, a full load, or at scheduled intervals, an incremental load. In an ETL full loading scenario, everything that comes from the transformation assembly line goes into new, unique records in the data warehouse. Though there may be times this is useful for research purposes, full loading produces datasets that grow exponentially and can quickly become difficult to maintain. A less comprehensive but more manageable approach is incremental loading. Incremental loading compares incoming data with what's already on hand, and only produces additional records if new and unique information is found. This architecture allows smaller, less expensive data warehouses to maintain and manage business intelligence.
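To make the distinction concrete, here is a small Python sketch of full versus incremental loading. It assumes each transformed record carries a unique "id" key; the record contents and names are made up for the example.

```python
def full_load(warehouse_rows, incoming_batch):
    """Full load: everything from the transformation step becomes a new record."""
    warehouse_rows.extend(incoming_batch)


def incremental_load(warehouse_rows, incoming_batch, key="id"):
    """Incremental load: only add records whose key has not been seen before."""
    existing_keys = {row[key] for row in warehouse_rows}
    warehouse_rows.extend(row for row in incoming_batch if row[key] not in existing_keys)


warehouse = [{"id": 1, "customer": "Contoso", "total": 120.0}]
batch = [
    {"id": 1, "customer": "Contoso", "total": 120.0},  # already in the warehouse
    {"id": 2, "customer": "Fabrikam", "total": 80.5},  # new and unique
]

incremental_load(warehouse, batch)
print([row["id"] for row in warehouse])  # [1, 2] - the duplicate was skipped
```

A real incremental load would usually compare against keys or watermarks stored in the warehouse itself, but the idea is the same: only new and unique information produces additional records.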
Data pipeline and ETL are often used interchangeably. ETL systems extract data from one system, transform it, and load the data into a database or data warehouse. Legacy ETL pipelines typically run in batches, moving data in one large chunk at a specific time to the target system, usually at regularly scheduled intervals. A data pipeline encompasses ETL as a subset. It refers to a system for moving data from one system to another. This data may or may not be transformed, and may be processed in real time instead of in batches. When streamed, it is processed in a continuous flow. The data doesn't have to be loaded to a database or data warehouse; it can be loaded to any number of targets.

Let's discuss the different types of data in a pipeline, the evolution of data pipelines, and walk through an example pipeline implemented on Azure using batch, real-time, Cloud-native, and open source approaches. Before deploying a data pipeline, you'll want to answer the following questions. Who owns the data pipeline? In a small organization, a data scientist may be responsible for the pipeline, while larger organizations usually have an infrastructure team that is responsible for keeping the pipeline operational. Which teams will be consuming data? It's useful to know which teams will be consuming the data so that you can stream the data to the appropriate teams. For example, marketing may need real-time data on landing page visits to perform attribution for marketing campaigns. Who will QA the pipeline? The data quality of events passed to the pipeline should be thoroughly inspected on a regular basis. Sometimes a product update will cause a tracking event to drop relevant data, and a process should be set up to capture these types of changes.

Data in a pipeline is often referred to by different names based on the amount of modification that has been performed. Data is typically classified with the following labels. Raw data is tracking data with no processing applied. This is data stored in the message encoding format used to send tracking events, such as JSON. Raw data does not yet have a schema applied. It's common to send all tracking events as raw events, because all events can be sent to a single endpoint and schemas can be applied later on in the pipeline. Processed data is raw data that has been decoded into event-specific formats, with a schema applied. For example, JSON tracking events that have been translated into a session start event with a fixed schema are considered processed data. Processed events are usually stored in different event tables and destinations in a data pipeline. Processed data that has been aggregated or summarized is referred to as cooked data. For example, processed data could include session start and session end events, and could be used as input to cooked data that summarizes daily activity for a user, such as number of sessions and total time on site for a web page. Data scientists will typically work with processed data and use tools to create cooked data for other teams.

Just as the term implies, batch processing strategies involve saving data until a specific time and processing it together. Batch processing updates might happen during off-peak hours for massive nightly uploads, but they can also happen on an hourly or even by-the-minute basis when dealing with smaller batches. On-site data warehouse servers are, for the most part, falling by the wayside. That's because Cloud-based data warehouses are more cost-effective and require zero maintenance for the user. Software service providers have also moved to the Cloud or begun to offer Cloud services. Cloud-native tools are tools optimized to work with Cloud-based data, such as data from AWS buckets. They are hosted in the Cloud, which allows you to save money on infrastructure and expert resources because you can rely on the infrastructure and expertise of the vendor hosting your pipeline.

There's nothing better than free, and that's what you get with open source ETL tools. Most open source ETL tools assist with the management of batch processing and streaming scheduled workflows. Scheduled workflow ETL technology like Apache Kafka and Apache Airflow allows you to automate the streaming of information from one data system to another. When building a data warehouse for machine learning insights, these workflows are essential. Apache Airflow is one of the most popular of these open source workflow automation resources; it uses Directed Acyclic Graphs (DAGs) to support your ETL pipelines. Apache Airflow is also useful when building data pipelines to a data warehouse for machine learning analysis, because it includes the hooks and operators required to connect with AWS and Google Cloud, two of the most popular data warehousing platforms.
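As a hedged sketch of what such a scheduled workflow looks like, here is a minimal Airflow DAG with placeholder extract, transform, and load steps. The task functions are stand-ins for real work, and the exact operator import paths can vary between Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables: real tasks would pull from source systems, apply the
# transformation rules, and load results into the data warehouse.
def extract():
    print("extracting raw events")


def transform():
    print("cleansing, standardizing, deduplicating")


def load():
    print("incrementally loading the warehouse")


# A Directed Acyclic Graph that runs the three ETL steps once per day, in order.
with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```

The `>>` operator is what encodes the DAG: each task runs only after the one before it succeeds, which is exactly the kind of scheduled, ordered workflow described above.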
Azure SQL Database is SQL Server in the Cloud, and because it runs in the Cloud, you don't have to worry about maintaining any infrastructure, tweaking database files, or patching operating systems. You also don't have to worry about SQL Server licenses. That is all included in the Azure SQL pricing; just spin it up and use it. These benefits are some of the reasons why running your app in the Cloud is incredibly beneficial. Azure SQL can do almost anything that On-Premises SQL Server can do, with a few exceptions. For instance, Azure SQL doesn't have a SQL Agent running, as there are other services in Azure that you can use for that, like Azure Data Factory.

Azure SQL databases offer a lot of intelligent features. Here are some of them. Dynamic data masking: this enables you to mask sensitive data like credit card numbers. The data itself doesn't change but is obfuscated when it is retrieved by users that aren't allowed to see it. This is a great feature for using production data in your test or even development environment. Geo-replication: this enables you to replicate your complete database to a database in another geographic region. This replication happens live and has a maximum lag of five minutes, but is much faster in practice. Automatic tuning: this analyzes the queries that are executed by Azure SQL databases and determines if something needs to happen to increase performance. It can, for example, tune the indexes on your database and correct query plans. It even checks whether the action that was performed helped and, if it didn't, reverses it. This is much more effective than manual performance tuning in SQL Server. Azure SQL offers many other features like SQL threat detection, data encryption at rest, and Azure AD integration. It is a very mature and advanced service.

So what is Azure SQL Data Warehouse? Well, it is the SQL Server Data Warehouse feature in the Cloud. SQL Server Data Warehouse exists On-Premises as a feature of SQL Server. In Azure, it is a dedicated service that allows you to build a data warehouse that can store massive amounts of data, scale up and down, and is fully managed. As with Azure SQL Database, Azure SQL Data Warehouse is something that you just spin up. You don't have to worry about infrastructure or licenses. Azure SQL Data Warehouse is often used as a traditional data warehouse solution. This means that you would put massive amounts of data in it, using a data schema of tables and columns that you have designed.

Azure SQL Data Warehouse has features that are designed for working with big data and serving it for further analysis and visualization. Some of these features are also available for Azure SQL Database. PolyBase T-SQL queries: these allow you to get data from outside sources like Hadoop or Azure Blob Storage using regular T-SQL queries. You don't need to know the details of the system you are querying, since PolyBase takes care of it. Massively parallel processing: Azure SQL Data Warehouse is designed for data analytics performance when working with massive amounts of data. It can do this because of its MPP architecture. This means that a query is distributed across dedicated Compute nodes, each with its own CPU and memory. The ability to pause and resume the service: unlike running a data warehouse On-Premises, you can actually pause Azure SQL Data Warehouse when you don't need it. When you do this, you don't pay the compute cost for the Data Warehouse, but you still pay the cost for the data stored in it. When you want to use it again, you simply resume the service. In addition, Azure SQL Data Warehouse has similar features to Azure SQL Database, like data encryption at rest and Azure AD integration.
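To give a feel for the PolyBase pattern, here is a hedged sketch of the kind of T-SQL involved, held in Python strings for illustration only. All object names and the storage location are placeholders, and a real setup would also need a database-scoped credential for authentication, which is omitted here.

```python
# Illustrative PolyBase setup: define an external data source over Blob Storage,
# a file format, and an external table, then query it with ordinary T-SQL.
polybase_setup = """
CREATE EXTERNAL DATA SOURCE BlobLogs
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://logs@examplestorage.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.ExternalPageViews (
    UserId   INT,
    Url      NVARCHAR(400),
    ViewedAt DATETIME2
)
WITH (LOCATION = '/pageviews/',
      DATA_SOURCE = BlobLogs,
      FILE_FORMAT = CsvFormat);
"""

# Once the external table exists, it reads like any local table:
example_query = "SELECT Url, COUNT(*) AS views FROM dbo.ExternalPageViews GROUP BY Url;"

print(polybase_setup)
print(example_query)
```

The point of the example is the last line: after the one-time external object setup, querying files sitting in Blob Storage looks like querying any other table.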
Azure SQL Data Warehouse is a limitless analytics service that brings together enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources at scale. Azure SQL Data Warehouse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs. In a Cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store, Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex analysis, SQL Analytics uses PolyBase to query the big data stores. PolyBase uses standard T-SQL queries to bring the data into SQL Analytics tables. SQL Analytics stores data in relational tables with columnar storage. This format significantly reduces the data storage costs and improves query performance. Once data is stored, you can run analytics at massive scale. Compared to traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead of days. The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain insights to make well-informed business decisions.

Azure Blob Storage is Microsoft's object storage solution for the Cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that does not adhere to a particular data model or definition, such as text or binary data. Azure Storage is Microsoft's Cloud storage solution for modern data storage scenarios. Azure Storage offers a massively scalable object store for data objects, a file system service for the Cloud, a messaging store for reliable messaging, and a NoSQL store.

Azure Storage is durable and highly available. Redundancy ensures that your data is safe in the event of transient hardware failures. You can also opt to replicate data across data centers or geographical regions for additional protection from local catastrophe or natural disaster. Data replicated in this way remains highly available in the event of an unexpected outage. It's secure: all data written to Azure Storage is encrypted by the service, and Azure Storage provides you with fine-grained control over who has access to your data. It's scalable: Azure Storage is designed to be massively scalable to meet the data storage and performance needs of today's applications. It's managed: Microsoft Azure handles hardware maintenance, updates, and critical issues for you. It's accessible: data in Azure Storage is accessible from anywhere in the world over HTTP or HTTPS. Microsoft provides client libraries for Azure Storage in a variety of languages, including .NET, Java, Node.js, Python, PHP, Ruby, Go, and others, as well as mature REST APIs. Azure Storage supports scripting in Azure PowerShell or the Azure CLI, and in the Azure portal, Azure Storage Explorer offers an easy visual solution for working with your data.
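For example, here is a minimal sketch of uploading a file to Blob Storage with the Python client library (the azure-storage-blob package, version 12). The connection string, container name, and file name are placeholders, and the container is assumed to exist already.

```python
import os

# Assumes the azure-storage-blob (v12) package: pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

# Placeholder: a real connection string comes from the storage account's access keys.
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

service_client = BlobServiceClient.from_connection_string(connection_string)

# Containers group blobs, much like directories group files; this one is assumed to exist.
container_client = service_client.get_container_client("raw-tracking-events")

# Upload a local file as a block blob (the default blob type).
with open("events.json", "rb") as data:
    container_client.upload_blob(name="2024/01/01/events.json", data=data, overwrite=True)
```

The same account can also be reached through the REST APIs, Azure PowerShell, the Azure CLI, or Azure Storage Explorer, as mentioned above; the client library is simply the programmatic route.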
Azure Storage offers several types of storage accounts. Each type supports different features and has its own pricing model. Consider these differences before you create a storage account, to determine the type of account that is best for your applications. The types of storage accounts are: general-purpose V2 accounts, which are the basic storage account type for blobs, files, queues, and tables. They are recommended for most scenarios using Azure Storage. General-purpose V1 accounts, which are legacy account types for blobs, files, queues, and tables. Use general-purpose V2 accounts instead when possible. There are also BlockBlobStorage accounts. These are accounts with premium performance characteristics for block blobs and append blobs, recommended for scenarios with high transaction rates, or scenarios that use smaller objects or require consistently low storage latency. FileStorage accounts are file-only storage accounts with premium performance characteristics. They are recommended for enterprise or high-performance scale applications. Then finally, BlobStorage accounts, which are legacy blob-only storage accounts. You should use general-purpose V2 accounts instead when possible.

A container organizes a set of blobs, similar to a directory in a file system. A storage account can include an unlimited number of containers, and a container can store an unlimited number of blobs. AzCopy is an easy-to-use command line tool for Windows and Linux that copies data to and from Blob storage, across containers, or across storage accounts.

Let's take a moment to understand what a blob is. The word blob stands for binary large object. Blobs include images, text files, videos, and audio. There are three types of blobs in the service offered by Azure, namely block, append, and page blobs. Block blobs are a collection of individual blocks, each with a unique block ID. Block blobs allow users to upload large amounts of data. Append blobs are made up of blocks optimized for append operations, such as logging. Page blobs are compilations of pages. They allow random read and write operations. If the blob type is not specified when a blob is created, it is set to the block type by default. All blobs must be inside a container in your storage account.

Thanks for listening to the first lesson in this module. Up next, we will talk about processing using pipelines and activities within Azure Data Factory.