Welcome to Module 3, Azure Linked Services. In this module, we will discuss the following. What is a dataset? What are blob containers and linked services? How are linked services created for data stored within Data Factory? How do you identify pipelines for a data factory? What are data stores, Azure blob datasets, containers, and folders? How do you create a linked service and connect Data Factory to external resources? And finally, how do you process input blobs with Azure Data Factory?

A VPN gateway is a specific type of virtual network gateway that is used to send encrypted traffic between an Azure virtual network and an on-premises location over the public Internet. You can also use a VPN gateway to send encrypted traffic between Azure virtual networks over the Microsoft network. Each VNet can have only one VPN gateway. Application services are a pool of services, such as load balancing, application performance monitoring, application acceleration, auto-scaling, micro-segmentation, service proxy, and service discovery, needed to optimally deploy, run, and improve applications. Configuring a VNet-to-VNet connection is a simple way to connect VNets; when you connect a virtual network to another virtual network with the VNet-to-VNet connection type, it is similar to creating a site-to-site IPsec connection to an on-premises location.

The Azure App Service has two variations: the multi-tenant systems that support the full range of pricing plans except Isolated, and the App Service Environment, which deploys into your VNet and supports Isolated pricing plan apps. If your app is in an App Service Environment, then it is already in a VNet and doesn't require the use of VNet integration features to reach resources in the same VNet.
There are two forms of the VNet integration feature. The first form enables integration with VNets in the same region and requires a subnet in a VNet in that region. The other form enables integration with VNets in other regions or with classic VNets; this version of the feature requires the deployment of a virtual network gateway in your VNet. This is the point-to-site VPN-based feature and is only supported with Windows apps.

Data movement activities move data between supported source and sink data stores. Data transformation activities transform data using compute services such as Azure HDInsight, Azure Batch, and Azure Machine Learning. Linked services are connections to data sources and destinations; data sources or destinations may be in Azure or on-premises. Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for a data factory to connect to external resources. The dataset represents the structure of the data within the linked data stores, and the linked service defines the connection to the data source. For example, an Azure Storage linked service links a storage account to the data factory, and an Azure blob dataset represents the blob container and the folder within that Azure storage account that contains the input blobs to be processed.

JSON is a popular textual data format that you use for exchanging data in modern web and mobile applications. JSON is also used for storing unstructured data in log files or NoSQL databases such as Microsoft Azure Cosmos DB. Here, we will examine a dataset in Azure SQL using an Azure linked service, providing frequency and interval as parameters. This is a JSON copy script that has three columns: slice timestamp, project name, and page views. This JSON script looks at a copy command with frequency, interval, and offset variables.
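As a sketch of the dataset definitions described above, the following JSON shows what an Azure blob dataset might look like in the Data Factory v1 style. The dataset name, linked service name, folder path, and column names are illustrative assumptions, not taken from the course; the availability section carries the frequency and interval parameters just mentioned.

```json
{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/inputfolder/",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "structure": [
      { "name": "slicetimestamp", "type": "String" },
      { "name": "projectname", "type": "String" },
      { "name": "pageviews", "type": "String" }
    ],
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
```

Here, an availability of frequency Hour and interval 1 tells Data Factory to produce or expect a dataset slice every hour.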
This JSON dataset script looks at a copy command with frequency, interval, offset, and style variables. External datasets are those not produced by running a pipeline in the data factory. If a dataset is marked as external, an external data policy may be defined to influence the behavior of the dataset slice availability. Unless a dataset is being produced by Data Factory, it should be marked as external; this setting generally applies to the inputs of the first activity in a pipeline, unless activity or pipeline chaining is being used. You can create datasets that are scoped to a pipeline by using the datasets property. These datasets can only be used by activities in that pipeline. Scoped datasets are supported only within onetime pipelines, where the pipeline mode is set to OneTime. The following example defines a pipeline with two datasets to be used within the pipeline. This is a continuation of the previous script, with additional information about input and output datasets, with Azure blob storage being provided using input and output linked services.

Here's a quick overview of what we have discussed. A data factory can have one or more pipelines; the activities in a pipeline define actions to perform on the data. A dataset is a named view of data that points to, or references, the data you want to use in your activities as inputs and outputs. Before creating a dataset, you must create a linked service to link your data store to the data factory. Linked services are like connection strings, which define the connection information needed for Data Factory to connect to external resources. The dataset represents the structure of the data within the linked data stores, and the linked service defines the connection to the data source. So an Azure Storage linked service links a storage account to the data factory, and an Azure blob dataset represents the blob container and the folder within that Azure storage account that contains the input blobs to be processed.
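The external dataset setting discussed above can be sketched as the fragment below, which marks a dataset as external and attaches an external data policy. The property names follow the Data Factory v1 schema as best recalled here, and the folder path and retry values are illustrative assumptions.

```json
{
  "name": "AzureBlobExternalInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": { "folderPath": "mycontainer/externaldata/" },
    "availability": { "frequency": "Hour", "interval": 1 },
    "external": true,
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}
```

Because external is true, Data Factory does not try to produce this dataset itself; the external data policy governs how slice availability is retried when the data has not yet arrived.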
Here, we shed light on data store type properties and integration runtime references used within a dataset's linked service. Here, an Azure Storage linked service breaks down a connection string into an account name and account key, which are used to reach the storage account at runtime.

Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers. We'll explore each of the following: connect and collect; transform and enrich; CI/CD and publish; and monitoring. Enterprises have data of various types located in different sources, such as on-premises and in the cloud; structured, unstructured, and semi-structured; all arriving at different intervals and speeds. The first step in building an information production system is to connect to all the required sources of data and processing. The next step is to move the data as needed to a centralized location for subsequent processing. You can also use the copy activity in a data pipeline to move data from on-premises and cloud data stores to a centralized data store in the cloud for further analysis. So you could collect data in Azure Data Lake Storage and transform it later by using an Azure Data Lake Analytics compute service.

Data flows allow data engineers to develop graphical data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Spark clusters. Data flow activities can be operationalized via existing Data Factory scheduling, control flow, and monitoring capabilities. You create a linked service for the compute environment and then use that linked service when defining a transformation activity. There are two types of compute environment supported by Data Factory. The first is on-demand: in this case, the compute environment is fully managed by Data Factory. It is automatically created by the Data Factory service before a job is submitted to process data, and removed when the job is completed.
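A linked service like the one described above, with its connection string and an integration runtime reference, might be sketched in the Data Factory v2 style as follows. The linked service name is illustrative, and the account name and key placeholders stand in for real credentials.

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    },
    "connectVia": {
      "referenceName": "AutoResolveIntegrationRuntime",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

The connectVia section is the integration runtime reference: it names the runtime that carries out data movement against this store when an activity executes.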
You can configure and control granular settings of the on-demand compute environment for job execution, cluster management, and bootstrapping actions. The second is bring your own: in this case, you register your own compute environment as a linked service in Data Factory. The compute environment is managed by you, and the Data Factory service uses it to execute the activities.

With visual tools, you can iteratively build, debug, deploy, operationalize, and monitor your big data pipelines. Now, you can follow industry-leading best practices to do continuous integration and deployment for your extract, transform, and load and extract, load, and transform workloads. You can do this across multiple environments such as dev, test, prod, and more. Essentially, you can incorporate the practice of testing your code base changes and pushing the tested changes to a test or prod environment automatically.

Azure Monitor Logs is a service in Azure Monitor that monitors your cloud and on-premises environments to maintain their availability and performance. It collects data generated by resources in your cloud and on-premises environments and from other monitoring tools to provide analysis across multiple sources. You can think of a Log Analytics workspace as a unique Azure Monitor Logs environment with its own data repository, data sources, and solutions. Currently, you can use Azure Monitor Logs with the following HDInsight cluster types: Hadoop, HBase, Interactive Query, Kafka, Spark, and Storm.

An Azure subscription might have one or more Azure Data Factory instances. Azure Data Factory is composed of four key components. These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.
Data pipelines underlie many data analytics solutions. As the name suggests, a data pipeline takes in raw data, cleans and reshapes it as needed, and then typically performs calculations or aggregations before storing the processed data. The processed data is consumed by clients, reports, or APIs. Mapping data flows are visually designed data transformations in Azure Data Factory. Data flow activities can be operationalized via existing Data Factory scheduling, control flow, and monitoring capabilities. Mapping data flows provide a fully visual experience with no coding required. You may use a copy activity to copy data from one data store to another, or use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

Some of the key tasks in data science involve basic exploration of new or existing data. Raw data is given structure, data can be joined to other datasets, features are selected for later analysis, and much more. Depending on the questions to which you seek answers, as well as other requirements, the process repeats until you have data that is ideal for further, more advanced analytics.

Linked services are used for two purposes in Data Factory: to represent a data store, which includes, but is not limited to, an on-premises SQL Server database, Oracle database, file share, or Azure blob storage account; and to represent a compute resource that can host the execution of an activity. For example, HDInsight Hive activities run on an HDInsight Hadoop cluster. Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. Pipeline runs are typically instantiated by passing arguments to the parameters defined in pipelines. The arguments can be passed manually or within the trigger definition.
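To illustrate how a trigger passes arguments to pipeline parameters, the two fragments below sketch a pipeline containing a copy activity, and a schedule trigger that invokes it. Every name, dataset reference, and schedule value here is a hypothetical example, not taken from the course.

```json
{
  "name": "CopyPipeline",
  "properties": {
    "parameters": {
      "outputFolder": { "type": "String", "defaultValue": "output/" }
    },
    "activities": [
      {
        "name": "CopyBlobToBlob",
        "type": "Copy",
        "inputs": [ { "referenceName": "AzureBlobInput", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "AzureBlobOutput", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}
```

```json
{
  "name": "HourlyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2021-01-01T00:00:00Z"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "CopyPipeline", "type": "PipelineReference" },
        "parameters": { "outputFolder": "hourly/" }
      }
    ]
  }
}
```

The trigger's pipelines section is where arguments are supplied: each hourly run receives "hourly/" for the outputFolder parameter, overriding the pipeline's default value.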
A dataset is a strongly typed parameter and a reusable, referenceable entity. An activity can reference datasets and can consume the properties defined in the dataset definition. A linked service is a strongly typed parameter that contains connection information to either a data store or a compute environment. It is also a reusable, referenceable entity. Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger. Variables can be used inside pipelines to store temporary values, and can also be used in conjunction with parameters to enable passing values between pipelines, data flows, and other activities. Up next, we will discuss identifying pipelines for a Data Factory.
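As a closing sketch of parameters and variables working together, the fragment below declares a pipeline variable and sets it from a pipeline parameter using a Set Variable activity. The pipeline name, parameter and variable names, and the expression are illustrative assumptions.

```json
{
  "name": "ControlFlowPipeline",
  "properties": {
    "parameters": { "sourceFolder": { "type": "String" } },
    "variables": { "processedPath": { "type": "String" } },
    "activities": [
      {
        "name": "SetProcessedPath",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "processedPath",
          "value": "@concat(pipeline().parameters.sourceFolder, '/processed')"
        }
      }
    ]
  }
}
```

The parameter is supplied when the pipeline is invoked, while the variable holds a temporary value computed during the run; later activities in the same pipeline could read it as variables('processedPath').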