Welcome to Module 1, an introduction to data processing with Azure. In this session, we will focus on batch processing. We will divide it into four parts: an introduction, Databricks, transforming data, and use cases.

If you've followed any Microsoft news, there's a good chance you've heard of Microsoft Azure, formerly known as Windows Azure. This cloud computing service is a big part of Microsoft's business, and it competes with similar services from Amazon and Google. At its core, Azure is a public cloud computing platform with solutions including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These can be used for services such as analytics, virtual computing, storage, networking, and much more, and they can replace or supplement your on-premises servers.

A-series VMs have CPU performance and memory configurations best suited to entry-level workloads like development and test. They are economical and provide a low-cost way to get started with Azure. Av2 Standard is the latest generation of A-series VMs, with similar CPU performance but more RAM per CPU and faster disks.

The App Service Environment, or ASE, is a powerful feature offering of the Azure App Service that provides network isolation and improved scale capabilities. It is essentially a deployment of the Azure App Service into a subnet of a customer's Azure Virtual Network. While the feature gave customers what they were looking for in terms of network control and isolation, it did not scale as quickly as the multi-tenant App Service normally did. Microsoft took the feedback to heart, and for ASE version two they focused on making the user experience the same as in the multi-tenant App Service, while still providing the benefits that the ASE provided.
Azure SQL Database is the intelligent, scalable cloud database service that provides the broadest SQL Server engine compatibility, whether you are migrating existing apps or building new apps on Azure for your mission-critical SQL Server workloads. We will be discussing Blob storage and Data Lake Storage later. Azure Cosmos DB is Microsoft's globally distributed, multi-model database service. With the click of a button, Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure regions worldwide.

Azure has a directory of hundreds of services, and it was originally named Windows Azure. You can run virtual machines on Azure with operating systems and services that aren't exclusive to Microsoft. Microsoft is also using Azure to extend Windows, and organizations can now use Azure Active Directory. This service allows organizations to lift and shift apps that use on-premises AD for authentication to the cloud. It extends the capabilities of Active Directory to provide many of the features of on-premises Windows Server Active Directory, but without the effort of installing domain controllers or setting up ExpressRoute or a VPN to connect on-premises DCs to Azure.

Now that we know what Azure can do, let's have a look at its architecture. All the solutions depend on the platform's underlying services. Your code runs within Windows Azure compute, and you don't need to learn any new language, API stack, or other development environment. Three role types run as VMs. There are various broad ways a cloud-based service is consumed and utilized. In the world of cloud computing, there are three different approaches to cloud-based services: first, Infrastructure as a Service, otherwise known as IaaS; second, Platform as a Service, PaaS; and third, Software as a Service, SaaS. Azure is both Infrastructure as a Service and Platform as a Service, which makes the Windows Server operating system and other features available as services.
Let's discuss how you can communicate with your deployments. You will get a DNS name when you deploy your solution to Azure. This name is used to access your hosted applications. You can also configure CNAME entries for a domain you own to access your Azure-hosted application through it. Communication endpoints are part of the role configuration known as the service definition. Azure roles provide endpoints that respond to requests from outside the data center, as well as defining internal endpoints.

.NET development on Azure is done using Visual Studio 2010. The Windows Azure SDK for .NET is installed there and includes tools that simulate Azure compute and Azure Storage on your local machine. You can develop Azure-hosted applications on non-.NET platforms as long as the platform runs on Windows Server 2008.

Azure offers four data storage options. All of them, aside from caching, are supported for use from Azure-hosted roles or non-Azure applications. You can mount an Azure drive, backed by Azure Blob storage, as a local drive within your instance to keep transient data. Blob storage is capable of storing binary data as page blobs and block blobs. Page blobs have better performance for random writes and are ideal for storing VHD data. Page blobs store up to one terabyte of data and avoid charges for empty space by not storing pages of zeros.

Azure Queue storage delivers asynchronous messaging for communication between application components, whether they are running in the cloud, on the desktop, on an on-premises server, or on a mobile device. Queue storage also supports managing asynchronous tasks and building process workflows. It is a service for storing large numbers of messages that can be accessed from anywhere in the world via authenticated calls over HTTP or HTTPS. A single queue message can be up to 64 kilobytes in size, and a queue can contain millions of messages, up to the total capacity limit of a storage account.
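The queue behavior described above can be sketched without any Azure dependency. The following in-memory queue is a toy stand-in, not the real Azure SDK; it only illustrates the two properties the transcript mentions, the 64-kilobyte per-message limit and the work-backlog pattern where producers enqueue messages for consumers to process asynchronously:

```python
from collections import deque

MAX_MESSAGE_BYTES = 64 * 1024  # Azure Queue storage caps a single message at 64 KB


class ToyQueue:
    """Toy stand-in for a Queue storage queue (illustration only):
    producers enqueue work items, consumers drain them later."""

    def __init__(self):
        self._messages = deque()

    def send_message(self, text: str) -> None:
        # Enforce the 64 KB limit the service imposes per message.
        if len(text.encode("utf-8")) > MAX_MESSAGE_BYTES:
            raise ValueError("message exceeds the 64 KB limit")
        self._messages.append(text)

    def receive_message(self):
        # Return the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


# A producer builds a backlog of work; a consumer processes it asynchronously.
backlog = ToyQueue()
for job_id in range(3):
    backlog.send_message(f"resize-image-{job_id}")

processed = []
while (msg := backlog.receive_message()) is not None:
    processed.append(msg)
```

The real service adds durability, visibility timeouts, and HTTPS access on top of this basic enqueue/dequeue shape.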
Queue storage is often used to create a backlog of work to process asynchronously.

Let's now discuss the traditional ETL process. ETL stands for extract, transform, and load. It also includes the transportation of data, overlaps between stages, and changes in flow due to new technologies. Data is extracted from OLTP, otherwise known as transactional, databases, then transformed in a staging area for cleaning and optimization, and finally sent to OLAP databases, otherwise known as analytics databases, for analysis. In the modern ETL process, cloud-based analytics databases perform transformations in place instead of in staging areas. Software as a Service applications now house business-critical data that is more easily accessible, and raw data is more easily analyzed.

Another term which is widely heard is Apache Spark. Apache Spark has as its architectural foundation the Resilient Distributed Dataset, otherwise known as the RDD. It is a read-only multiset of data items distributed over a cluster of machines, maintained in a fault-tolerant way. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was primarily a programming interface, also known as an API, but as of Spark 2.x, use of the Dataset API is encouraged even though the RDD API is not deprecated. The RDD technology still underlies the Dataset API. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode, otherwise known as a native Spark cluster, where you can launch a cluster either manually or with the launch scripts provided by the install package. It is possible to run these daemons on a single machine for testing.

Let's move on to the next module, on Databricks and Data Factory.
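As a recap of the traditional ETL flow described above, here is a minimal sketch in plain Python. The data and field names are made up for illustration; it is not tied to any particular database, but it follows the same three steps: extract rows from a transactional (OLTP) source, transform them in a staging step for cleaning, and load the result into an analytics (OLAP) store:

```python
# Extract: raw rows as they might arrive from an OLTP (transactional) database.
oltp_rows = [
    {"order_id": 1, "amount": "19.99", "region": " east "},
    {"order_id": 2, "amount": "5.00",  "region": "WEST"},
    {"order_id": 3, "amount": "12.50", "region": "east"},
]


def transform(row):
    """Staging step: clean and optimize one row for analytics."""
    return {
        "order_id": row["order_id"],
        "amount": float(row["amount"]),           # cast text to a numeric type
        "region": row["region"].strip().lower(),  # normalize inconsistent values
    }


# Load: the OLAP (analytics) side receives clean rows and can aggregate them.
olap_store = [transform(r) for r in oltp_rows]

revenue_by_region = {}
for row in olap_store:
    revenue_by_region[row["region"]] = (
        revenue_by_region.get(row["region"], 0.0) + row["amount"]
    )
```

In a real pipeline the extract and load steps would be database reads and writes, and, as noted above, a modern cloud analytics database would often run the transform step in place rather than in a separate staging area.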