Welcome to Module 7, ELT versus ETL in PolyBase. First, we'll start with an introduction. In this module, we will review ETL processing, ETL processing with PolyBase, differentiating ETL processing from ELT processing, how SQL Data Warehouse and Microsoft offer ELT solutions, SQL Data Warehouse loading methods using non-PolyBase options, and a use case that takes a deeper dive into ETL processing using PolyBase and ELT solutions using the Microsoft Data Warehouse ELT approach.

Earlier in this course we discussed ETL. ETL is a critical layer in data warehousing, responsible for extracting data, cleaning it, conforming it, and loading it into the target. Enterprises invest in data warehousing projects to enhance their business activity and to measure performance. A data warehouse aims to improve decision-making by providing unified access to several autonomous or semi-autonomous sources. The two most common types of sources are databases and flat files.

ETL is the integration layer in the data warehouse environment. ETL tools pull data from several sources and apply complex transformations to it. The data is then loaded into the target data warehouse store in the environment. The source data may be local or remote. Source data typically comes from operational applications, but external data sources can also be used for enrichment.

During extraction, ETL accesses the available sources, pulls out the relevant data, and reformats it into a specified format. The transformation step is the most laborious one; this is where ETL adds value by cleaning and conforming the data. Cleaning aims to fix problems such as missing or bad data and then deliver clean data to the end users. The loading step sends data to a set of targets for storage. In the data warehousing context, ETL loads data into targets such as fact tables and dimensions.

It is essential to handle the ETL modeling phase efficiently in order to produce simple and understandable models. This method is spread over four steps: identification of the sources, distinguishing candidate sources from active sources, attribute mapping, and annotating diagrams with execution constraints. When changes happen, analyzing the impact of the change is required to avoid errors and mitigate the risk of breaking existing processes. Without a helpful tool and an effective approach to change management, the cost of the maintenance task will be high.

Getting data out of your source system depends on the storage location. The goal is to move the data into delimited text files that PolyBase supports. PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. It also loads from the Hadoop file formats RCFile, ORC, and Parquet, and from Gzip and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width formats, or nested formats such as WinZip, JSON, and XML. If you're exporting from SQL Server, you can use the bcp command-line tool to export the data into delimited text files. A short sketch of a file format definition for such files follows at the end of this introduction.

To land the data in Azure Storage, you can move it to Azure Blob Storage or Azure Data Lake Store; PolyBase can load from either location. You can use the following tools and services to move data to Azure Storage: Azure ExpressRoute, the AzCopy utility, or Azure Data Factory. You might need to prepare and clean the data in the storage account before loading it into SQL Data Warehouse.
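To make those file format requirements concrete, here is a minimal sketch of an external file format definition for Gzip-compressed, UTF-8 encoded, pipe-delimited text files. The object name, field terminator, and string delimiter are placeholders of my own choosing, not anything prescribed by the module; the CREATE EXTERNAL FILE FORMAT statement itself is covered in more detail in the next part of this section.

    -- Sketch: how PolyBase should parse Gzip-compressed, UTF-8 encoded, pipe-delimited text files.
    -- Object name and delimiters are assumptions for illustration only.
    CREATE EXTERNAL FILE FORMAT TextFileFormatGz
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = '|',
            STRING_DELIMITER = '"',
            ENCODING = 'UTF8',
            USE_TYPE_DEFAULT = TRUE
        ),
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
    );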
Data preparation can be performed while the data is in the source as you export it to text files, or after the data is in Azure Storage. It is easiest to work with the data as early in the process as possible.

You need to define external tables in your data warehouse before you can load data. PolyBase uses external tables to define and access the data in Azure Storage. An external table is similar to a database view: it contains the table schema and points to data that is stored outside the data warehouse. Defining external tables involves specifying the data source, the file format of the text files, and the table definitions. You'll need the following T-SQL syntax topics: CREATE EXTERNAL DATA SOURCE, CREATE EXTERNAL FILE FORMAT, and CREATE EXTERNAL TABLE. Once the external objects are defined, you need to align the rows of the text files with the external table and file format definition; the data in each row of the text file must align with the table definition. A sketch that puts these statements together appears at the end of this section.

The best practice is to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use the SQL Data Warehouse MPP engine for data transformations before inserting the data into production tables.

You can use any of the following options to load data with PolyBase: PolyBase with T-SQL, PolyBase with SSIS, PolyBase with ADF, and PolyBase with Azure Databricks. PolyBase with T-SQL works well when your data is in Azure Blob Storage or Azure Data Lake Store. It gives you the most control over the loading process, but requires you to define the external data objects yourself; the other methods define these objects behind the scenes as you map source tables to destination tables. To orchestrate the T-SQL load, you can use ADF, SSIS, or Azure Functions. PolyBase with SSIS works well when your source data is in SQL Server, on-premises or in the cloud. SSIS defines the source-to-destination table mappings and also orchestrates the load. If you already have SSIS packages, you can modify them to work with the new data warehouse destination. PolyBase with ADF relies on Azure Data Factory as another orchestration tool; ADF defines a pipeline and schedules jobs. PolyBase with Azure Databricks transfers data from a SQL Data Warehouse table to a Databricks DataFrame, and writes data from a Databricks DataFrame back to a SQL Data Warehouse table, using PolyBase.

If your data is not compatible with PolyBase, you can use bcp or the SqlBulkCopy API. bcp loads directly into your SQL Data Warehouse without going through Azure Blob Storage and is intended only for small loads. The load performance of these options is significantly slower than PolyBase.

Perform the transformations your workload requires while the data is in a staging table, then move the data into a production table. The INSERT INTO SELECT statement moves the data from the staging table to the permanent one, as sketched at the end of this section.

ETL and ELT both describe processes used to migrate and transform data from one system to another, but the acronyms differ in the order of the steps: ETL stands for Extract, Transform, Load, while ELT stands for Extract, Load, Transform. With ETL, you extract data, transform it into a compatible structure, and load it into a target data warehouse system so business intelligence tools can query and analyze it. ETL generally refers to the data transformation process required to use a data warehouse system such as Amazon Redshift.
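Before we turn to ELT, the following hedged sketch ties together the PolyBase with T-SQL loading steps described above: a credential and external data source pointing at Azure Blob Storage, an external table over the landed text files, and a CREATE TABLE AS SELECT load into a staging table. The storage account, container, object names, and column list are all placeholders I've assumed for illustration; they are not part of the module.

    -- A database master key is required once per database for the scoped credential (skip if one already exists).
    CREATE MASTER KEY;

    -- Credential holding the storage account key; for key-based authentication, IDENTITY is informational.
    CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
    WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

    -- External data source pointing at the Blob Storage container where the delimited text files landed.
    CREATE EXTERNAL DATA SOURCE AzureBlobStore
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://datacontainer@mystorageaccount.blob.core.windows.net',
        CREDENTIAL = AzureStorageCredential
    );

    -- External table: only the schema and a pointer to the files; no data is moved yet.
    CREATE EXTERNAL TABLE dbo.SalesExternal (
        SaleId   INT            NOT NULL,
        SaleDate DATE           NOT NULL,
        Amount   DECIMAL(18, 2) NOT NULL
    )
    WITH (
        LOCATION = '/sales/',
        DATA_SOURCE = AzureBlobStore,
        FILE_FORMAT = TextFileFormatGz
    );

    -- Load into a staging table with CREATE TABLE AS SELECT, letting the MPP engine do the work.
    CREATE TABLE dbo.SalesStaging
    WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
    AS SELECT * FROM dbo.SalesExternal;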
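Once the transformations your workload requires have been applied in the staging table, an INSERT INTO SELECT such as the following moves the rows into the production table. The table names and the cleansing filter are again only illustrative, and the production table is assumed to exist already.

    -- Move cleansed rows from the staging table into the pre-existing production table.
    INSERT INTO dbo.FactSales (SaleId, SaleDate, Amount)
    SELECT SaleId, SaleDate, Amount
    FROM dbo.SalesStaging
    WHERE Amount IS NOT NULL;   -- example of a cleansing rule applied while the data sits in staging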
With ELT, you extract data, immediately load it into the target data lake system, and then transform it so that business intelligence tools can query and analyze it. ELT generally refers to the data transformation process applied to data already located in a data lake or warehouse.

Data lakes are special kinds of data stores that accept any kind of structured or unstructured data. Data lakes don't require you to transform your data before loading it; you can immediately load any kind of raw information into a data lake, no matter the format or lack thereof. Data transformation is still necessary before analyzing the data with a business intelligence platform, but data cleansing, enrichment, and transformation occur after the data is loaded into the data lake.

The following are details you should understand about ELT and data lakes. ELT is a relatively new approach made possible by modern cloud-based server technologies. Cloud-based data warehouses offer near-endless storage capacity and scalable processing power. Platforms like Amazon Redshift and Google BigQuery make ELT pipelines possible because of their processing capabilities. ELT paired with a data lake lets you ingest an ever-expanding pool of raw data immediately as it becomes available; there's no requirement to transform the data into a special format before saving it to the data lake.

ELT transforms only the data required for a particular analysis. Although this can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts, and reports. With ETL, by contrast, if the previously decided structure doesn't allow for a new type of analysis, the entire ETL pipeline and the structure of the data in the OLAP warehouse may require modification. ELT tools and systems are still evolving, so they aren't as reliable as ETL paired with an OLAP database. ELT provides more accurate insights when dealing with massive pools of data, even though it takes more effort to set up. Also, developers who know how to use ELT technology are more difficult to find than ETL developers.

The table on the following few slides lists some of the differences between ETL and ELT. Please review these slides and understand the differences between the two methods. This is the end of the section on ELT versus ETL in PolyBase. Next, we'll talk about how SQL Data Warehouse and Microsoft offer ELT solutions.