Delve into two approaches to converting raw data into analytics-ready data: the Extract, Transform, Load (ETL) process and the contrasting Extract, Load, Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting or calling application.



ETL and Data Pipelines with Shell, Airflow and Kafka
This course is part of multiple programs.



Instructors: Jeff Grossman
63,405 already enrolled
(431 reviews)
What you'll learn
Describe and contrast Extract, Transform, Load (ETL) processes and Extract, Load, Transform (ELT) processes.
Explain batch vs concurrent modes of execution.
Implement an ETL workflow using Bash and Python functions.
Describe data pipeline components, processes, tools, and technologies.


There are 5 modules in this course
ETL (Extract, Transform, and Load) processes are used for cases where flexibility, speed, and scalability of data are important. You will explore key differences between the similar ETL and ELT processes, including the place of transformation, flexibility, Big Data support, and time-to-insight. You will learn that increasing demand for access to raw data is driving the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application, and that data is loaded in batches or streamed continuously.
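The difference in where the transformation happens can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the table names `sales_clean` and `sales_raw`, and the `transform` function are hypothetical, and an in-memory SQLite database stands in for both the warehouse and the data lake.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from an extraction step.
RAW_ROWS = [("alice", "  42.50 "), ("bob", "17"), ("carol", " 8.25")]


def transform(row):
    """Format one raw record: title-case the name, parse the amount."""
    name, amount = row
    return name.title(), float(amount.strip())


def run_etl(conn):
    """ETL: transform inside the pipeline, then load only the cleaned rows."""
    conn.execute("CREATE TABLE sales_clean (name TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales_clean VALUES (?, ?)",
        [transform(row) for row in RAW_ROWS],
    )


def run_elt(conn):
    """ELT: load the raw rows unchanged; transformation happens later."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)


def query_elt_on_demand(conn):
    """The requesting application transforms the raw rows when it reads them."""
    rows = conn.execute(
        "SELECT name, amount FROM sales_raw ORDER BY rowid"
    ).fetchall()
    return [transform(row) for row in rows]


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)

etl_result = conn.execute(
    "SELECT name, amount FROM sales_clean ORDER BY rowid"
).fetchall()
elt_result = query_elt_on_demand(conn)
```

Both paths end with the same cleaned records, but in the ELT path the raw data stays available, so a different application could apply a different transformation to it later.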
What's included
7 videos, 2 readings, 2 assignments, 1 plugin
Extract, transform, and load (ETL) pipelines can be created with Bash scripts and run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
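The latency trade-off between batch and streaming pipelines can be illustrated with a small simulation. This is a rough sketch, not a benchmark: the arrival times are made up, processing time is assumed to be zero, and the function names are hypothetical.

```python
def batch_latencies(arrival_times, batch_size):
    """Per-record latency when records wait until their batch is complete.

    Assumes a batch is processed the instant its last record arrives and
    that processing itself takes no time.
    """
    latencies = []
    for start in range(0, len(arrival_times), batch_size):
        batch = arrival_times[start:start + batch_size]
        processed_at = batch[-1]
        latencies.extend(processed_at - arrived for arrived in batch)
    return latencies


def streaming_latencies(arrival_times):
    """Per-record latency when each record is processed as it arrives."""
    return [0 for _ in arrival_times]


def throughput(record_count, elapsed_seconds):
    """Records processed per second over a measurement window."""
    return record_count / elapsed_seconds


# Hypothetical arrival times in seconds: one record per second.
arrivals = [0, 1, 2, 3, 4, 5]

batch = batch_latencies(arrivals, batch_size=3)
stream = streaming_latencies(arrivals)

mean_batch_latency = sum(batch) / len(batch)
mean_stream_latency = sum(stream) / len(stream)
records_per_second = throughput(len(arrivals), elapsed_seconds=3.0)
```

Under these assumptions the first record of each batch waits longest, which is why streaming is the better fit when the most current data is needed.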
What's included
5 videos, 4 readings, 4 assignments, 1 app item, 1 plugin
The key advantage of Apache Airflow's approach of representing data pipelines as DAGs is that the pipelines are expressed as code, which makes them more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by implementing Airflow's built-in operators. In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and that Airflow logs are saved to local file systems and then sent to cloud storage, search engines, and log analyzers.
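The idea of a pipeline expressed as code, with tasks as nodes in a DAG, can be sketched without Airflow itself. The example below is not Airflow's API: it uses the standard library's `graphlib` (Python 3.9+) as a stand-in, and the task functions, the `TASKS` and `DEPENDENCIES` names, and the shared `context` dictionary are all hypothetical.

```python
from graphlib import TopologicalSorter


def extract(context):
    """Pretend to pull raw values from a source."""
    context["raw"] = ["3", "1", "2"]


def transform(context):
    """Convert the raw strings to integers and sort them."""
    context["clean"] = sorted(int(value) for value in context["raw"])


def load(context):
    """Pretend to write the cleaned values to a destination."""
    context["loaded"] = list(context["clean"])


# The nodes of the DAG: task name -> callable.
TASKS = {"extract": extract, "transform": transform, "load": load}

# The edges of the DAG: task name -> set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished."""
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](context)
    return order, context


order, context = run_dag(TASKS, DEPENDENCIES)
```

Because the pipeline is ordinary code, it can be kept in version control, reviewed, and unit tested like any other module, which is the maintainability argument made above.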
What's included
5 videos, 1 reading, 2 assignments, 4 app items, 1 plugin
Apache Kafka is a very popular open-source event streaming pipeline. An event is a type of data that describes an entity's observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines. In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
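How topics, partitions, producers, and consumers relate can be shown with a toy in-memory model. This is not a Kafka client and leaves out brokers and replication entirely. The `Topic` and `Consumer` classes, the topic name `events`, and the keys are hypothetical, and CRC32 is used only as a stand-in for a key-hashing partitioner.

```python
import zlib


class Topic:
    """A named set of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        """Map a key to a partition so equal keys share one ordered log."""
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Producer side: append an event, return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Reads each partition from the last offset it has seen."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all unread events and advance the stored offsets."""
        events = []
        for partition, log in enumerate(self.topic.partitions):
            events.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return events


topic = Topic("events", num_partitions=3)
for reading in range(3):
    topic.append("sensor-a", reading)
    topic.append("sensor-b", reading)

consumer = Consumer(topic)
first_poll = consumer.poll()
second_poll = consumer.poll()
sensor_a_values = [value for key, value in first_poll if key == "sensor-a"]
```

In this model, events with the same key always land in the same partition, so their relative order is preserved for the consumer, while ordering across different partitions is not guaranteed.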
What's included
4 videos, 1 reading, 2 assignments, 3 app items, 1 plugin
In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these pipelines using real-world scenarios. You will extract, transform, and load data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that the streaming data has been collected in the database table.
What's included
4 readings, 1 assignment, 1 peer review, 3 app items
Learner reviews
431 reviews
- 5 stars: 71.06%
- 4 stars: 17.36%
- 3 stars: 6.48%
- 2 stars: 2.54%
- 1 star: 2.54%
Showing 3 of 431
Reviewed on Jan 20, 2025
Relevant information in recordings, good recap of every video and hand-on lesson in the end to concrete the knowledge.
Reviewed on Jun 21, 2022
It takes 1 hour to connect the lab and start the service.
Reviewed on Jan 25, 2022
It's great introduction for airflow and kafka but still an introduction it is shallow doesn't offer much but at the end you will understand what you need to continue further in both technologies.
¹ Some assignments in this course are AI-graded. For these assignments, your data will be used in accordance with Coursera's Privacy Notice.