IBM
ETL and Data Pipelines with Shell, Airflow and Kafka

This course is part of multiple programs.

Taught in English

Some content may not be translated

Jeff Grossman
Yan Luo
Lavanya Thiruvali Sunderarajan

Instructors: Jeff Grossman

44,506 already enrolled

Included with Coursera Plus

Course

Gain insight into a topic and learn the fundamentals

4.5

(324 reviews)

87%

Intermediate level

Recommended experience

17 hours (approximately)
Flexible schedule
Learn at your own pace

What you'll learn

  • Describe and contrast Extract, Transform, Load (ETL) processes and Extract, Load, Transform (ELT) processes.

  • Explain batch vs concurrent modes of execution.

  • Implement ETL workflow through bash and Python functions.

  • Describe data pipeline components, processes, tools, and technologies.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

11 assignments

There are 5 modules in this course

Extract, Transform, and Load (ETL) processes are used where flexibility, speed, and scalability of data are important. You will explore the key differences between ETL and the similar Extract, Load, Transform (ELT) process: where the transformation takes place, flexibility, Big Data support, and time-to-insight. You will learn that increasing demand for access to raw data is driving the evolution from ETL to ELT. Data extraction draws on technologies such as database querying, web scraping, and APIs. You will also learn that data transformation formats data to suit the target application, and that data is either loaded in batches or streamed continuously.
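As a rough illustration of the extract, transform, and load steps described above, here is a minimal Python sketch. The sample data, column names, and function names are all hypothetical and are not taken from the course labs; a real pipeline would extract from a database, web page, or API rather than an in-memory string.

```python
import csv
import io

# Hypothetical raw input standing in for a database query or API response.
RAW_CSV = "name,height_in\nalice,65\nbob,70\n"


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: title-case names and convert inches to metres."""
    return [
        {
            "name": row["name"].title(),
            "height_m": round(float(row["height_in"]) * 0.0254, 2),
        }
        for row in rows
    ]


def load(rows, target):
    """Load: write the transformed rows to a file-like target as CSV."""
    writer = csv.DictWriter(target, fieldnames=["name", "height_m"])
    writer.writeheader()
    writer.writerows(rows)


def run_etl(text):
    """Run the three steps in ETL order and return the loaded CSV text."""
    out = io.StringIO()
    load(transform(extract(text)), out)
    return out.getvalue()
```

Calling `run_etl(RAW_CSV)` returns the transformed CSV text. In an ELT variant, the raw rows would be loaded first and the transformation would run later, inside the target system.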

What's included

7 videos, 2 readings, 2 assignments, 1 plugin

Extract, Transform, and Load (ETL) pipelines can be built with Bash scripts and run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
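The relationship between latency, throughput, bottlenecks, and parallelization can be sketched with a small idealized model. The model below is an assumption made for illustration, not a formula from the course: latency is taken as the sum of the per-stage processing times for one packet, throughput is limited by the slowest stage, and adding workers to a stage is assumed to divide its effective time evenly with no overhead. The stage times are hypothetical.

```python
def latency_ms(stage_ms):
    """Time for a single packet to pass through every stage, in milliseconds."""
    return sum(stage_ms)


def throughput_per_second(stage_ms, workers=None):
    """Packets per second, limited by the slowest (bottleneck) stage.

    workers[i] is the number of parallel workers on stage i. Perfect
    parallel speedup is assumed, which real pipelines rarely achieve.
    """
    if workers is None:
        workers = [1] * len(stage_ms)
    effective_ms = [t / w for t, w in zip(stage_ms, workers)]
    return 1000 / max(effective_ms)


# Hypothetical extract, transform, and load stage times in milliseconds.
STAGES_MS = [100, 500, 200]

if __name__ == "__main__":
    print(latency_ms(STAGES_MS))                        # 800
    print(throughput_per_second(STAGES_MS))             # 2.0
    print(throughput_per_second(STAGES_MS, [1, 5, 1]))  # 5.0
```

In this sketch, parallelizing the 500 ms transform stage across five workers moves the bottleneck to the 200 ms load stage, so throughput rises from 2 to 5 packets per second while single-packet latency stays at 800 ms.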

What's included

5 videos, 4 readings, 4 assignments, 1 app item, 1 plugin

The key advantage of Apache Airflow's approach of representing data pipelines as directed acyclic graphs (DAGs) is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by instantiating Airflow's built-in operators. In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and that Airflow logs are saved to the local file system and can then be sent to cloud storage, search engines, and log analyzers.
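To illustrate the idea of a pipeline expressed as a DAG in code, the sketch below uses only Python's standard library (`graphlib`, Python 3.9 or later). It is deliberately not Airflow's API: in Airflow, tasks would be operator instances declared in a DAG definition file and the scheduler would run them. The task names, the shared `ctx` dictionary, and `run_dag` are hypothetical.

```python
from graphlib import TopologicalSorter


def extract(ctx):
    ctx["raw"] = [" a ", "b "]


def transform(ctx):
    ctx["clean"] = [s.strip().upper() for s in ctx["raw"]]


def load(ctx):
    ctx["loaded"] = len(ctx["clean"])


# Tasks are the nodes of the graph.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_dag(tasks, dependencies):
    """Run every task once, in an order that respects the dependencies."""
    ctx = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](ctx)
    return order, ctx
```

Because the graph is plain code, it can be versioned, reviewed, and unit tested like any other module, which is the maintainability benefit described above. `TopologicalSorter` also raises `CycleError` if the dependencies contain a cycle, enforcing the "acyclic" part of a DAG.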

What's included

5 videos, 1 reading, 2 assignments, 4 app items, 1 plugin

Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity's observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. The Kafka Streams API is a client library that supports data processing in event streaming pipelines. In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
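How topics, partitions, producers, and consumers relate can be shown with a toy in-memory model. This is a conceptual sketch only and does not use any Kafka client library: the `Topic` and `Consumer` classes are hypothetical, brokers and replication are left out, and CRC32 is a stand-in for the hash a real Kafka producer applies to keys.

```python
import zlib


class Topic:
    """A named stream of events split across a fixed number of partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Producer side: route an event to a partition by hashing its key.

        A stable hash keeps all events with the same key in one partition,
        which preserves their relative order.
        """
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class Consumer:
    """Reads new events and remembers one offset per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return events added since the previous poll, then advance offsets."""
        events = []
        for i, partition in enumerate(self.topic.partitions):
            events.extend(partition[self.offsets[i]:])
            self.offsets[i] = len(partition)
        return events
```

For example, appending two events with the key `"car-1"` to `Topic("toll", 3)` places both in the same partition, so a consumer sees them in the order they were produced. Ordering across different partitions is not guaranteed in this model.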

What's included

4 videos, 1 reading, 2 assignments, 3 app items, 1 plugin

In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these pipelines using real-world scenarios. You will extract, transform, and load data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that the streaming data has been collected in the database table.

What's included

4 readings, 1 assignment, 1 peer review, 3 app items

Instructors

Instructor ratings
4.7 (87 ratings)
Jeff Grossman
IBM
2 Courses, 55,866 learners
Yan Luo
IBM
7 Courses, 291,637 learners

Offered by

IBM

Learner reviews

Showing 3 of 324

4.5 (324 reviews)

  • 5 stars: 70.12%
  • 4 stars: 17.68%
  • 3 stars: 6.09%
  • 2 stars: 3.35%
  • 1 star: 2.74%

DL: 5, reviewed on Sep 6, 2022
MB: 5, reviewed on Oct 11, 2022
BN: 5, reviewed on Mar 30, 2023
