Coursera

Open source Data Engineering with Spark, dbt & Airflow Professional Certificate

Build Production Data Pipelines at Scale.

Explore Spark, dbt, and Airflow to design, automate, and deploy enterprise-grade data pipelines.

Earn a career credential that demonstrates your expertise
Intermediate level

Recommended experience

4 weeks to complete
at 10 hours a week
Flexible schedule
Learn at your own pace

What you'll learn

  • Build modular, production-grade data pipelines using Apache Spark, dbt, and Airflow to ingest, transform, and load data at scale.

  • Design and implement dimensional data models including star schemas, SCD Type 2, and incremental load strategies for data warehouses.

  • Optimize distributed data processing by resolving Spark shuffle, skew, and partitioning issues to improve pipeline performance.

  • Automate deployments and enforce data quality using CI/CD pipelines, Docker containers, and automated testing frameworks like Great Expectations.
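To make the ingest-transform-load pattern above concrete, here is a minimal, hypothetical sketch in plain Python. The course itself uses Spark, dbt, and Airflow for these stages; every function and record here is invented for illustration only:

```python
# Minimal ETL sketch: each stage is a pure function, composed like a pipeline DAG.
# All names and data are illustrative, not taken from the course materials.

def extract():
    # Stand-in for reading from a database, API, or stream.
    return [
        {"order_id": 1, "amount": "120.50", "country": "BR"},
        {"order_id": 2, "amount": "80.00",  "country": "FR"},
        {"order_id": 3, "amount": "oops",   "country": "BR"},  # bad record
    ]

def transform(rows):
    # Cast types and drop rows that fail validation (a crude quality gate).
    clean = []
    for row in rows:
        try:
            clean.append({**row, "amount": float(row["amount"])})
        except ValueError:
            pass  # in production this would be routed to a dead-letter table
    return clean

def load(rows, warehouse):
    # Append-only load into an in-memory stand-in for a warehouse table.
    warehouse.setdefault("orders", []).extend(rows)
    return len(rows)

warehouse = {}
loaded = load(transform(extract()), warehouse)
print(loaded)  # 2 -- the malformed row is filtered out by the quality gate
```

In a real deployment each function would be an Airflow task (or a dbt model), but the shape — small composable stages with validation between them — is the same.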

Details to know

Shareable certificate

Add to your LinkedIn profile

Taught in English
Recently updated!

March 2026

See how employees at top companies are mastering in-demand skills

[Logos of Petrobras, TATA, Danone, Capgemini, P&G, and L'Oréal]

Advance your career with in-demand skills

  • Receive professional-level training from Coursera
  • Demonstrate your technical proficiency
  • Earn an employer-recognized certificate from Coursera

Professional Certificate - 6 course series

What you'll learn

  • Build end-to-end data pipelines that automatically ingest from databases, APIs, and streams using Spark, dbt, and Airflow.

  • Design data models with historical tracking using SCD Type 2 patterns to preserve complete change history for analytics.

  • Create automated workflows with intelligent retry logic, SLA monitoring, and parameterization for production reliability.

  • Optimize Spark job performance using partitioning and caching strategies to achieve 30%+ runtime improvements.
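The SCD Type 2 pattern mentioned above preserves history by closing out the current version of a dimension row and inserting a new one, rather than overwriting. A minimal sketch in plain Python — column names (`valid_from`, `valid_to`, `is_current`) are conventional but the function and data are invented; in the course this would be done with SQL or dbt snapshots:

```python
from datetime import date

def scd2_upsert(dim, key, new_attrs, as_of):
    """Apply an SCD Type 2 change: expire the current row, insert a new version."""
    for row in dim:
        if row["customer_id"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim  # no attribute changed, nothing to do
            row["valid_to"] = as_of       # close out the old version
            row["is_current"] = False
    dim.append({"customer_id": key, **new_attrs,
                "valid_from": as_of, "valid_to": None, "is_current": True})
    return dim

dim_customer = [{"customer_id": 42, "city": "Lyon",
                 "valid_from": date(2024, 1, 1), "valid_to": None,
                 "is_current": True}]

# Customer moves: history is preserved, not overwritten.
scd2_upsert(dim_customer, 42, {"city": "Paris"}, date(2025, 6, 1))
print(len(dim_customer))  # 2 -- one expired row, one current row
```

Analytics queries can then join facts to the dimension version that was valid at the time of the event, instead of only the latest state.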

Skills you'll gain

Category: Data Architecture
Category: Data Modeling
Category: Data Processing
Category: Data Quality
Category: Data Pipelines
Category: Data Integration
Category: Data Transformation
Category: Apache Airflow
Category: Apache Spark
Category: Extract, Transform, Load
Category: Data Warehousing
Category: Enterprise Security
Category: Data Flow Diagrams (DFDs)
Category: Data Validation
Category: Database Development
Category: Configuration Management

What you'll learn

  • Optimize Spark job performance through strategic partitioning and caching, achieving 30%+ runtime improvements using data access analysis.

  • Implement transactional data lakes with Delta format, enabling versioning, ACID operations, and schema evolution for reliable datasets.

  • Provision secure cloud data infrastructure using IAM policies, private networks, and encrypted storage following security best practices.

  • Evaluate and benchmark storage formats (Parquet, ORC, Avro) to select optimal solutions for analytical workloads and cost efficiency.
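One of the tuning ideas above — mitigating key skew — can be illustrated without a Spark cluster. A hot key sends all of its rows to a single hash partition (one straggler task); "salting" the key with a small suffix spreads that load. This toy sketch uses a deterministic CRC32 hash as a stand-in for Spark's partitioner; the dataset and partition counts are invented:

```python
import zlib
from collections import Counter

N_PARTS = 8

def part(key):
    # Deterministic stand-in for a hash partitioner (Spark uses murmur3).
    return zlib.crc32(str(key).encode()) % N_PARTS

# Skewed dataset: one hot key ("BR") dominates.
keys = ["BR"] * 90 + ["FR", "DE", "JP", "IN", "US"] * 2

plain = Counter(part(k) for k in keys)
# All 90 "BR" rows hash identically, so they land in one partition.

SALTS = 8
salted = Counter(part((k, i % SALTS)) for i, k in enumerate(keys))
# Salting appends a round-robin suffix, splitting the hot key into
# 8 distinct salted keys that typically spread across partitions.

print(max(plain.values()), max(salted.values()))
```

The trade-off: salted aggregations need a second pass to re-combine the partial results per original key, which is why salting is reserved for genuinely skewed joins and group-bys.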

Skills you'll gain

Category: Cloud Deployment
Category: Data Storage Technologies
Category: Amazon S3
Category: Infrastructure Architecture
Category: PySpark
Category: Apache Spark
Category: Data Integrity
Category: Performance Tuning
Category: Cloud Security
Category: Infrastructure as Code (IaC)
Category: Cloud Storage
Category: Data Warehousing
Category: Transaction Processing
Category: Data Management
Category: Data Lakes
Category: Cloud Computing Architecture
Category: Data Security
Category: Data Storage
Category: Data Infrastructure
Category: Cloud Computing

What you'll learn

  • Design star schema data models with fact and dimension tables that enable intuitive self-service business intelligence reporting.

  • Apply third normal form normalization to optimize database structure while maintaining query performance through indexing strategies.

  • Use advanced SQL window functions to calculate rolling metrics, rankings, and time-series analytics for complex data analysis.

  • Implement database replication and incremental loading techniques to ensure high availability and efficient data warehouse updates.
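The first and third bullets can be sketched together: a tiny star schema (one fact table joined to one dimension) queried with a window function for a running total. This uses Python's built-in `sqlite3` (window functions require SQLite 3.25+, bundled with recent Python builds); table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A tiny star schema: one fact table keyed to one dimension table.
cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_date   TEXT,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES
    (1, '2025-01-01', 10), (1, '2025-01-02', 20),
    (1, '2025-01-03', 30), (2, '2025-01-01', 5);
""")

# Window function: running total of sales per category, ordered by date.
rows = cur.execute("""
    SELECT d.category, f.sale_date, f.amount,
           SUM(f.amount) OVER (
               PARTITION BY d.category ORDER BY f.sale_date
           ) AS running_total
    FROM fact_sales f
    JOIN dim_product d USING (product_key)
    ORDER BY d.category, f.sale_date
""").fetchall()

for r in rows:
    print(r)
```

The `PARTITION BY` clause restarts the running total per category, which is exactly the rolling-metric pattern the course applies to time-series analytics.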

Skills you'll gain

Category: Data Modeling
Category: Star Schema
Category: Performance Tuning
Category: Data Warehousing
Category: Business Intelligence
Category: Relational Databases
Category: Database Development
Category: Data Integration
Category: Data Pipelines
Category: Database Software
Category: Database Design
Category: Database Architecture and Administration
Category: Extract, Transform, Load
Category: SQL
Category: Data Quality

What you'll learn

  • Resolve merge conflicts and trace bugs using Git history tools, keeping collaborative codebases stable and production-ready.

  • Design branching strategies and automate deployments with CI/CD pipelines to safely promote data pipeline artifacts across environments.

  • Build and publish versioned Docker images and automate server configuration with Ansible for consistent, reproducible environments.

  • Analyze query execution metrics and optimize resource allocation to maintain performance targets in production data systems.
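The last bullet — analyzing query execution metrics — largely comes down to summarizing latency distributions and flagging outliers. A hypothetical sketch using only the standard library; the runtimes and the 3×-median outlier threshold are invented for illustration:

```python
import statistics

# Hypothetical per-query runtimes in seconds, e.g. parsed from scheduler logs.
runtimes = [1.2, 1.4, 1.1, 1.3, 9.8, 1.2, 1.5, 1.3, 1.4, 1.2]

mean = statistics.mean(runtimes)
p95 = statistics.quantiles(runtimes, n=20)[-1]   # 95th-percentile cut point
# Flag queries slower than 3x the median as candidates for investigation.
slow = [t for t in runtimes if t > 3 * statistics.median(runtimes)]

print(f"mean={mean:.2f}s p95={p95:.2f}s outliers={slow}")
```

Percentiles matter more than the mean here: one straggler query (the 9.8 s run) inflates the average while the median stays representative of typical performance.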

Skills you'll gain

Category: Continuous Deployment
Category: Continuous Integration
Category: Application Deployment
Category: Root Cause Analysis
Category: Data Infrastructure
Category: Development Environment
Category: Docker (Software)
Category: Git (Version Control System)
Category: DevOps
Category: Data Pipelines
Category: Ansible
Category: CI/CD
Category: Configuration Management
Category: Containerization
Category: Infrastructure as Code (IaC)
Category: Version Control
Category: Performance Tuning

What you'll learn

  • Define and automate data quality tests using YAML to validate row counts, null thresholds, and uniqueness across pipeline datasets.

  • Trace data anomalies through pipeline stages by analyzing logs and dashboards to identify and fix the exact source of failure.

  • Apply advanced Python debugging tools — including conditional breakpoints, watchpoints, and pdb — to diagnose and resolve pipeline issues.

  • Resolve complex concurrency bugs by reading stack traces and correlating thread logs to identify deadlocks and race conditions in code.
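In tools like dbt, the row-count, null-threshold, and uniqueness tests from the first bullet are declared in YAML. This pure-Python sketch mirrors the same three rules so the logic is visible; the function name, column names, and sample data are all invented:

```python
# Mirrors three common declarative data quality tests:
# minimum row count, key uniqueness, and a null-fraction threshold.

def run_quality_checks(rows, key, not_null_cols, min_rows, max_null_frac=0.0):
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row_count: {len(rows)} < {min_rows}")
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"uniqueness: duplicate values in '{key}'")
    for col in not_null_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) > max_null_frac:
            failures.append(f"not_null: '{col}' has {nulls} null(s)")
    return failures

data = [
    {"id": 1, "email": "a@x.io"},
    {"id": 2, "email": None},      # violates the not_null rule
    {"id": 2, "email": "c@x.io"},  # violates the uniqueness rule
]
print(run_quality_checks(data, key="id", not_null_cols=["email"], min_rows=2))
```

Running checks like these as a gated pipeline step — and failing the run when the list is non-empty — is what keeps bad data from propagating downstream.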

Skills you'll gain

Category: Performance Tuning
Category: Test Automation
Category: YAML
Category: Debugging
Category: Python Programming
Category: Data Validation
Category: Reliability
Category: Development Testing
Category: Anomaly Detection
Category: Generative AI
Category: Data Integrity
Category: Dashboard
Category: DevOps
Category: Data Quality
Category: Root Cause Analysis
Category: Data Pipelines

What you'll learn

  • Build a data engineering portfolio with end-to-end pipeline projects that prove your ability to design, build, and deploy production-style systems.

  • Create a resume, LinkedIn profile, and GitHub presence that position you as a hands-on data engineer ready to contribute from day one.

  • Practice real data engineering interview scenarios and develop structured responses to technical, design, and behavioral questions.

  • Execute a 30-day career launch plan covering portfolio completion, job applications, and networking in the data engineering community.

Skills you'll gain

Category: Software Development
Category: SQL
Category: Apache
Category: Python Programming
Category: Professional Networking
Category: Data Quality
Category: Portfolio Management
Category: Collaboration
Category: Communication
Category: Data Infrastructure
Category: Professional Development
Category: Apache Spark
Category: Data Pipelines
Category: Apache Airflow
Category: Interviewing Skills
Category: GitHub

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Professionals from the Industry
321 Courses · 45,807 learners

Offered by

Coursera

Why people choose Coursera for their career

Felipe M.

Learner since 2018
"To be able to take courses at my own pace and rhythm has been an amazing experience. I can learn whenever it fits my schedule and mood."

Jennifer J.

Learner since 2020
"I directly applied the concepts and skills I learned from my courses to an exciting new project at work."

Larry W.

Learner since 2021
"When I need courses on topics that my university doesn't offer, Coursera is one of the best places to go."

Chaitanya A.

"Learning isn't just about being better at your job: it's so much more than that. Coursera allows me to learn without limits."
