This course bridges the gap between raw data and production-ready AI systems. In 2026, the value of a machine learning model is defined by the reliability of the data pipelines that feed it. This program transforms you into an MLOps-ready engineer capable of building automated, scalable, and observable data architectures.

Data Engineering Essentials
This course is part of the Hands-On MLOps Fundamentals for ML Engineers Specialization

Instructor: Mumshad Mannambeth
What you'll learn
Build scalable data pipelines using Pandas, Polars, and Apache Spark for diverse dataset sizes
Architect real-time streaming solutions with Apache Kafka and feature stores for live ML inference
Automate complex ML workflows using Airflow and Prefect to ensure reliable continuous training
Details to know
- Shareable certificate: add to your LinkedIn profile
- 4 assignments
- March 2026
Build your subject-matter expertise
- Learn new concepts from industry experts
- Gain a foundational understanding of a subject or tool
- Develop job-relevant skills with hands-on projects
- Earn a shareable career certificate

There are 4 modules in this course
Explore the foundational shift from traditional software development to data-centric machine learning operations. You will compare DevOps and MLOps workflows while mastering the core pillars of continuous integration (CI), continuous delivery (CD), continuous training (CT), and continuous monitoring (CM). This section establishes the architectural blueprint for building reliable and automated machine learning systems.
What's included
10 videos, 3 readings, 1 assignment
Master the essential techniques for collecting and preparing high-quality data for machine learning models. You will implement robust ETL processes and explore the strategic role of Data Lakes in modern ML stacks. Hands-on labs with Pandas and Polars will provide practical experience in transforming raw datasets into clean features.
What's included
7 videos, 2 readings, 1 assignment
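
To illustrate the kind of transformation practiced in these labs, here is a minimal ETL-style sketch in both Pandas and Polars: it loads a raw export, drops incomplete rows, and aggregates a simple per-user feature. The file name, column names, and filtering rule are illustrative assumptions, not course materials.

```python
import pandas as pd
import polars as pl

# Pandas: load a (hypothetical) raw export, drop incomplete rows,
# filter out bad records, and aggregate per-user spend as a feature.
events_pd = pd.read_csv("raw_events.csv")
events_pd = events_pd.dropna(subset=["user_id", "amount"])
events_pd = events_pd[events_pd["amount"] >= 0]
features_pd = (
    events_pd.groupby("user_id", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_spend"})
)

# Polars: the same pipeline expressed lazily, which lets the engine
# optimise the query plan and stream larger-than-memory files.
features_pl = (
    pl.scan_csv("raw_events.csv")
    .drop_nulls(subset=["user_id", "amount"])
    .filter(pl.col("amount") >= 0)
    .group_by("user_id")                      # spelled .groupby() in older Polars releases
    .agg(pl.col("amount").sum().alias("total_spend"))
    .collect()
)
```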
Scale your engineering capabilities to handle massive datasets and real-time information flows. This module introduces distributed computing with Apache Spark and Dask alongside high-velocity streaming via Apache Kafka. You will also evaluate the critical role of Feature Stores in maintaining consistency between training and serving.
What's included
7 videos, 1 reading, 1 assignment
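
For a flavour of the distributed-processing side of this module, the sketch below uses PySpark to compute a per-user, per-day aggregate over a Parquet dataset in a data lake. The paths and column names are assumptions made for the example; the Kafka streaming and Feature Store topics are covered separately in the module.

```python
from pyspark.sql import SparkSession, functions as F

# Start (or reuse) a Spark session; in production this would target a
# cluster rather than local[*].
spark = (
    SparkSession.builder.appName("feature-aggregation")
    .master("local[*]")
    .getOrCreate()
)

# "s3://lake/events/" and the column names are illustrative placeholders.
events = spark.read.parquet("s3://lake/events/")

# Distributed group-by: Spark shuffles rows by key and aggregates each partition.
daily_spend = (
    events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("user_id", "event_date")
    .agg(
        F.sum("amount").alias("daily_spend"),
        F.count("*").alias("event_count"),
    )
)

# Write the features back to the lake, partitioned by date for downstream training jobs.
daily_spend.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://lake/features/daily_spend/"
)
```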
Connect individual data tasks into a seamless and automated production pipeline using Airflow and Prefect. You will learn to manage complex dependencies and schedule automated training triggers to ensure model performance over time. This section focuses on making your data workflows resilient through advanced monitoring and error handling.
What's included
4 videos, 2 readings, 1 assignment
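
To give a sense of how individual tasks become a scheduled pipeline, here is a minimal Airflow DAG sketch chaining hypothetical extract, transform, and train steps on a daily schedule. The task names and callables are stand-ins rather than course code, and Prefect expresses the same idea with flows and tasks.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for real pipeline steps.
def extract():
    print("pulling raw data from the source system")

def transform():
    print("cleaning data and computing features")

def train():
    print("retraining the model on the fresh features")

# One run per day; catchup=False skips backfilling past dates.
with DAG(
    dag_id="daily_training_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",          # spelled schedule_interval in Airflow < 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    train_task = PythonOperator(task_id="train", python_callable=train)

    # Dependencies: extract runs before transform, transform before train.
    extract_task >> transform_task >> train_task
```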
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.