Imagine deploying schema changes with confidence—knowing your pipeline will handle them gracefully, consumers will stay healthy, and your data will stay consistent. That's the difference between hoping your CDC pipeline works and knowing it will. In this course you will learn how to build a working, vendor‑neutral CDC pipeline and a single, unified table from evolving source schemas. Starting with Debezium streaming changes from Postgres/MySQL into Kafka, you will use Schema Registry to enforce compatibility, then apply streaming SQL in Flink (or ksqlDB) to map, cast, and merge divergent fields into a canonical model. Finally, you will persist results to an Apache Iceberg table and query it instantly with Trino. Along the way, you’ll learn practical strategies to manage schema drift, choose compatibility modes (backward/full), and avoid breaking downstream consumers. Everything runs locally with Docker so you can reproduce it anywhere and take the same patterns to your cloud stack later.
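As a taste of the first step, here is a minimal sketch of registering a Debezium Postgres connector through Kafka Connect's REST API. The host names, credentials, topic prefix, and table list are illustrative assumptions for a local Docker stack, not values from the course labs.

```python
# Minimal sketch: register a Debezium Postgres connector with Kafka Connect.
# All names, ports, and credentials below are assumptions for a local stack.
import requests

connector = {
    "name": "customers-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "topic.prefix": "dbserver1",
        "table.include.list": "public.customers",
        # Serialize with Avro so the Schema Registry can enforce compatibility.
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print("connector registered:", resp.json()["name"])
```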

Stream & Unify Data Schemas with CDC
This course is part of Real-Time, Real Fast: Kafka & Spark for Data Engineers Specialization


Instructor: Starweaver
What you'll learn
- Explain CDC fundamentals (binlog/WAL) and schema evolution strategies.
- Configure a Schema Registry pipeline locally using Debezium and Kafka.
- Use streaming SQL (Flink/ksqlDB) to map, cast, and merge divergent schemas into a canonical model.

Build your subject-matter expertise
- Learn new concepts from industry experts
- Gain a foundational understanding of a subject or tool
- Develop job-relevant skills with hands-on projects
- Earn a shareable career certificate

There are 3 modules in this course
Deploy a local Debezium, Kafka, Schema Registry, and Flink/ksqlDB stack to observe row-level changes in real time. Intentionally modify the source schema, then use streaming SQL to map, cast, and coalesce fields into a canonical table. Perform upserts using stable keys and verify the data is stored correctly in Iceberg. By the end, you will have an operational CDC loop and a unified, queryable dataset.
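For a flavor of the streaming-SQL step, the sketch below uses PyFlink to read a Debezium change topic and fold two generations of a column into one canonical, upsert-keyed table. Topic, column, and server names are assumptions, and the labs' Avro/Schema Registry setup would use the debezium-avro-confluent format rather than debezium-json.

```python
# Minimal sketch, assuming a local Flink cluster with the Kafka connector
# jars available. All topic, column, and server names are illustrative.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Debezium change stream: both the old column (name) and the new one
# (full_name) are declared so either schema generation can be read.
t_env.execute_sql("""
    CREATE TABLE customers_cdc (
        customer_id    BIGINT,
        full_name      STRING,
        name           STRING,
        loyalty_points STRING,
        email          STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dbserver1.public.customers',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

# Canonical sink keyed on the stable business key, so replays become upserts.
t_env.execute_sql("""
    CREATE TABLE customers_canonical (
        customer_id    BIGINT,
        display_name   STRING,
        loyalty_points INT,
        email          STRING,
        PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'customers_canonical',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

# Map, cast, and coalesce the divergent fields into the canonical shape.
t_env.execute_sql("""
    INSERT INTO customers_canonical
    SELECT customer_id,
           COALESCE(full_name, name)    AS display_name,
           CAST(loyalty_points AS INT)  AS loyalty_points,
           email
    FROM customers_cdc
""")
```

The stable primary key is what lets replayed or late-arriving changes land as idempotent upserts rather than duplicates.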
What's included
4 videos · 2 readings · 1 assignment
Learn to prevent consumer disruptions by enforcing compatibility at both the subject and global levels. We will deliberately deploy an incompatible schema, observe the failure, and proceed safely using defaults and transitive modes. Implement practical safeguards such as CI schema checks, dead-letter queues (DLQs), alerts, and lag probes so that issues are promptly identified and contained. The emphasis is on repeatable recovery, not heroics.
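As an illustration of such a safeguard, the sketch below calls the Schema Registry REST API to pin a subject to BACKWARD_TRANSITIVE compatibility and to test a candidate schema before it is deployed. The registry URL, subject name, and schema are assumptions, not course materials.

```python
# Minimal sketch of a pre-deploy compatibility gate against a local
# Schema Registry. URL and subject name are assumptions.
import json
import requests

REGISTRY = "http://localhost:8081"
SUBJECT = "dbserver1.public.customers-value"   # hypothetical subject
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

# Pin this subject to a stricter mode than the global default.
requests.put(
    f"{REGISTRY}/config/{SUBJECT}",
    headers=HEADERS,
    json={"compatibility": "BACKWARD_TRANSITIVE"},
).raise_for_status()

# Candidate schema that adds a required field with no default -- the kind of
# change a BACKWARD check rejects, because new readers cannot fill the field
# when decoding old records.
candidate = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "customer_id", "type": "long"},
        {"name": "loyalty_tier", "type": "string"},
    ],
}

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers=HEADERS,
    json={"schema": json.dumps(candidate)},
)
resp.raise_for_status()
print("compatible:", resp.json().get("is_compatible"))
```

Running the same check in CI before a connector or producer deploy is what turns a registry setting into the fail-fast gate this module describes.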
What's included
3 videos · 1 reading · 1 assignment
Develop a robust canonical model covering naming conventions, data types and units, nullability, and soft deletes, and store it in Iceberg on MinIO using streaming upserts. Query it immediately with Trino and use time travel to validate results or debug regressions. The project builds a denormalized “latest per customer” view for analytics and discusses partitioning strategies, equality deletes, and compaction. You will leave with patterns that scale from a laptop to a cloud deployment.
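The sketch below gives the flavor of those validation queries using the Trino Python client: a “latest per customer” projection and a time-travel read against an earlier Iceberg snapshot. Connection details, table, and column names (including updated_at) are assumptions for a local stack.

```python
# Minimal sketch of validation queries from Trino against the canonical
# Iceberg table. Connection and schema details are assumptions.
import trino

conn = trino.dbapi.connect(
    host="localhost", port=8080, user="lab",
    catalog="iceberg", schema="cdc",
)
cur = conn.cursor()

# "Latest per customer" view: keep only the newest row for each stable key.
cur.execute("""
    SELECT customer_id, display_name, email
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY customer_id ORDER BY updated_at DESC
               ) AS rn
        FROM customers_canonical
    ) ranked
    WHERE rn = 1
""")
print(cur.fetchall())

# Time travel: list Iceberg snapshots, then read the table as of the earliest
# one to compare against the current state when debugging a regression.
cur.execute(
    'SELECT snapshot_id, committed_at FROM "customers_canonical$snapshots" ORDER BY committed_at'
)
snapshots = cur.fetchall()
if snapshots:
    first_snapshot_id = snapshots[0][0]
    cur.execute(
        f"SELECT count(*) FROM customers_canonical FOR VERSION AS OF {first_snapshot_id}"
    )
    print("rows at first snapshot:", cur.fetchone()[0])
```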
What's included
4 videos · 1 reading · 3 assignments
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
¹ Some assignments in this course are AI-graded. For these assignments, your data will be used in accordance with Coursera's Privacy Notice.

