Building Multimodal Data Pipelines

Images, audio, and video make up a growing share of the data companies generate today, but most pipelines are still built for structured data alone. This course teaches you to build AI-powered pipelines that process multimodal data and turn it into LLM-ready text.
What you'll learn
- Extract structured, queryable data from unstructured images, audio, and video using OCR, ASR, and Vision Language Models.
- Build a VLM-backed pipeline that reasons across video frames to generate timestamped scene descriptions and track events over time.
- Build a multimodal RAG app on real-world data, turning raw images, audio, and video into a queryable interface with grounded, cited answers.
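The first outcome above, routing each media type through the right extractor to produce LLM-ready text, can be sketched in a few lines. This is a minimal illustration, not course material: the extractor functions are hypothetical stand-ins for real OCR, ASR, and VLM calls (for example Tesseract, Whisper, or a vision-language model API), and the file-extension routing table is an assumption for the sketch.

```python
from dataclasses import dataclass

@dataclass
class TextRecord:
    source: str    # original file path
    modality: str  # "image", "audio", or "video"
    text: str      # LLM-ready text extracted from the media

# Hypothetical stubs: in a real pipeline these would call an OCR engine,
# a speech-recognition model, and a vision-language model respectively.
def extract_image_text(path: str) -> str:
    return f"[OCR output for {path}]"

def extract_audio_text(path: str) -> str:
    return f"[ASR transcript of {path}]"

def extract_video_text(path: str) -> str:
    return f"[VLM scene descriptions for {path}]"

# Assumed routing table: file extension -> (modality, extractor).
EXTRACTORS = {
    ".png": ("image", extract_image_text),
    ".jpg": ("image", extract_image_text),
    ".wav": ("audio", extract_audio_text),
    ".mp3": ("audio", extract_audio_text),
    ".mp4": ("video", extract_video_text),
}

def process(paths: list[str]) -> list[TextRecord]:
    """Route each file to the extractor for its modality; skip unknown types."""
    records = []
    for path in paths:
        ext = path[path.rfind("."):].lower()
        if ext in EXTRACTORS:
            modality, extractor = EXTRACTORS[ext]
            records.append(TextRecord(path, modality, extractor(path)))
    return records

if __name__ == "__main__":
    for rec in process(["scan.png", "call.mp3", "demo.mp4"]):
        print(rec.modality, "->", rec.text)
```

The resulting `TextRecord` list is the kind of uniform text representation that downstream steps (indexing, RAG retrieval, LLM prompting) can consume regardless of the original modality.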
Details to know
April 2026
Only available on desktop
Learn, practice, and apply job-ready skills in less than 2 hours
- Receive training from industry experts
- Gain hands-on experience solving real-world job tasks

How you'll learn
Hands-on, project-based learning
Practice new skills by completing job-related tasks with step-by-step instructions.
No downloads or installation required
Access the tools and resources you need in a cloud environment.
Available only on desktop
This project is designed for laptops or desktop computers with a reliable Internet connection, not mobile devices.