Multimodal Intelligence - Vision, Audio & Language in Action Professional Certificate

Build and Deploy Multimodal AI Systems.

Design, train, evaluate, and deploy multimodal AI systems that process text, images, and audio.

Instructor: Professionals from the Industry

5 course series

Earn a career credential that demonstrates your expertise

Intermediate level

Recommended experience

4 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Design end-to-end multimodal AI architectures that integrate image, audio, and text data streams into scalable production pipelines.
Fine-tune transformer-based multimodal models using transfer learning and evaluate performance with cross-modal and ethical AI metrics.
Build automated ETL pipelines and unified data schemas to ingest, validate, and store multimodal features for model training and inference.
Deploy versioned, secured, and documented inference APIs on containerized Kubernetes infrastructure with real-time performance optimization.

Details to know

Shareable certificate

Taught in English

Advance your career with in-demand skills

Receive professional-level training from Coursera
Demonstrate your technical proficiency
Earn an employer-recognized certificate from Coursera

Professional Certificate - 5 course series

This program gives you the practical multimodal AI skills employers look for in today's machine learning and applied AI teams. You will learn how to process and augment image, audio, and text data; fine-tune transformer-based models using transfer learning; build automated ETL pipelines and unified data schemas; and deploy inference services on containerized cloud infrastructure. Each course builds directly on the last, moving you from data preparation and model training through evaluation, optimization, and production deployment.

Throughout the program, you will work with realistic engineering scenarios and professional ML workflows. You will write preprocessing pipelines for multiple data types, fine-tune pre-trained multimodal models in PyTorch, and diagnose training failures using gradient analysis. You will also evaluate model fairness with bias audits and SHAP interpretability reports, build cross-modal retrieval systems using FAISS, and deploy versioned REST APIs secured with OAuth2 and monitored with Prometheus. All of this work takes place in a containerized Kubernetes environment managed through CI/CD pipelines.

By the time you complete this program, you will have a portfolio of working, production-oriented code that demonstrates your ability to handle the core responsibilities of an ML engineer, multimodal AI practitioner, or MLOps specialist. Intermediate Python and foundational machine learning experience are recommended to get the most from this program.

Applied Learning Project

Each course culminates in a hands-on project where you build and connect real components of a multimodal AI pipeline: writing preprocessing scripts, fine-tuning models, configuring ETL workflows, securing inference APIs, and deploying containerized services on cloud GPU infrastructure. These projects reflect the kinds of challenges you will face as an ML engineer or AI practitioner and give you a portfolio of working, production-oriented code to show employers.

Solution Architecture and Ethical AI Design

Course 1, 4 hours

What you'll learn

Design end-to-end multimodal AI architectures that integrate image, audio, and text pipelines into scalable, production-ready systems.
Evaluate multimodal model performance using cross-modal metrics including FID, CLIP scores, recall@k, and Visual Question Answering accuracy.
Apply ethical AI frameworks to assess model bias using demographic parity and equalized odds across sensitive population subgroups.
Generate model interpretability reports using LIME and SHAP to explain AI predictions and communicate findings to technical stakeholders.
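As an illustration of the bias metrics named above, the sketch below computes a demographic parity difference and an equalized odds difference for binary predictions. It is a minimal standard-library example, not course material: the function names and the toy data are assumptions made for illustration, and the "largest gap between groups" definition is one common convention among several.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true-positive rate, and false-positive rate."""
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0}
    )
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred
    return {
        group: {
            "selection_rate": c["pred_pos"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else 0.0,
            "fpr": c["fp"] / c["neg"] if c["neg"] else 0.0,
        }
        for group, c in counts.items()
    }


def _gap(rates, key):
    """Largest difference in one rate across all groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


def demographic_parity_difference(y_true, y_pred, groups):
    """Gap in the share of positive predictions between groups."""
    return _gap(group_rates(y_true, y_pred, groups), "selection_rate")


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the true-positive-rate gap and the false-positive-rate gap."""
    rates = group_rates(y_true, y_pred, groups)
    return max(_gap(rates, "tpr"), _gap(rates, "fpr"))


# Toy data (hypothetical): labels, predictions, and a sensitive attribute.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(demographic_parity_difference(y_true, y_pred, groups))
print(equalized_odds_difference(y_true, y_pred, groups))
```

A value of 0 on either metric means the groups are treated identically by that criterion; larger values indicate a wider gap between subgroups.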

Skills you'll gain

Solution Architecture
Responsible AI
Technical Documentation
Model Evaluation
Natural Language Processing
AI Integrations
Artificial Intelligence and Machine Learning (AI/ML)
Image Quality
Data Science
Machine Learning
Enterprise Architecture
Data Ethics
Generative Model Architectures
Systems Architecture
Algorithms
Computer Science
AI Orchestration
Scalability
Solution Design
Software Documentation

End-to-End Multimodal AI: Fine-Tuning, Fusion, and MLOps

Course 2, 17 hours

What you'll learn

Fine-tune transformer-based multimodal models using transfer learning in PyTorch and TensorFlow.
Build cross-modal retrieval systems using FAISS and attention-based fusion of visual and text embeddings.
Automate ML pipelines with drift monitoring, hyperparameter tuning, and retraining using MLflow and Ray Tune.
Design and document versioned multimodal inference APIs with FastAPI, OAuth2, and OpenAPI specifications.
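To make the idea of cross-modal retrieval concrete, the sketch below ranks image embeddings against text-query embeddings by cosine similarity and scores the result with recall@k. It is a brute-force stand-in written with the standard library only. It does not use FAISS, and the function names, the tiny three-dimensional embeddings, and the one-relevant-item-per-query setup are all assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, index, k):
    """Indices of the k index vectors most similar to the query (cosine)."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(queries, index, relevant, k):
    """Fraction of queries whose single relevant item appears in the top k."""
    hits = sum(
        1 for query, rel in zip(queries, relevant) if rel in top_k(query, index, k)
    )
    return hits / len(queries)


# Hypothetical embeddings: three images and three text queries in a shared space.
image_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_queries = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.6, 0.7, 0.0]]
relevant_image = [0, 2, 0]  # index of the matching image for each query

print(recall_at_k(text_queries, image_embeddings, relevant_image, k=1))
print(recall_at_k(text_queries, image_embeddings, relevant_image, k=2))
```

An exhaustive scan like this is only practical for small collections; a vector index library such as FAISS serves the same role at larger scale.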

Skills you'll gain

MLOps (Machine Learning Operations)
Model Optimization
API Design
Fine-tuning
Transfer Learning
Model Training
Vision Transformer (ViT)
Machine Learning Algorithms
Data Architecture
OAuth
Machine Learning Software
Model Evaluation
Solution Architecture
Restful API
Application Programming Interface (API)
Data Science
Artificial Intelligence and Machine Learning (AI/ML)
Technical Communication
Model Deployment
Machine Learning

Preparing Multimodal Data: Vision, Audio, and NLP Pipelines

Course 3, 11 hours

What you'll learn

Preprocess images and video using normalization, color-space conversion, and motion extraction techniques.
Build audio feature extraction and augmentation pipelines using MFCCs and spectral transforms.
Fine-tune transformer models and construct text preprocessing pipelines for NLP applications.
Evaluate and debug multimodal AI models using automatic metrics and human-in-the-loop frameworks.
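The image preprocessing steps mentioned above, color-space conversion and normalization, can be sketched in a few lines. This example uses nested Python lists in place of an array library so that it runs anywhere. The function names and the mean and standard deviation values are illustrative assumptions, and the grayscale weights are the standard ITU-R BT.601 luma coefficients.

```python
def rgb_to_gray(pixel):
    """Convert one RGB pixel to a luma value using ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_grayscale(image):
    """Convert an image (rows of RGB tuples) to rows of grayscale values."""
    return [[rgb_to_gray(px) for px in row] for row in image]


def normalize_channels(image, mean, std):
    """Scale 8-bit RGB values to [0, 1], then standardize each channel."""
    return [
        [
            tuple((c / 255.0 - m) / s for c, m, s in zip(px, mean, std))
            for px in row
        ]
        for row in image
    ]


# A hypothetical 1x2 image; the mean and std values are placeholders, not
# statistics from any particular dataset.
image = [[(255, 0, 51), (0, 0, 0)]]
print(to_grayscale(image))
print(normalize_channels(image, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)))
```

In practice the per-channel mean and standard deviation are usually computed from the training set or taken from the pre-trained model's documentation, so that inputs match what the model saw during training.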

Skills you'll gain

Data Preprocessing
Computer Vision
Data Transformation
Model Training
Feature Engineering
Natural Language Processing
Data Pipelines
Model Evaluation
Image Quality
Image Analysis
Data Architecture
Artificial Neural Networks
Large Language Modeling
Artificial Intelligence and Machine Learning (AI/ML)
Machine Learning Methods
Hugging Face
Data Processing
Machine Learning Algorithms
Machine Learning Software
Fine-tuning

Production-Ready Multimodal ML Engineering

Course 4, 12 hours

What you'll learn

Design a multimodal feature store and build automated ETL pipelines using BigQuery and Airflow.
Write test-driven ML training code and validate multimodal datasets for production readiness.
Optimize model inference with TensorRT and manage ML codebases using GitFlow and CI/CD tools.
Deploy GPU-accelerated services on Kubernetes and tune autoscaling for real-time performance.
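Validating a multimodal dataset before training can be as simple as checking every record against a small set of rules. The sketch below shows one possible shape for such a check. The `Sample` fields, the `validate` function, and the three rules it enforces are all hypothetical and chosen for illustration, not taken from the course.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One multimodal training record (field names are illustrative)."""
    sample_id: str
    image_path: Optional[str] = None
    audio_path: Optional[str] = None
    text: Optional[str] = None
    label: Optional[int] = None


def validate(samples: List[Sample], num_classes: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    seen_ids = set()
    for s in samples:
        # Rule 1: identifiers must be unique.
        if s.sample_id in seen_ids:
            errors.append(f"{s.sample_id}: duplicate id")
        seen_ids.add(s.sample_id)
        # Rule 2: at least one modality must be present.
        if not (s.image_path or s.audio_path or s.text):
            errors.append(f"{s.sample_id}: no modality present")
        # Rule 3: the label must exist and fall inside the class range.
        if s.label is None or not 0 <= s.label < num_classes:
            errors.append(f"{s.sample_id}: label {s.label} outside [0, {num_classes})")
    return errors


samples = [
    Sample("a", text="a dog barking", label=0),
    Sample("a", image_path="x.png", label=5),
    Sample("b", label=1),
]
for problem in validate(samples, num_classes=3):
    print(problem)
```

Returning a list of problems instead of raising on the first failure lets a pipeline report every bad record in one pass, which fits a test-driven workflow where the validator itself is covered by unit tests.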

Skills you'll gain

Data Pipelines
Containerization
Data Validation
Extract, Transform, Load
Apache Airflow
Model Training
Test Driven Development (TDD)
Kubernetes
Artificial Intelligence and Machine Learning (AI/ML)
Artificial Intelligence
Machine Learning Algorithms
Data Infrastructure
Machine Learning Software
Natural Language Processing
Model Deployment
MLOps (Machine Learning Operations)
Algorithms
Data Collection
Model Optimization
Artificial Neural Networks

Career Development for Multimodal Intelligence

Course 5, 2 hours

What you'll learn

Build multimodal AI systems that integrate vision, audio, and language using cross-attention fusion and transformer architectures.
Deploy production-ready multimodal models with optimized inference pipelines, containerization, and automated MLOps workflows.
Architect cross-modal retrieval and fusion systems using contrastive learning and embedding alignment for real-world applications.
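Contrastive learning for embedding alignment is often implemented as a symmetric cross-entropy over a matrix of image-text similarities. The sketch below shows that idea with the standard library only. It assumes that row i of each input is a matching image-text pair, that no embedding is the zero vector, and that a temperature of 0.07 is a reasonable default; the function names are illustrative and this is not presented as the method the course uses.

```python
import math


def _unit(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows, with the match on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over paired image and text embeddings."""
    images = [_unit(v) for v in image_emb]
    texts = [_unit(v) for v in text_emb]
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = _diagonal_cross_entropy(logits)
    text_to_image = _diagonal_cross_entropy(transposed)
    return 0.5 * (image_to_text + text_to_image)


# Hypothetical 2-D embeddings: aligned pairs versus swapped (mismatched) pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each image is most similar to its own caption and large when pairs are mismatched, which is the signal that pulls the two embedding spaces into alignment during training.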