Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more—just like today’s leading AI models.



Modern AI Models for Vision and Multimodal Understanding
This course is part of the Computer Vision Specialization

Instructor: Tom Yeh
What you'll learn
- Apply Nonlinear Support Vector Machines (NSVMs) and Fourier transforms to analyze and process visual data. 
- Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data. 
- Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks. 
- Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images. 
Details to know
18 assignments
August 2025
There are 4 modules in this course
Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
What's included
14 videos, 7 readings, 4 assignments
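The round trip between the time domain and the frequency domain described in the first module can be sketched with NumPy's FFT routines. This is a minimal illustration rather than course material: the signal, its length, and the two frequencies are hypothetical choices made so the peaks are easy to spot.

```python
import numpy as np

# Hypothetical test signal: 64 samples containing a 5-cycle sine
# and a weaker 12-cycle cosine.
n = 64
t = np.arange(n)
signal = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.cos(2 * np.pi * 12 * t / n)

# Forward DFT: time domain -> frequency domain.
spectrum = np.fft.fft(signal)

# Magnitudes of the non-negative frequency bins (0 .. n/2).
magnitudes = np.abs(spectrum[: n // 2 + 1])

# The two strongest bins, strongest first.
dominant_bins = np.argsort(magnitudes)[-2:][::-1]

# Inverse DFT: frequency domain -> time domain.
reconstructed = np.fft.ifft(spectrum).real

print("strongest frequency bins:", dominant_bins.tolist())
print("max reconstruction error:", np.max(np.abs(signal - reconstructed)))
```

The strongest bins line up with the two frequencies that were mixed into the signal, and the inverse transform recovers the original samples up to floating-point error. The same idea extends to images through np.fft.fft2 and np.fft.ifft2.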
This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
What's included
15 videos, 2 readings, 5 assignments
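The hidden state and feedback loop that the RNN module refers to can be sketched as a forward pass of a vanilla (Elman) RNN. This is an illustrative sketch under assumptions: the function name rnn_forward, the tanh activation, the layer sizes, and the random weights are all hypothetical and not taken from the course.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    hidden = np.zeros(w_hh.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:
        # Feedback loop: the new state depends on the current input
        # and on the previous hidden state.
        hidden = np.tanh(w_xh @ x_t + w_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 3, 4  # hypothetical sizes
inputs = rng.normal(size=(seq_len, input_dim))
w_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
w_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(inputs, w_xh, w_hh, b_h)
print(states.shape)  # (6, 4)
```

Unlike a feedforward layer, the same weights are reused at every time step. Gradients flowing back through many steps are multiplied repeatedly by terms involving w_hh and the tanh derivative, which is one way to see where vanishing gradients come from.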
This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
What's included
15 videos, 2 readings, 5 assignments
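Image tokenization and self-attention, as outlined in the Transformer module, can be sketched in a few lines of NumPy. This is a simplified sketch and not a full ViT: it uses a single attention head, omits positional encodings, normalization, and the class token, and the helper names (patchify, self_attention), image size, and weight shapes are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (num_tokens, dim)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                    # hypothetical 8x8 RGB image
patches = patchify(image, patch=4)               # 4 patches of 4*4*3 = 48 values
w_embed = rng.normal(scale=0.1, size=(48, 16))   # stand-in for the learned projection
tokens = patches @ w_embed                       # positional encodings omitted

w_q, w_k, w_v = (rng.normal(scale=0.1, size=(16, 16)) for _ in range(3))
output, weights = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (4, 16) (4, 4)
```

Each row of weights sums to 1 and says how much one patch attends to every other patch. Because all rows are computed in a single matrix product, the tokens are processed in parallel rather than one step at a time.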
In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing labeled training data for the target task. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
What's included
11 videos, 2 readings, 4 assignments
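The zero-shot classification step described for CLIP comes down to comparing one image embedding against several text embeddings in the shared space. The sketch below illustrates only that comparison: the three-dimensional embeddings are made-up toy vectors rather than outputs of a real CLIP encoder, and the function name zero_shot_probs and the temperature value of 0.07 are assumptions.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """Turn cosine similarities between one image and N prompts into probabilities."""
    # Normalize so that dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy embeddings standing in for encoder outputs (hypothetical values).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
])
image_emb = np.array([0.8, 0.3, 0.1])

probs = zero_shot_probs(image_emb, text_embs)
prediction = labels[int(np.argmax(probs))]
print(prediction)  # a photo of a cat
```

The class names enter only as text prompts, so swapping in a new label set requires no retraining, just new text embeddings.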
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Build toward a degree
This course is part of the following degree program(s) offered by University of Colorado Boulder. If you are admitted and enroll, your completed coursework may count toward your degree, and your progress can transfer with you.