When you enroll in this course, you'll also be enrolled in this Specialization.
Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate
There are 4 modules in this course
Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more—just like today’s leading AI models.
You'll begin by discovering how Nonlinear Support Vector Machines (NSVMs) and Fourier transforms lay the groundwork for signal processing and pattern recognition in visual data. You'll then build a strong foundation in probabilistic reasoning and temporal modeling with RNNs, enabling AI systems to understand sequences and context. Next, you'll learn how transformer architectures have revolutionized both language and vision tasks. Finally, you'll dive into multimodal learning with CLIP, which connects images and text, and explore diffusion models that generate high-fidelity images through iterative refinement.
This course is ideal for learners who want to go beyond traditional deep learning and explore the models shaping the future of AI. With a blend of theory, code, and real-world applications, you'll be equipped to tackle cutting-edge challenges in computer vision and multimodal AI.
This course can be taken for academic credit as part of CU Boulder’s MS in Data Science or MS in Computer Science degrees offered on the Coursera platform. These fully accredited graduate degrees offer targeted courses, short 8-week sessions, and pay-as-you-go tuition. Admission is based on performance in three preliminary courses, not academic history. CU degrees on Coursera are ideal for recent graduates or working professionals. Learn more:
MS in Data Science: https://www.coursera.org/degrees/master-of-science-data-science-boulder
MS in Computer Science: https://coursera.org/degrees/ms-computer-science-boulder
Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
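To make the move between the time and frequency domains concrete, here is a minimal NumPy sketch of a 1D DFT and its inverse. It is not taken from the course workbooks: the function names `dft` and `idft` and the 16-sample cosine test signal are assumptions chosen for illustration, and the naive matrix form is used for readability rather than speed.

```python
import numpy as np

def dft(x):
    """Naive O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = x.shape[0]
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)  # N x N matrix of basis functions
    return W @ x

def idft(X):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+2j*pi*k*n/N)."""
    X = np.asarray(X, dtype=complex)
    N = X.shape[0]
    k = np.arange(N)
    n = k.reshape(-1, 1)
    W = np.exp(2j * np.pi * n * k / N)
    return (W @ X) / N

# Hypothetical test signal: a cosine completing 3 cycles over 16 samples.
N = 16
t = np.arange(N)
signal = np.cos(2 * np.pi * 3 * t / N)

spectrum = dft(signal)            # time domain -> frequency domain
recovered = idft(spectrum)        # frequency domain -> time domain

# The strongest bin in the first half of the spectrum is the signal's frequency.
peak_bin = int(np.argmax(np.abs(spectrum[: N // 2])))
print(peak_bin)
```

In practice you would likely call `np.fft.fft` and `np.fft.ifft`, which compute the same transform much faster. The explicit matrix version only shows that each frequency coefficient is a projection of the signal onto one complex sinusoid.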
What's included
14 videos • 8 readings • 5 assignments
14 videos•Total 85 minutes
Meet Your Instructor •3 minutes
Linear SVM•11 minutes
Visualize Linear•8 minutes
Radial Basis Function (RBF)•6 minutes
RBF Kernel•4 minutes
Visualize a RBF SVM•10 minutes
1D DFT•6 minutes
1D Inverse DFT •7 minutes
1D Basic Functions•5 minutes
Frequency and Time•6 minutes
2D DFT•7 minutes
2D Inverse DFT•3 minutes
2D Basic Functions•5 minutes
Frequency and Spatial •4 minutes
8 readings•Total 50 minutes
Course Updates and Accessibility Support•1 minute
Earn Academic Credit for your Work!•10 minutes
Course Support•10 minutes
Inside the Course•5 minutes
Assessment Expectations•10 minutes
AI Citation and Acknowledgement•10 minutes
Get the Workbook: SVM•2 minutes
Get the Workbook: Fourier 1D & 2D•2 minutes
5 assignments•Total 80 minutes
Support Vector Machine (SVM)•15 minutes
Fourier 1D•15 minutes
Fourier 2D•15 minutes
AI Policy Quiz•5 minutes
SVM and Fourier•30 minutes
Probability and RNN
Module 2•4 hours to complete
Module details
This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
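As a small illustration of how the chain rule turns conditional probabilities into a joint probability, the sketch below scores a short token sequence. The conditional probabilities are invented for the example, and the model is simplified to a bigram (each token depends only on the previous one), which is an assumption and not the general chain rule, where each token conditions on everything before it.

```python
import math

# Hypothetical conditional probabilities P(token | previous token).
# "<s>" marks the start of the sequence. These numbers are made up.
bigram = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
}

def sequence_log_prob(tokens, cond_prob):
    """Sum log P(x_t | x_{t-1}) over the sequence (chain rule, bigram form)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(cond_prob[(prev, tok)])
        prev = tok
    return log_p

log_p = sequence_log_prob(["the", "cat", "sat"], bigram)
joint = math.exp(log_p)  # equals 0.5 * 0.2 * 0.4
print(round(joint, 6))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long sequences. The same factorization underlies autoregressive image models, with pixels taking the place of tokens.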
What's included
15 videos • 2 readings • 5 assignments
15 videos•Total 123 minutes
Probability in Language Models •10 minutes
Conditional Probabilities •9 minutes
The Chain Rule of Probabilities•11 minutes
Calculating Joint Probabilities •12 minutes
Pixel-Based Image Models•13 minutes
Autoregressive Image Model•16 minutes
Attention Mechanisms in Transformer Models•14 minutes
Batch vs Recurrent•4 minutes
MLP vs RNN•12 minutes
Many to One•4 minutes
One to Many•2 minutes
One to One•6 minutes
Sequence to Sequence•2 minutes
Deep RNN•5 minutes
Autoregressive RNN•3 minutes
2 readings•Total 4 minutes
Get the Workbook: Probability•2 minutes
Get the Workbook: RNN•2 minutes
5 assignments•Total 90 minutes
Probability Part One•15 minutes
Probability Part Two•15 minutes
RNN Part One•15 minutes
RNN Part Two•15 minutes
Probability and RNN•30 minutes
Transformer and ViT
Module 3•3 hours to complete
Module details
This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
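The self-attention step described above can be sketched in a few lines of NumPy. This is a single-head, unbatched version with randomly initialized projection matrices. The shapes (4 tokens, 8 dimensions) and the names `Wq`, `Wk`, `Wv` are assumptions for illustration, not the course's own code.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence, one head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise token similarities
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weighted mix of value vectors

# Hypothetical input: 4 tokens (or image patches), each an 8-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Every token attends to every other token in one matrix product, which is what lets the sequence be processed in parallel. In a ViT the rows of `X` would be patch embeddings rather than word embeddings.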
What's included
15 videos • 2 readings • 5 assignments
15 videos•Total 81 minutes
Batch vs Recurrent vs Attention•7 minutes
Attention + MLP•5 minutes
Dot-Product Self-Attention•4 minutes
QKV Self-Attention•4 minutes
Transformer Encoder•4 minutes
Self vs Cross Attention•5 minutes
Encoder and Decoder for Transformer•7 minutes
Decoder Output Layer•3 minutes
Image to Tokens•11 minutes
Normalization for ViT•4 minutes
Self-Attention for ViT•6 minutes
Multi-Head Attention•9 minutes
MLP Forward Feed•4 minutes
ViT Output Layer•5 minutes
Loss Gradient for ViT•4 minutes
2 readings•Total 4 minutes
Get the Workbook: Transformer•2 minutes
Get the Workbook: ViT•2 minutes
5 assignments•Total 90 minutes
Transformer Part One•15 minutes
Transformer Part Two•15 minutes
ViT Part One•15 minutes
ViT Part Two•15 minutes
Transformer and ViT•30 minutes
CLIP and Diffusion
Module 4•3 hours to complete
Module details
In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing labeled training data. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
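The zero-shot classification idea can be sketched as a cosine-similarity comparison followed by a softmax. The 3-dimensional embeddings below are hand-written stand-ins for the outputs of an image encoder and a text encoder. The prompts, the vectors, and the temperature value are all assumptions for illustration and do not come from a real CLIP model.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """Softmax over cosine similarities between one image and each text prompt."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    logits = text_embs @ image_emb / temperature  # one score per prompt
    logits = logits - logits.max()                # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical prompts and toy embeddings in a shared 3-D space.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.9, 0.2, 0.1])  # stand-in for an encoded cat image

probs = zero_shot_probs(image_emb, text_embs)
prediction = labels[int(np.argmax(probs))]
print(prediction)
```

No classifier is trained here: the class set is defined entirely by the text prompts, so swapping in new prompts changes the label space without any retraining.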
What's included
11 videos • 2 readings • 4 assignments
11 videos•Total 75 minutes
Batch of Pairs•6 minutes
Image Encoder (Batch)•6 minutes
Text Encoder (Batch)•10 minutes
Joint Embedding•5 minutes
Contrastive Pre-Training•13 minutes
Zero-Shot Image Classifier•6 minutes
Zero-Shot Image Prediction•7 minutes
Diffusion Introduction•5 minutes
Noise Prediction•6 minutes
Time Conditioning and Parallel Training•5 minutes
Reverse Diffusion•6 minutes
2 readings•Total 4 minutes
Get the Workbook: CLIP•2 minutes
Get the Workbook: Diffusion•2 minutes
4 assignments•Total 75 minutes
CLIP Part One•15 minutes
CLIP Part Two•15 minutes
Diffusion•15 minutes
CLIP and Diffusion•30 minutes
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Build toward a degree
This course is part of the following degree program(s) offered by University of Colorado Boulder. If you are admitted and enroll, your completed coursework may count toward your degree learning and your progress can transfer with you.¹
¹Successful application and enrollment are required. Eligibility requirements apply. Each institution determines the number of credits recognized by completing this content that may count towards degree requirements, considering any existing credits you may have. Click on a specific course for more information.
Instructor
Instructor ratings
We asked all learners to give feedback on our instructors based on the quality of their teaching style.
CU Boulder is a dynamic community of scholars and learners on one of the most spectacular college campuses in the country. As one of 34 U.S. public institutions in the prestigious Association of American Universities (AAU), we have a proud tradition of academic excellence, with five Nobel laureates and more than 50 members of prestigious academic academies.
When will I have access to the lectures and assignments?
To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade, but it also means that you will not be able to purchase a Certificate experience.
What will I get if I subscribe to this Specialization?
When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.
Is financial aid available?
Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.