Unify Modalities: Cross-Modal Retrieval

This course is part of multiple programs.

Instructor: Hurix Digital

2 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

2 hours to complete

Flexible schedule

Learn at your own pace

What you'll learn

Cross-modal retrieval aligns vector spaces to bridge semantic gaps between text, images, and other data types.
Approximate nearest-neighbor (ANN) tools like FAISS enable fast similarity search across millions of embeddings at production scale.
Attention mechanisms fuse visual and textual features by learning contextual relationships across multiple representations.
Multimodal systems balance accuracy, speed, and memory through careful index choice and parameter tuning.
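As a rough illustration of the first point, the sketch below retrieves images for text queries by cosine similarity in a shared embedding space. It assumes both modalities have already been embedded into the same 64-dimensional space; the embeddings are random toy data, and the names retrieve, image_embeddings, and text_embeddings are hypothetical, not taken from the course materials.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def retrieve(query_vecs, gallery_vecs, k=3):
    # Exact (brute-force) search: score every query against every gallery item.
    scores = l2_normalize(query_vecs) @ l2_normalize(gallery_vecs).T
    top = np.argsort(-scores, axis=1)[:, :k]
    return top, np.take_along_axis(scores, top, axis=1)

rng = np.random.default_rng(0)
# Assumption: toy "image" embeddings standing in for the output of a real encoder.
image_embeddings = rng.normal(size=(1000, 64))
# Assumption: toy "text" embeddings, built as noisy copies of images 10, 20, and 30
# to mimic captions that were aligned with those images.
text_embeddings = image_embeddings[[10, 20, 30]] + 0.1 * rng.normal(size=(3, 64))

top_idx, top_scores = retrieve(text_embeddings, image_embeddings, k=3)
print(top_idx[:, 0])  # best-matching image index for each text query
```

Brute-force scoring like this is exact but scales linearly with the gallery size, which is the cost that ANN indexes trade against accuracy and memory.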

Tools you'll learn

Vector Databases

Details to know

Shareable certificate

Assessments

4 assignments (AI graded)

Taught in English

Build your subject-matter expertise

This course is available as part of

When you enroll in this course, you'll also be asked to select a specific program.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 2 modules in this course

Transform how AI systems understand and connect different data modalities. This course prepares machine learning professionals to build cross-modal retrieval systems that bridge text and images. You'll implement approximate nearest-neighbor (ANN) search and design attention mechanisms that fuse visual and textual information. Through hands-on work with production-scale tools like FAISS and real datasets like Flickr30K, you'll learn to build systems that understand content across modalities, with applications in search, recommendation, and content understanding.

Learners will build foundational understanding of cross-modal retrieval systems and implement approximate nearest-neighbor search algorithms using FAISS for production-scale similarity search across multimodal embeddings.
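To give a feel for what approximate nearest-neighbor search does, here is a minimal NumPy sketch of an inverted-file (IVF) style index: vectors are grouped around k-means centroids, and a query scans only the nprobe closest groups. This is not the FAISS API and not the course's lab code; the function names, the nlist and nprobe values, and the random data are all assumptions chosen for illustration.

```python
import numpy as np

def sq_dists(a, b):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    return (a ** 2).sum(1)[:, None] - 2 * a @ b.T + (b ** 2).sum(1)[None, :]

def build_ivf(vectors, nlist, n_iter=10, seed=0):
    # Coarse quantizer: a few rounds of plain k-means to find nlist centroids.
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(sq_dists(vectors, centroids), axis=1)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Inverted lists: for each centroid, the ids of the vectors assigned to it.
    assign = np.argmin(sq_dists(vectors, centroids), axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists

def search_ivf(query, vectors, centroids, lists, k, nprobe):
    # Scan only the nprobe lists whose centroids are closest to the query.
    probe = np.argsort(sq_dists(query[None, :], centroids)[0])[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = sq_dists(query[None, :], vectors[candidates])[0]
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
database = rng.normal(size=(2000, 32))   # assumption: toy multimodal embeddings
query = rng.normal(size=32)
centroids, lists = build_ivf(database, nlist=16)

exact = np.argsort(sq_dists(query[None, :], database)[0])[:5]
approx_full = search_ivf(query, database, centroids, lists, k=5, nprobe=16)
approx_fast = search_ivf(query, database, centroids, lists, k=5, nprobe=2)
print(len(set(exact) & set(approx_fast)), "of 5 exact neighbors found with nprobe=2")
```

Probing every list reproduces the exact result, while a small nprobe scans far fewer vectors and may miss some true neighbors. That is the accuracy versus speed trade-off that index choice and parameter tuning control.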

What's included

1 video, 2 readings, 1 assignment, 1 ungraded lab

1 video (Total 7 minutes)

Fundamentals of Cross-Modal Retrieval Systems (7 minutes)

2 readings (Total 18 minutes)

FAISS Architecture and Index Types for Production Systems (10 minutes)
Implementing FAISS Indexing for Cross-Modal Search (8 minutes)

1 assignment (Total 3 minutes)

Cross-Modal Retrieval and FAISS Implementation Assessment (3 minutes)

1 ungraded lab (Total 15 minutes)

Building Production-Scale Cross-Modal Retrieval with FAISS (15 minutes)

Learners will design and implement sophisticated attention-based fusion algorithms that intelligently combine visual and textual embeddings, mastering the creation of multimodal neural architectures for advanced cross-modal AI applications.
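One common form of attention-based fusion is cross-attention, where each text token attends over image region features and the attended visual context is combined with the text representation. The NumPy sketch below shows a single scaled dot-product cross-attention step under stated assumptions: the projection matrices are random and untrained, the feature sizes are arbitrary, and cross_attention_fuse is a hypothetical name rather than the architecture taught in the course.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_tokens, image_regions, w_q, w_k, w_v):
    q = text_tokens @ w_q       # (T, d): queries come from the text side
    k = image_regions @ w_k     # (R, d): keys come from the image side
    v = image_regions @ w_v     # (R, d): values come from the image side
    # Each row holds one text token's attention distribution over image regions.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (T, R)
    attended = weights @ v      # (T, d): visual context per text token
    # Fuse by concatenating text features with their visual context,
    # then mean-pool over tokens to get a single multimodal vector.
    fused = np.concatenate([text_tokens, attended], axis=-1).mean(axis=0)
    return fused, weights

rng = np.random.default_rng(0)
d_text, d_image, d_attn = 48, 64, 32         # assumption: arbitrary feature sizes
text_tokens = rng.normal(size=(4, d_text))   # 4 toy text token embeddings
image_regions = rng.normal(size=(6, d_image))  # 6 toy image region embeddings
# Assumption: untrained random projections; a real model would learn these.
w_q = rng.normal(size=(d_text, d_attn)) / np.sqrt(d_text)
w_k = rng.normal(size=(d_image, d_attn)) / np.sqrt(d_image)
w_v = rng.normal(size=(d_image, d_attn)) / np.sqrt(d_image)

fused, weights = cross_attention_fuse(text_tokens, image_regions, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Concatenation followed by mean pooling is only one of several ways to combine the two streams; gating or a further learned projection are frequent alternatives.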