Introduction to Multimodal AI with Hugging Face

Obtenez l'une de nos meilleures offres avec Coursera Plus pour 199 $ (habituellement 399 $). Économisez maintenant.

Ce cours n'est pas disponible en Français (France)

Nous sommes actuellement en train de le traduire dans plus de langues.

Introduction to Multimodal AI with Hugging Face

Instructeur : Hugging Face

Inclus avec

Demander à Coursera

4 modules

Obtenez un aperçu d'un sujet et apprenez les principes fondamentaux.

niveau Intermédiaire

Expérience recommandée

6 heures à compléter

Planning flexible

Apprenez à votre propre rythme

4 modules

Obtenez un aperçu d'un sujet et apprenez les principes fondamentaux.

niveau Intermédiaire

Expérience recommandée

6 heures à compléter

Planning flexible

Apprenez à votre propre rythme

Ce que vous apprendrez

Use vision-language models for image understanding and document extraction.
Build audio transcription, image generation, and agentic VLM/MCP workflows.
Apply multimodal safety filtering for responsible AI deployment.

Compétences que vous acquerrez

Catégorie : Image Analysis
Catégorie : AI Security
Catégorie : Agentic systems
Catégorie : Generative AI Agents
Catégorie : Responsible AI
Catégorie : Multimodal Prompts
Catégorie : LLM Application
Catégorie : Computer Vision
Catégorie : Retrieval-Augmented Generation
Catégorie : Fine-tuning
Catégorie : Large Language Modeling

Outils que vous découvrirez

Catégorie : Prompt Engineering
Catégorie : Model Deployment
Catégorie : AI Workflows
Catégorie : Model Context Protocol
Catégorie : Vision Transformer (ViT)
Catégorie : Agentic Workflows
Catégorie : Hugging Face
Catégorie : Generative AI

Détails à connaître

Certificat partageable

Ajouter à votre profil LinkedIn

Récemment mis à jour !

juin 2026

Évaluations

5 devoirs

Enseigné en Anglais

Découvrez comment les employés des entreprises prestigieuses maîtrisent des compétences recherchées

En savoir plus sur Coursera pour les affaires

logos de Petrobras, TATA, Danone, Capgemini, P&G et L'Oreal

Il y a 4 modules dans ce cours

By the end of this course, you will be able to:

• Explain how CLIP aligns image and text in a shared embedding space, use VLMs to perform visual question answering, image captioning, and document understanding, and navigate the Hub for multimodal models. • Build a pipeline that transcribes audio with Whisper and generates images with Diffusers, and describe how LoRA fine-tuning and multimodal RAG extend VLM capabilities. • Build an agentic workflow using smolagents with VLM support and MCP tool integration to automate multi-step tasks requiring vision and reasoning. • Apply ShieldGemma 2 to filter inputs and outputs of a VLM pipeline, test against adversarial inputs, and document failure modes for responsible deployment. AI that can only read text is already behind. This intermediate course assumes you're comfortable with the HF Transformers library and basic Gradio development. It opens with a practical challenge: 2,000 products with photos but no descriptions, and a stack of invoice PDFs that need structured data extraction. You’ll learn how CLIP aligned images and text in a shared space, then use modern vision-language models to caption products, answer questions about charts, and pull fields from invoices. Go wider: transcribe customer calls with Whisper, generate images from text briefs with Diffusers, and learn when to fine-tune a model versus when to give it better context through retrieval. Build agent workflows that can see screenshots, reason about what’s on screen, and connect to external tools through the Model Context Protocol (MCP) to act on what they find. The course closes with a deployment readiness review: your CTO wants to launch the AI pipeline next week, and you need to decide whether it’s safe to ship — with safety filtering, adversarial testing, and documented failure modes backing your recommendation.

Most AI models see one thing at a time — text or images, never both. Vision-language models change that, and the key insight starts with CLIP: images and text can live in the same embedding space. This module builds your multimodal mental model from CLIP to modern VLMs, then puts them to work on real tasks: visual question answering, image captioning, and document AI.

Inclus

4 vidéos1 lecture1 devoir1 laboratoire non noté

4 vidéosTotal 26 minutes

Welcome: From Text-Only to Multimodal AI4 minutes
How CLIP Aligns Images and Text in a Shared Space7 minutes
Visual Question Answering and Image Captioning with VLMs8 minutes
Document AI — OCR, Layout Parsing, and Structured Extraction8 minutes

1 lectureTotal 4 minutes

Multimodal Models and Hub Navigation Reference4 minutes

1 devoirTotal 30 minutes

Practice Assignment: Multimodal Foundations and VLMs30 minutes

1 laboratoire non notéTotal 18 minutes

Caption Products and Extract Invoice Data for BrightCart18 minutes

Multimodal AI isn’t limited to vision — audio transcription and image generation are equally practical capabilities that HF makes accessible through Whisper and Diffusers. This module covers both, then introduces the strategic decision every practitioner faces: when to fine-tune a model with LoRA versus when to use retrieval-augmented generation to give the model better context.

Inclus

3 vidéos1 lecture1 devoir1 laboratoire non noté

3 vidéosTotal 20 minutes

Transcribing Audio with Whisper7 minutes
Generating Images from Text with Diffusers7 minutes
When to Fine-Tune vs. When to Retrieve — LoRA and Multimodal RAG7 minutes

1 lectureTotal 4 minutes

Audio, Diffusers, and Adaptation Strategies Reference4 minutes

1 devoirTotal 30 minutes

Practice Assignment: Audio, Generation, and Adaptation Strategies30 minutes

1 laboratoire non notéTotal 20 minutes

Transcribe Calls and Generate Visual Summaries for BrightCart20 minutes

Running a single model is useful. Building a system where a model can see, reason, pick tools, act, and iterate — that’s an agent. This module teaches you to build agentic workflows with HF smolagents, connect agents to external tools via MCP (Model Context Protocol), and give agents vision capabilities so they can reason over screenshots and visual inputs.

Inclus

3 vidéos1 lecture1 devoir1 laboratoire non noté

3 vidéosTotal 27 minutes

Building Your First Agent with smolagents9 minutes
Connecting Agents to External Tools via MCP9 minutes
Vision-Powered Agents — Screenshot, Reason, Act, Iterate9 minutes

1 lectureTotal 3 minutes

Smolagents, MCP, and Agent Design Patterns Reference3 minutes

1 devoirTotal 30 minutes

Practice Assignment: Agents, MCP, and Tool Use30 minutes

1 laboratoire non notéTotal 18 minutes

Build an Agent That Automates BrightCart’s Catalog Workflow18 minutes

A multimodal system that works in a notebook can still fail catastrophically in production — generating harmful images, misreading sensitive documents, or amplifying biases across modalities. This module teaches you to wrap VLM pipelines with safety filtering, test against adversarial inputs, and document failure modes before anyone else finds them.

Inclus

4 vidéos2 lectures2 devoirs1 laboratoire non noté

4 vidéosTotal 24 minutes

Multimodal Safety Risks — What Can Go Wrong and Why6 minutes
Filtering with ShieldGemma 2 — Input and Output Safety8 minutes
Testing Against Adversarial Inputs and Documenting Failure Modes8 minutes
What You Can See, Build, and Ship Safely2 minutes

2 lecturesTotal 8 minutes

Multimodal Safety and Responsible Deployment Reference4 minutes
Applying Your Multimodal AI and Agent Skills4 minutes

2 devoirsTotal 60 minutes

Final Assessment: Introduction to Multimodal AI with HF 30 minutes
Practice Assignment: Responsible Deployment30 minutes

1 laboratoire non notéTotal 18 minutes

Wrap BrightCart’s VLM Pipeline with Safety Filtering18 minutes

Instructeur

Hugging Face

3 Cours39 apprenants

Offert par

Hugging Face

En savoir plus sur Entrepreneurship

DeepLearning.AI
Open Source Models with Hugging Face
Projet
Catégorie : Gratuit
Catégorie : Crédit proposé
Hugging Face
Getting Started with Hugging Face Transformers
Cours
Catégorie : Prévisualisation
Catégorie : Crédit proposé
Coursera
End-to-End Multimodal AI: Fine-Tuning, Fusion, and MLOps
Cours
Statut : Essai gratuit
Catégorie : Crédit proposé
Coursera
Multimodal Intelligence - Vision, Audio & Language in Action
Certificat Professionnel
Statut : Essai gratuit
Catégorie : Crédit proposé

Pour quelles raisons les étudiants sur Coursera nous choisissent-ils pour leur carrière ?

Felipe M.

Étudiant(e) depuis 2018

’Pouvoir suivre des cours à mon rythme à été une expérience extraordinaire. Je peux apprendre chaque fois que mon emploi du temps me le permet et en fonction de mon humeur.’

Jennifer J.

Étudiant(e) depuis 2020

’J'ai directement appliqué les concepts et les compétences que j'ai appris de mes cours à un nouveau projet passionnant au travail.’

Larry W.

Étudiant(e) depuis 2021

’Lorsque j'ai besoin de cours sur des sujets que mon université ne propose pas, Coursera est l'un des meilleurs endroits où se rendre.’

Chaitanya A.

’Apprendre, ce n'est pas seulement s'améliorer dans son travail : c'est bien plus que cela. Coursera me permet d'apprendre sans limites.’

Foire Aux Questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you purchase a Certificate you get access to all course materials, including graded assignments. Upon completing the course, your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If fin aid or scholarship is available for your learning program selection, you’ll find a link to apply on the description page.