Large Multimodal Model Prompting with Gemini

Instructor: Erwin Huizenga

3,214 already enrolled

Project

Build in-demand job skills with step-by-step instructions

4.7

(34 reviews)

Beginner level

Recommended experience

2 hours

Learn at your own pace

Hands-on learning

What you'll learn

Learn state-of-the-art techniques for getting the most out of multimodal AI with Google’s Gemini model family.
Leverage the power of Gemini’s cross-modal attention to fuse information from text, images, and video for complex reasoning tasks.
Extend Gemini’s capabilities with external knowledge and live data via function calling and API integration.

Details to know

Taught in English

No downloads or installation required

Only available on desktop

Learn, practice, and apply job-ready skills in less than 2 hours

Receive training from industry experts
Gain hands-on experience solving real-world job tasks

About this project

Multimodal models like Gemini are pushing the boundaries of what’s possible by unifying traditionally siloed data modalities. With Gemini, you can build applications that understand and reason across text, images, and videos, enabling a new class of intelligent systems. For example, you could build a virtual interior designer that analyzes a user’s room images, understands their style preferences from a text description, and generates personalized design recommendations. Or you could create a smart document processing pipeline that extracts structured data from complex PDFs, answers questions based on the content, and generates human-like summaries.

You’ll learn prompt engineering techniques to guide Gemini’s behavior and optimize its performance for diverse use cases, from creative story generation to analytical report writing. You’ll also discover how to integrate Gemini with external APIs and databases using function calling, so your applications can draw on real-time data and dynamic content.

What you’ll learn, in detail:

1. Introduction to Gemini Models: Explore the Gemini model family, and understand the key differences and use cases for Gemini Nano, Pro, Flash, and Ultra. Understand how to select optimal models based on capability, latency, and cost considerations.
2. Multimodal Prompting and Parameter Control: Learn advanced techniques for structuring effective text-image-video prompts to elicit desired model behavior. Tune key parameters like temperature, top_p, and top_k to control model creativity vs. determinism.
3. Best Practices for Multimodal Prompting: Get experience with prompt engineering for Gemini multimodal models, and learn best practices around role assignment, task decomposition, and formatting. Analyze the impact of prompt-image ordering on model performance for different objectives.
4. Creating Use Cases with Images: Build engaging multimodal applications like interior design assistants and receipt itemization tools. Leverage Gemini’s cross-modal reasoning capabilities to analyze relationships between entities across multiple images.
5. Developing Use Cases with Videos: Implement “needle in the haystack” semantic video search powered by Gemini’s large context window. Explore techniques for long-form video QA and content summarization.
6. Integrating Real-Time Data with Function Calling: Extend Gemini with external knowledge and live data via function calling and API integration. Combine Gemini’s Natural Language Understanding (NLU) capabilities with APIs for up-to-date facts and interactive services.
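
To make the temperature, top_p, and top_k parameters from item 2 more concrete, here is a minimal, generic sketch of how such settings typically reshape a next-token distribution before sampling. It is not Gemini’s internal implementation and does not call any Gemini API; the function names `filter_distribution` and `sample`, the toy logits, and the order of operations (temperature, then top-k, then top-p) are all assumptions made for illustration.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering.

    Illustrative only: `logits` is a toy {token: score} mapping, not model output.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


def sample(distribution, rng):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(distribution)
    weights = [distribution[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {"sofa": 2.0, "chair": 1.0, "lamp": 0.0}  # made-up scores
    rng = random.Random(0)
    for temp in (0.2, 1.0, 2.0):
        dist = filter_distribution(toy_logits, temperature=temp, top_k=3, top_p=0.95)
        print(temp, dist, sample(dist, rng))
```

In this sketch, lowering the temperature concentrates probability on the highest-scoring token (more deterministic output), while smaller top_k or top_p values shrink the pool of candidate tokens.
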
Through this course, you’ll become well-versed in Gemini’s capabilities, learn how to maximize them in different use cases, and build a portfolio of practical techniques for architecting advanced multimodal AI applications. Note that due to technical requirements, this course features downloadable-only notebooks on the learning platform. You are free to download, review, and run these notebooks on your own.
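
As a rough illustration of the function-calling pattern described in item 6 of the outline, the sketch below shows the application side of the loop: the app declares a tool, the model replies with a structured request naming that tool and its arguments, and the app executes the matching local function. No Gemini SDK is used here. The tool `get_exchange_rate`, its placeholder rate, the declaration layout, and the stubbed `model_request` dictionary are all hypothetical stand-ins, not the course’s code or the real API’s response format.

```python
# Hypothetical tool: a real application would call a live API here.
def get_exchange_rate(base: str, target: str) -> dict:
    placeholder_rates = {("USD", "EUR"): 0.5}  # made-up value, not real data
    rate = placeholder_rates.get((base, target))
    if rate is None:
        raise LookupError(f"no rate available for {base}->{target}")
    return {"base": base, "target": target, "rate": rate}


# Assumed declaration format: a JSON-schema-like description the model could
# use to decide when and how to request the function.
DECLARATIONS = {
    "get_exchange_rate": {
        "description": "Look up the exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string"},
                "target": {"type": "string"},
            },
            "required": ["base", "target"],
        },
    }
}

# Maps declared names to the Python callables that implement them.
REGISTRY = {"get_exchange_rate": get_exchange_rate}


def dispatch(function_call: dict, registry: dict, declarations: dict) -> dict:
    """Validate a model-issued function call and run the matching local function."""
    name = function_call["name"]
    if name not in registry:
        raise KeyError(f"model requested unknown function: {name}")
    args = function_call.get("args", {})
    required = declarations[name]["parameters"].get("required", [])
    missing = [param for param in required if param not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return registry[name](**args)


if __name__ == "__main__":
    # Stand-in for the structured request a model might return.
    model_request = {
        "name": "get_exchange_rate",
        "args": {"base": "USD", "target": "EUR"},
    }
    result = dispatch(model_request, REGISTRY, DECLARATIONS)
    # In a full loop, `result` would be sent back to the model so it can
    # compose a natural-language answer grounded in the returned data.
    print(result)
```

Validating the requested name and arguments before executing anything is a deliberate choice in this sketch, since the request originates from model output rather than trusted code.
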

Instructor

Instructor ratings

(10 ratings)

Erwin Huizenga

DeepLearning.AI

2 Courses

7,586 learners

Offered by

DeepLearning.AI

How you'll learn

Hands-on, project-based learning
Practice new skills by completing job-related tasks with step-by-step instructions.
No downloads or installation required
Access the tools and resources you need in a cloud environment.
Available only on desktop
This project is designed for laptops or desktop computers with a reliable Internet connection, not mobile devices.