Generative AI for Audio and Images: Models and Applications

This course is part of the Generative AI Fundamentals Specialization

Instructor: Anahita Doosti

4 modules

Gain insight into a topic and learn the fundamentals.

3 weeks to complete

Under 10 hours per week

Flexible schedule

Learn at your own pace

Skills you'll gain

Autoencoders
Computer Vision
Music
Generative Adversarial Networks (GANs)
Vision Transformer (ViT)
Generative Model Architectures
Responsible AI
Recurrent Neural Networks (RNNs)
Generative AI
Embeddings
Image Analysis
Convolutional Neural Networks

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

November 2025

Assessments

17 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Generative AI Fundamentals Specialization

When you enroll in this course, you will also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 4 modules in this course

Generative AI for Audio and Images: Models and Applications offers an in-depth exploration of how modern generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion models are used to create, manipulate, and enhance audio, image, and video content.

Learners examine the architectures, training processes, and use cases of these models across different modalities, gaining both conceptual understanding and practical insights through hands-on activities. The course also highlights the ethical and societal implications of generative AI, including bias, transparency, intellectual property, and the challenges of deepfake technologies. By covering foundational theory as well as state-of-the-art approaches and applications, this course prepares learners to apply and develop generative AI creatively and responsibly for the audio and image modalities.

By the end of this course, learners will be able to:

Outline core concepts, challenges, and the history of AI-generated audio.
Analyze important foundational audio generation models, such as variational and vector-quantized autoencoders (VAE and VQ-VAE).
Examine how these models integrate with the latest GenAI technologies to form hybrid, state-of-the-art transformer- and diffusion-based audio generation systems.
Study the architecture and functionality of Generative Adversarial Networks (GANs) and their variations.
Implement and train GAN models for creating and enhancing visual content.
Explore cutting-edge techniques such as diffusion models and transformers for image and video creation.
Discuss the ethical considerations regarding generative AI for audio and images.

This module introduces the foundations and core concepts of AI-generated audio. Learners explore why audio generation is uniquely challenging, including the difficulties of representation and evaluation. They learn how audio is represented and processed, compare waveform and symbolic formats, and survey common audio data formats and Python libraries for working with audio. The module also examines methods for evaluating generated audio and provides a framework for categorizing audio generation approaches by their functionality and level of human-AI collaboration. It concludes with a historical overview of AI-generated audio, tracing its evolution from early rule-based methods to modern deep generative models.
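
To make the waveform-versus-symbolic distinction concrete, here is a minimal sketch that is not taken from the course materials. It assumes a symbolic melody stored as (MIDI note, duration) pairs and renders it to raw sine-wave samples; the sample rate, the `render` helper, and the example melody are all hypothetical choices made for illustration.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (an assumed value for this sketch)


def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to a frequency (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def render(symbolic, sample_rate=SAMPLE_RATE):
    """Render (midi_note, duration_seconds) pairs into a list of sine samples."""
    samples = []
    for note, duration in symbolic:
        freq = midi_to_hz(note)
        n_samples = int(duration * sample_rate)
        samples.extend(
            math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n_samples)
        )
    return samples


# Symbolic form: three events (C4, E4, G4) describe two seconds of music.
melody = [(60, 0.5), (64, 0.5), (67, 1.0)]
# Waveform form: the same two seconds as tens of thousands of amplitude values.
waveform = render(melody)
print(len(melody), "symbolic events ->", len(waveform), "waveform samples")
```

The size gap between the two forms hints at why modeling raw waveforms is harder than modeling symbolic sequences: a model must stay coherent over far more time steps.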

What's included

21 videos, 3 readings, 4 assignments, 2 discussion prompts

21 videos - 135 minutes total

Course Introduction - 5 minutes
Meet your instructor: Anahita Doosti - 1 minute
Meet your instructor: Nasimeh Asgarian - 1 minute
Overview of AI for Audio and Music Generation - 7 minutes
Why Is Audio Generation Difficult? - 9 minutes
Data representation: Waveform vs Symbolic - 7 minutes
Data Formats - 7 minutes
Evaluation (part 1) - 4 minutes
Evaluation (part 2) - 9 minutes
Categorizing Audio Generation Approaches - 5 minutes
The Many Forms of Audio Generation - 6 minutes
Audio Functionality - 8 minutes
Human-AI Collaboration - 6 minutes
Putting It into Practice - 3 minutes
An Overview of the Progress Throughout the Years - 6 minutes
Pre-ML Approaches: Algorithmic, Rule-Based - 9 minutes
Early ML Approaches: HMMs, FF Neural Networks - 6 minutes
Modern Approaches 1: RNNs and CNNs - 9 minutes
Modern Approaches 2: Autoencoders/VAEs and GANs - 6 minutes
Modern Approaches 3: Transformers and Diffusion - 9 minutes
Module 1 Recap - 2 minutes

3 readings - 140 minutes total

Terminology - 10 minutes
Python Libraries for Audio Data - 10 minutes
WaveNet Implementation (Hands-on Lab) - 120 minutes

4 assignments - 145 minutes total

Module 1 Quiz - 80 minutes
Practice Quiz 1 - 30 minutes
Practice Quiz 2 - 20 minutes
Practice Quiz 3 - 15 minutes

2 discussion prompts - 20 minutes total

Learning Goal - 10 minutes
Is AI even capable of achieving true creativity? - 10 minutes

Building on the fundamentals, this module dives into advanced models for audio generation. Learners study Variational Autoencoders (VAEs) and their variants, and how they apply to melody generation and speech synthesis. The module also explores transformer-based models, such as Music Transformer, AudioLM, and FastSpeech, as well as diffusion-based models like DiffWave and Stable Audio. Through these lessons, learners gain a comprehensive understanding of how modern generative architectures produce realistic, high-quality audio and music.

What's included

31 videos, 2 readings, 4 assignments

31 videos - 201 minutes total

Introduction to Variational Autoencoders - 4 minutes
Autoencoders - 4 minutes
Latent Space - 7 minutes
Inside the Encoder-Decoder Blocks - 7 minutes
Training VAEs (Part 1) - 4 minutes
Training VAEs (Part 2) - 7 minutes
Vector Quantized Variational Autoencoders (Part 1) - 6 minutes
Vector Quantized Variational Autoencoders (Part 2) - 5 minutes
Using VAE to Generate Melodies - 7 minutes
How to Condition VAEs with Additional Musical Information Such as Chord, Scale? - 7 minutes
Example: MusicVAE - 8 minutes
Attribute Vector Arithmetic for Melodies - 7 minutes
Example: Jukebox - 6 minutes
Example: Speech Synthesis - 8 minutes
Strengths and limitations of VAE-based approaches - 4 minutes
Transformer Primer - 5 minutes
Transformers for Audio Generation - 5 minutes
Example: Music Transformer - 12 minutes
Revisiting JukeBox: How Transformers Can Generate Waveform Audio! (Part 1) - 8 minutes
Revisiting JukeBox: How Transformers Can Generate Waveform Audio! (Part 2) - 3 minutes
A New Paradigm: Audio Codec + Language Model (Part 1) - 5 minutes
A New Paradigm: Audio Codec + Language Model (Part 2) - 8 minutes
Example: FastSpeech - 7 minutes
Strengths and Limitations of Transformer-Based Approaches - 5 minutes
What Are Diffusion Models, and How Can They Generate Audio? - 5 minutes
Example: Stable Audio - 6 minutes
Example: DiffWave - 4 minutes
Strengths and Limitations of Diffusion-Based Approaches - 5 minutes
How Do the Recent Models Compare to Each Other? - 9 minutes
What Is on the Horizon? Where Are We Headed? - 7 minutes
Module 2 Recap - 2 minutes

2 readings - 130 minutes total

Resource Guide - 10 minutes
Audio Generation Models Inference and Comparison (Hands-on Lab) - 120 minutes

4 assignments - 125 minutes total

Module 2 Quiz - 80 minutes
Practice Quiz - 15 minutes
Practice Quiz - 15 minutes
Practice Quiz - 15 minutes

This module transitions from audio to image generation, introducing the principles and evolution of image and video synthesis. Learners examine key architectures like GANs and VAEs, explore how adversarial training works, and study variations such as Conditional and Progressive GANs, Pix2Pix, and CycleGAN. The module also connects theory to practice by showcasing creative and commercial applications—from art and design to data augmentation—demonstrating how generative models enhance realism and variety in visual outputs.
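
As a rough illustration of the adversarial training idea mentioned above, the sketch below computes the two competing objectives of the standard GAN formulation from a handful of discriminator outputs. It is not the course's lab code: the function names and the probability values are hypothetical, and the generator objective shown is the commonly used non-saturating variant.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(x) toward 1 and D(G(z)) toward 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs: probability that each sample is real.
d_real = [0.9, 0.8]  # scores on real images
d_fake = [0.2, 0.1]  # scores on generated images

print("discriminator loss:", round(discriminator_loss(d_real, d_fake), 4))
print("generator loss:", round(generator_loss(d_fake), 4))
```

In this example the discriminator is winning, so its loss is small while the generator's is large. Training alternates updates so that each network tries to lower its own loss at the other's expense.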

What's included

22 videos, 3 readings, 5 assignments

22 videos - 156 minutes total

Overview of AI for Image and Video Generation - 7 minutes
Applications of Image and Video Generation - 7 minutes
DALL-E and MidJourney Examples - 7 minutes
Sora Examples - 4 minutes
A Short History of Image Generation - 7 minutes
Revisit VAE - 5 minutes
Introducing GAN - 7 minutes
Discriminator - 6 minutes
Generator - 8 minutes
GAN Training - 6 minutes
Challenges and Best Practices for GAN Training - 5 minutes
Progressive GAN - 7 minutes
Conditional GANs - 7 minutes
Applications, Advantages and Limitations of cGANs - 6 minutes
Image-to-Image Translation - 7 minutes
Challenges and Applications of Image-to-Image Translation - 5 minutes
Text to Image GAN - 8 minutes
Other GAN Variations: Cycle GAN, DCGAN, StyleGAN - 9 minutes
Creative design - 9 minutes
Commercial Use Cases - 7 minutes
Data Augmentation - 7 minutes
Module 3 Recap - 2 minutes

3 readings - 140 minutes total

Style GAN - 10 minutes
Data synthesis - 10 minutes
DCGAN from Scratch (Hands-on Lab) - 120 minutes

5 assignments - 140 minutes total

Module 3 Quiz - 80 minutes
Practice Quiz 1 - 15 minutes
Practice Quiz 2 - 15 minutes
Practice Quiz 3 - 15 minutes
Practice Quiz 4 - 15 minutes

In this module, we explore the final stages of what large language models (LLMs) can offer. You’ll learn how and when to use fine-tuning, along with the pros and cons of different approaches. Throughout the course, you will receive relevant assignments that prepare you for the capstone project: building a fully functional chatbot.

What's included

21 videos, 1 reading, 4 assignments

21 videos - 146 minutes total

Overview on Key Models and Architectures - 7 minutes
High-Level Overview of Vision Transformer - 7 minutes
Encoder-Decoder Design Pattern - 8 minutes
Convolutional Encoders - 9 minutes
Self Attention - 9 minutes
Spatial vs. Channel vs. Temporal Attention - 8 minutes
Diffusion Model Architecture High-Level Overview - 7 minutes
Forward / Diffusion Process - 6 minutes
Reverse Process - 7 minutes
Diffusion Model Training - 4 minutes
Examples of Diffusion Model - 5 minutes
Bias in Training Data - 8 minutes
Transparency - 9 minutes
Intellectual Property - 8 minutes
Data Privacy - 7 minutes
Deepfake Intro - 9 minutes
Deep Fake - Face Swap - 5 minutes
Voice Cloning - 4 minutes
Video Deep Fake - 6 minutes
Module 4 Recap - 2 minutes
Course Wrap Up - 3 minutes

1 reading - 120 minutes total

ViT vs. Diffusion (Hands-on Lab) - 120 minutes

4 assignments - 158 minutes total

Module 4 Quiz - 80 minutes
Practice Quiz 1 - 30 minutes
Practice Quiz 2 - 30 minutes
Practice Quiz 3 - 18 minutes

Earn a career certificate.

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Anahita Doosti

Alberta Machine Intelligence Institute

2 courses, 197 learners

Offered by

Alberta Machine Intelligence Institute

Frequently asked questions

To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.