Generative AI for Audio and Images: Models and Applications

Generative AI for Audio and Images: Models and Applications

Instructor: Anahita Doosti

Access provided by Duke University

4 modules

Gain insight into a topic and learn the fundamentals.

3 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

4 modules

Gain insight into a topic and learn the fundamentals.

3 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

Skills you'll gain

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

17 assignments

Taught in English

See how employees at top companies are mastering in-demand skills

Learn more about Coursera for Business

logos of Petrobras, TATA, Danone, Capgemini, P&G and L'Oreal

There are 4 modules in this course

Generative AI for Audio and Images: Models and Applications offers an in-depth exploration of how modern generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion models are used to create, manipulate, and enhance audio, image, and video content.

Learners examine the architectures, training processes, and use cases of these models across different modalities, gaining both conceptual understanding and practical insights through hands-on activities. The course also highlights the ethical and societal implications of generative AI, including bias, transparency, intellectual property, and the challenges of deepfake technologies. By covering foundational theory as well as state-of-the-art approaches and applications, this course prepares learners to apply and develop generative AI creatively and responsibly for the audio and image modalities. By the end of this course, learners will be able to: Outline core concepts, challenges, and the history of AI-generated audio. Analyze important foundational audio generation models, such as variational and vector quantized autoencoders (VAE and VQ-VAE) Examine how these models integrate with the latest GenAI technologies to form hybrid, state-of-the-art transformer and diffusion-based audio generation systems, Study the architecture and functionality of Generative Adversarial Networks (GANs), and their variations. Implement and train GAN models for creating and enhancing visual content, Explore cutting-edge techniques such as diffusion models and transformers for image and video creation. Discuss the ethical considerations regarding generative AI for audio and images.

This module introduces the foundations and core concepts of AI-generated audio. Learners explore why audio generation is uniquely challenging, such representation and evaluation challenges. They learn how audio is represented and processed, compare waveform and symbolic formats, and common audio data formats and Python libraries for working with audio. The module also examines methods for evaluating generated audio and provides a framework for categorizing audio generation approaches by their functionality and human–AI collaboration level. It concludes with a historical overview of AI-generated audio, tracing its evolution from early rule-based methods to modern deep generative models.

What's included

21 videos3 readings4 assignments2 discussion prompts

21 videosTotal 135 minutes

Course Introduction5 minutes
Meet your instructor: Anahita Doosti1 minute
Meet your instructor: Nasimeh Asgarian1 minute
Overview of AI for Audio and Music Generation7 minutes
Why Is Audio Generation Difficult?9 minutes
Data representation: Waveform vs Symbolic7 minutes
Data Formats7 minutes
Evaluation (part 1)4 minutes
Evaluation (part 2)9 minutes
Categorizing Audio Generation Approaches5 minutes
The Many Forms of Audio Generation6 minutes
Audio Functionality8 minutes
Human-AI Collaboration 6 minutes
Putting It into Practice3 minutes
An Overview of the Progress Throughout the Years6 minutes
Pre-ML Approaches: Algorithmic, Rule-Based9 minutes
Early ML Approaches: HMMs, FF Neural Networks6 minutes
Modern Approaches 1: RNNs and CNNs9 minutes
Modern Approaches 2: Autoencoders/VAEs and GANs6 minutes
Modern Approaches 3: Transformers and Diffusion9 minutes
Module 1 Recap2 minutes

3 readingsTotal 140 minutes

Terminology10 minutes
Python Libraries for Audio Data10 minutes
WaveNet Implementation (Hands-on Lab)120 minutes

4 assignmentsTotal 145 minutes

Module 1 Quiz80 minutes
Practice Quiz 130 minutes
Practice Quiz 220 minutes
Practice Quiz 315 minutes

2 discussion promptsTotal 20 minutes

Learning Goal10 minutes
Is AI even capable of achieving true creativity?10 minutes

Building on the fundamentals, this module dives into advanced models for audio generation. Learners study Variational Autoencoders (VAEs) and their variants, and how they apply to melody generation and speech synthesis. The module also explores transformer-based models, such as Music Transformer, AudioLM, and FastSpeech, as well as diffusion-based models like DiffWave and Stable Audio. Through these lessons, learners gain a comprehensive understanding of how modern generative architectures produce realistic, high-quality audio and music.

What's included

31 videos2 readings4 assignments

31 videosTotal 201 minutes

Introduction to Variational Autoencoders4 minutes
Autoencoders4 minutes
Latent Space7 minutes
Inside the Encoder-Decoder Blocks7 minutes
Training VAEs (Part 1)4 minutes
Training VAEs (Part 2)7 minutes
Vector Quantized Variational Autoencoders (Part 1)6 minutes
Vector Quantized Variational Autoencoders (Part 2)5 minutes
Using VAE to Generate Melodies7 minutes
How to Condition VAEs with Additional Musical Information Such as Chord, Scale?7 minutes
Example: MusicVAE8 minutes
Attribute Vector Arithmetic for Melodies 7 minutes
Example: Jukebox6 minutes
Example: Speech Synthesis8 minutes
Strengths and limitations of VAE-based approaches4 minutes
Transformer Primer5 minutes
Transformers for Audio Generation5 minutes
Example: Music Transformer12 minutes
Revisiting JukeBox: How Transformers Can Generate Waveform Audio! (Part 1)8 minutes
Revisiting JukeBox: How Transformers Can Generate Waveform Audio! (Part 2)3 minutes
A New Paradigm: Audio Codec + Language Model (Part 1)5 minutes
A New Paradigm: Audio Codec + Language Model (Part 2)8 minutes
Example: FastSpeech7 minutes
Strengths and Limitations of Transformer-Based Approaches5 minutes
What Are Diffusion Models, and How Can They Generate Audio?5 minutes
Example: Stable Audio6 minutes
Example: DiffWave4 minutes
Strengths and Limitations of Diffusion-Based Approaches5 minutes
How Do the Recent Models Compare to Each Other?9 minutes
What Is on the Horizon? Where Are We Headed?7 minutes
Module 2 Recap2 minutes

2 readingsTotal 130 minutes

Resource Guide10 minutes
Audio Generation Models Inference and Comparison (Hands-on Lab)120 minutes

4 assignmentsTotal 125 minutes

Module 2 Quiz80 minutes
Practice Quiz15 minutes
Practice Quiz15 minutes
Practice Quiz15 minutes

This module transitions from audio to image generation, introducing the principles and evolution of image and video synthesis. Learners examine key architectures like GANs and VAEs, explore how adversarial training works, and study variations such as Conditional and Progressive GANs, Pix2Pix, and CycleGAN. The module also connects theory to practice by showcasing creative and commercial applications—from art and design to data augmentation—demonstrating how generative models enhance realism and variety in visual outputs.

What's included

22 videos3 readings5 assignments

22 videosTotal 156 minutes

Overview of AI for Image and Video Generation7 minutes
Applications of Image and Video Generation7 minutes
DALL-E and MidJourney Examples7 minutes
Sora Examples4 minutes
A Short History of Image Generation7 minutes
Revisit VAE5 minutes
Introducing GAN7 minutes
Discriminator6 minutes
Generator8 minutes
GAN Training6 minutes
Challenges and Best Practices for GAN Training5 minutes
Progressive GAN7 minutes
Conditional GANs7 minutes
Applications, Advantages and Limitations of cGANs6 minutes
Image-to-Image Translation7 minutes
Challenges and Applications of Image-to-Image Translation5 minutes
Text to Image GAN8 minutes
Other GAN Variations: Cycle GAN, DCGAN, StyleGAN9 minutes
Creative design9 minutes
Commercial Use Cases7 minutes
Data Augmentation7 minutes
Module 3 Recap2 minutes

3 readingsTotal 140 minutes

Style GAN10 minutes
Data synthesis10 minutes
DCGAN from Scratch (Hands-on Lab)120 minutes

5 assignmentsTotal 140 minutes

Module 3 Quiz80 minutes
Practice Quiz 115 minutes
Practice Quiz 215 minutes
Practice Quiz 315 minutes
Practice Quiz 415 minutes

In this module,we explore the final stages of what large language models (LLMs) can offer. You’ll learn how and when to use fine-tuning, along with the pros and cons of different approaches. Throughout the course, you will receive relevant assignments that prepare you for the capstone project: building a fully functional chatbot