Reward Programming: Optimizing RL Efficiency and Safety

This course is part of the "Foundations of Reinforcement Learning" Specialization

Instructor: Ashutosh Trivedi

5 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

1 week to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Identify limitations of standard scalar reward formulations, including reward hacking, specification gaming, and brittle proxies.
Express structured learning objectives using formal tools such as temporal logic, automata, and reward machines.
Construct and analyze reward mechanisms based on temporal logic, automata, product MDPs, reward machines, and reward shaping.
Model reward-programming problems under hidden state, memory, hierarchy, multi-agent interaction, and continuous-time dynamics.

Skills you'll gain

Reinforcement Learning
Safety and Security
Machine Learning
Theoretical Computer Science
Model Evaluation
Machine Learning Methods
Agentic Systems
Computational Logic
Functional Specification
Verification and Validation
Continuous Monitoring
Responsible AI
Markov Model
Model Optimization

Tools you'll learn

AI Workflows

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

July 2026

Assessments

6 assignments

Taught in English

There are 5 modules in this course

How do we design rewards that guide reinforcement-learning agents toward the behavior we actually intend? This course examines reward design as a programming and specification problem in reinforcement learning.

Classical reinforcement learning usually assumes that the objective is given as a scalar reward function. In practice, however, many real tasks involve goals, constraints, temporal order, safety requirements, recurrence, partial observability, hierarchy, other agents, and long-run behavioral expectations that are difficult to express through one-step rewards alone. Poorly designed rewards can lead to reward hacking, specification gaming, and policies that optimize the written objective while missing the designer’s intent.

The course introduces reward programming as a structured approach to specifying, shaping, inferring, monitoring, and auditing what an agent should learn. Learners study temporal logic, automata, product MDPs, and reward machines as tools for representing objectives that depend on history, progress, safety, and long-run behavior. They also study reward shaping, inverse reinforcement learning, preference-based feedback, and automata-learning approaches for inferring or improving reward mechanisms. The course then examines richer modeling abstractions for reward programming, including partially observable Markov decision processes, memory and beliefs, hierarchical and recursive tasks, multi-agent settings, and continuous-time systems. The final module studies safety, shielding, constrained RL, auditing, stress testing, and a reward-engineering workflow for connecting designer intent to specifications, reward mechanisms, learning, safety layers, and revision.

By the end of the course, learners will be able to design and infer structured reward mechanisms, evaluate whether they align with intended behavior, and reason about their implications for safety, transparency, and reliability.

This course can be taken for academic credit as part of CU Boulder’s Master of Science in Computer Science (MS-CS) and Master of Science in Artificial Intelligence (MS-AI) degrees offered on the Coursera platform. These fully accredited graduate degrees offer targeted courses, short 8-week sessions, and pay-as-you-go tuition. Admission is based on performance in three preliminary courses, not academic history. CU degrees on Coursera are ideal for recent graduates or working professionals.

Learn more:
MS in Artificial Intelligence: https://www.coursera.org/degrees/ms-artificial-intelligence-boulder
MS in Computer Science: https://coursera.org/degrees/ms-computer-science-boulder

This module introduces reward engineering as the problem of translating designer intent into an objective signal that a reinforcement learning agent can optimize. Classical reinforcement learning often assumes that the reward function is already given. In practice, however, the reward must be designed, specified, shaped, inferred, or audited. We begin with the idea of programming by rewards: instead of programming each action directly, the designer specifies an objective signal and the agent learns behavior by optimizing that signal. This is powerful, but fragile. We then study reward hacking and specification gaming, where high reward does not imply intended behavior. The module then explains why many objectives cannot be faithfully expressed as simple one-step scalar rewards. This motivates Markovian and non-Markovian rewards, and finally a formal-methods perspective based on specifications, monitors, and product MDPs. The goal is to understand why reward design is not merely a numerical tuning problem. It is a specification problem.
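
The gap between high reward and intended behavior can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the course materials: the cleaning-robot scenario, the `collect` and `dump` actions, and the function `run_episode` are all hypothetical. It assumes a proxy reward of +1 per unit of dirt collected, while the designer actually wants a clean room at the end of the episode.

```python
# Hypothetical toy (not from the course): a cleaning robot earns +1 per unit
# of dirt it collects. The designer's intent is a clean room at the end.

def run_episode(actions, initial_dirt=3):
    """Return (proxy_return, dirt_left) after executing the action sequence."""
    dirt, proxy_return = initial_dirt, 0
    for action in actions:
        if action == "collect" and dirt > 0:
            dirt -= 1
            proxy_return += 1      # one-step scalar reward
        elif action == "dump":
            dirt += 1              # no penalty: the loophole in the proxy
    return proxy_return, dirt

intended = ["collect"] * 3               # clean the room and stop
exploit = ["collect", "dump"] * 5        # recycle the same dirt for reward

print(run_episode(intended))  # (3, 0): lower reward, room is clean
print(run_episode(exploit))   # (5, 3): higher reward, room is still dirty
```

The exploiting sequence scores higher on the written objective while failing the intended one, which is the pattern the module calls specification gaming.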

Included

7 videos, 9 readings, 2 assignments

7 videos (total 71 minutes)

Course Introduction (12 minutes)
Module Introduction (3 minutes)
Reward Engineering as Specification Design (9 minutes)
When High Reward is not the Intended Behavior (12 minutes)
Temporal Structure in Learning Objectives (11 minutes)
When Rewards Need Memory (10 minutes)
From Intent to Specifications and Monitors (14 minutes)

9 readings (total 75 minutes)

Earn Academic Credit for your Work! (10 minutes)
Course Support (10 minutes)
Assessment Expectations (5 minutes)
Reward Engineering (15 minutes)
Reward Hacking and Specification Gaming (10 minutes)
Beyond One-Step Rewards (8 minutes)
Markovian and Non-Markovian Rewards (5 minutes)
Formal-Methods (10 minutes)
Summary (2 minutes)

2 assignments (total 50 minutes)

AI Policy Quiz (5 minutes)
Why Reward Engineering Is Hard (45 minutes)

This module introduces temporal logic as a specification language for reinforcement learning. In the previous module, rewards were treated as numerical signals used to guide behavior. But many intended objectives are not simply properties of one transition. They are properties of whole trajectories: whether something eventually happens, whether something bad is always avoided, whether every request is eventually served, or whether some desirable condition recurs over time. We begin by translating reinforcement-learning trajectories into labeled traces using atomic propositions. We then introduce the syntax and semantics of linear temporal logic. After that, we study common temporal specification patterns, including reachability, safety, response, recurrence, and persistence. The module then explains ω-regular objectives, Büchi automata, product MDPs, and why the choice of automaton matters for reinforcement learning. The goal is not to become logicians. The goal is to gain a precise language for describing trajectory-level behavior in reinforcement learning.
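
The five specification patterns named above can be checked by hand on small traces. The following is a minimal sketch, not the course's own code: it assumes an infinite trace is given in "lasso" form, a finite `prefix` followed by a non-empty `loop` that repeats forever, with each position labeled by a set of atomic propositions. The function names and the propositions `req`, `grant`, `home`, and `crash` are hypothetical.

```python
# Assumption: the infinite trace is prefix + loop + loop + ..., and each
# position is a set of atomic propositions (strings). loop must be non-empty.

def eventually(p, prefix, loop):           # reachability: F p
    return any(p in step for step in prefix + loop)

def always_not(p, prefix, loop):           # safety: G !p
    return not eventually(p, prefix, loop)

def recurrence(p, prefix, loop):           # G F p: p occurs infinitely often
    return any(p in step for step in loop)

def persistence(p, prefix, loop):          # F G p: p eventually holds forever
    return all(p in step for step in loop)

def response(req, grant, prefix, loop):    # G (req -> F grant)
    # Unrolling the loop twice exposes every future reachable from any
    # position in prefix + loop.
    word = prefix + loop + loop
    for i in range(len(prefix) + len(loop)):
        if req in word[i] and not any(grant in s for s in word[i:]):
            return False
    return True

prefix = [{"req"}, set(), {"grant"}]
loop = [{"home"}, set()]

print(eventually("grant", prefix, loop))        # True
print(always_not("crash", prefix, loop))        # True
print(recurrence("home", prefix, loop))         # True
print(persistence("home", prefix, loop))        # False
print(response("req", "grant", prefix, loop))   # True
```

Recurrence and persistence depend only on the loop, which is one way to see why they describe long-run behavior that no finite prefix can confirm.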

Included

7 videos, 8 readings, 1 assignment

7 videos (total 65 minutes)

Module 2 Introduction (3 minutes)
Labeling Behavior with Atomic Propositions (9 minutes)
Writing Temporal Specifications (13 minutes)
Interpreting Formulas Over Traces (11 minutes)
Reachability, Safety, Response, Recurrence, and Persistence (9 minutes)
Long-Run Objectives for Infinite Behavior (6 minutes)
Product MDPs, Limit Determinism, and Limit Reachability (15 minutes)

8 readings (total 86 minutes)

RL Trajectories and Labeled Traces (15 minutes)
Linear Temporal Logic Semantics (10 minutes)
Policies and Satisfaction Probability (5 minutes)
Common Specification Patterns (10 minutes)
ω-Regular Objectives (8 minutes)
Büchi Automata as Monitors (8 minutes)
From Automata to Model-Free RL (25 minutes)
Summary (5 minutes)

1 assignment (total 45 minutes)

Temporal Logic for Reinforcement Learning (45 minutes)

This module develops reward machines as finite-state reward specifications for reinforcement learning. Module 2 introduced temporal logic, ω-regular objectives, Büchi automata, product MDPs, and the limit-reachability bridge from temporal specifications to model-free RL. We now focus on a closely related, but more reward-centered, object: the reward machine.

A reward machine is an automaton-like structure that stores task progress and emits rewards. It is useful when the reward depends not only on the current state and action, but also on what has happened before. For example, whether “deliver” is rewarding may depend on whether the package has already been picked up. Whether entering a room is good may depend on whether the key has already been collected. Whether a request has been handled may depend on which request is currently pending. The key idea is simple: environment state + reward-machine state = state with task progress. The environment state records where the agent is. The reward-machine state records what the reward specification remembers.

This module treats reward machines as structured reward programs. We define reward machines, explain how they represent non-Markovian rewards, construct product MDPs with reward-machine state, and study examples for common task patterns. We then explain how reward-machine structure supports decomposition, counterfactual experience, interpretability, and auditing. The final lesson surveys richer reward-machine models, including counting, timed, continuous-time, and physics-informed reward machines. The goal is not to repeat the automata theory from Module 2. The goal is to understand reward machines as a practical specification device: a finite-state way to write rewards that depend on task progress.
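
The pickup-then-deliver example can be written as a small reward machine. The sketch below is one possible encoding under stated assumptions, not the course's definition: the `RewardMachine` class, the state names `u0`, `u1`, `u2`, the three-cell corridor, and the rule that unlisted (state, label) pairs self-loop with zero reward are all hypothetical choices made for illustration.

```python
class RewardMachine:
    """Finite-state reward specification driven by labels (illustrative)."""

    def __init__(self, initial, transitions):
        # transitions maps (rm_state, label) -> (next_rm_state, reward)
        self.initial = initial
        self.transitions = transitions

    def step(self, u, label):
        # Assumption: unlisted pairs leave the state unchanged, reward 0.
        return self.transitions.get((u, label), (u, 0.0))

# "deliver" pays only after "pickup" has been observed.
rm = RewardMachine("u0", {
    ("u0", "pickup"): ("u1", 0.0),
    ("u1", "deliver"): ("u2", 1.0),
})

LABELS = {0: "pickup", 2: "deliver"}   # hypothetical corridor: cells 0, 1, 2

def product_step(state, action, rm):
    """One step of the product: (position, rm_state) is the full state."""
    pos, u = state
    pos = max(0, min(2, pos + action))          # action is -1 or +1
    u, reward = rm.step(u, LABELS.get(pos))     # label may be None
    return (pos, u), reward

def rollout(actions, rm, start=(1, "u0")):
    state, total = start, 0.0
    for a in actions:
        state, r = product_step(state, a, rm)
        total += r
    return state, total

print(rollout([+1], rm))              # ((2, 'u0'), 0.0): delivery without pickup
print(rollout([-1, +1, +1], rm))      # ((2, 'u2'), 1.0): pickup, then delivery
```

Reaching cell 2 yields different rewards depending on the reward-machine component, so the reward is non-Markovian in the position alone but Markovian in the product state.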

Included

6 videos, 7 readings, 1 assignment

6 videos (total 38 minutes)

Module Introduction (4 minutes)
When Rewards Need Memory (7 minutes)
States, Transitions, and Rewards (8 minutes)
Making Non-Markovian Rewards Markovian (6 minutes)
Sequencing, Safety, Response, and Recurrence (7 minutes)
Decomposition, Interpretability, and Counterfactual Experience (6 minutes)

7 readings (total 45 minutes)

Why Rewards May Need Memory (10 minutes)
Reward Machines (10 minutes)
Product MDPs with Reward-Machine State (8 minutes)
Reward Machines for Common Task Patterns (5 minutes)
Using Reward-Machine Structure (5 minutes)
Extensions and Frontiers (5 minutes)
Summary (2 minutes)

1 assignment (total 45 minutes)

Reward Machines as Structured Reward Specifications (45 minutes)

This module studies how rewards can be made more informative, inferred from evidence, and audited for alignment with intended behavior. So far, we have studied how to specify structured objectives using temporal logic, automata, product MDPs, and reward machines. We now ask a complementary question: how can rewards be shaped, inferred, or learned? A reward can be correct but still difficult to learn from. For example, a sparse reward that gives +1 only when a task is completed may faithfully describe success, but it may provide almost no guidance during early learning. Reward shaping adds additional feedback to make learning easier. However, shaping must be designed carefully: a poorly chosen shaping reward can change the incentives and create new specification-gaming opportunities. The module then turns to methods for learning rewards from evidence. Inverse reinforcement learning infers rewards from demonstrations. Preference-based reinforcement learning infers rewards from comparisons or human feedback. Automata-learning methods can infer structured reward machines from interaction. The final lesson studies how learned rewards should be audited for faithfulness, robustness, and reward hacking. The goal is to understand reward programming as both a design problem and an inference problem.
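
One standard way to densify a sparse reward without changing incentives is potential-based shaping, which this module lists among its readings. The sketch below is a numerical illustration under assumptions, not the course's implementation: the corridor with goal at position 3, the potential `potential(s) = -|3 - s|`, and the discount `GAMMA = 0.9` are hypothetical. The shaping term is F(s, s') = γΦ(s') − Φ(s), so its discounted sum along a trajectory telescopes to γ^T Φ(s_T) − Φ(s_0).

```python
GAMMA = 0.9  # assumed discount factor

def potential(s):
    # Hypothetical potential: negative distance to a goal at position 3.
    return -abs(3 - s)

def shaping(s, s_next):
    # Potential-based shaping term: gamma * Phi(s') - Phi(s).
    return GAMMA * potential(s_next) - potential(s)

def discounted_return(states, rewards, shaped=False):
    """states has one more entry than rewards: s_0, ..., s_T."""
    total = 0.0
    for t, r in enumerate(rewards):
        if shaped:
            r = r + shaping(states[t], states[t + 1])
        total += GAMMA ** t * r
    return total

def telescoped_offset(states):
    T = len(states) - 1
    return GAMMA ** T * potential(states[-1]) - potential(states[0])

states = [0, 1, 2, 3]        # walk straight to the goal
rewards = [0.0, 0.0, 1.0]    # sparse: +1 only on reaching the goal

plain = discounted_return(states, rewards)
dense = discounted_return(states, rewards, shaped=True)
print(round(plain, 2))                       # 0.81
print(round(dense, 2))                       # 3.81
print(round(telescoped_offset(states), 2))   # 3.0
```

The difference between shaped and unshaped return depends only on the start state, the end state, and the trajectory length, not on the path taken. That path independence is the property that lets this form of shaping add guidance while leaving the comparison between behaviors unchanged, in contrast to an arbitrary bonus that an agent could farm.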

Included

7 videos, 7 readings, 1 assignment

7 videos (total 38 minutes)

Module Introduction (3 minutes)
Making Learning Signals More Informative (5 minutes)
Densifying Rewards While Preserving Optimality (5 minutes)
Inferring Rewards From Expert Behavior (3 minutes)
Learning Objectives From Comparisons (7 minutes)
Inferring Structured Reward Logic (6 minutes)
Faithfulness, Robustness, and Specification Gaming (9 minutes)

7 readings (total 45 minutes)

Reward Shaping (10 minutes)
Potential-Based Shaping (10 minutes)
Inverse Reinforcement Learning (8 minutes)
Preference-Based Reinforcement Learning (5 minutes)
Learning Reward Machines (5 minutes)
Auditing Learned Rewards (5 minutes)
Summary (2 minutes)

1 assignment (total 45 minutes)

Reward Shaping and Learning Rewards (45 minutes)

This module studies reward programming when the standard fully observable MDP abstraction is not rich enough. In a standard MDP, the agent observes the state, chooses an action, receives reward, and transitions to a new state. This abstraction is powerful, but many reward-programming problems require more structure. Sometimes reward design is hard because the environment model is missing the right structure. The agent may not observe the full state. The task may require memory, hierarchy, or recursion. Other agents may affect the outcome. Or the system may evolve in continuous time according to physical dynamics. In such cases, a reward that looks complicated may be a symptom of an inadequate model. This module studies richer abstractions for reward programming: partially observable MDPs, beliefs and memory, hierarchical and recursive task structure, multi-agent interaction, and continuous-time dynamical systems. The goal is not to always use the most expressive model. The goal is to choose the simplest model that faithfully captures the task, supports learning, and remains transparent enough to audit.
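
When the agent does not observe the full state, a belief (a probability distribution over hidden states) can serve as the information state on which rewards and policies are defined. The following is a minimal sketch of the standard Bayes belief update, b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s), under assumed names: the two hidden states `left` and `right`, the `listen` action, the 0.85 observation accuracy, and the dictionary layout of `T` and `O` are hypothetical and not taken from the course.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes filter step. T[(s, a)][s2] and O[(s2, a)][o] are probabilities.

    Assumes the observation has non-zero probability under the prediction.
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        predicted = sum(T[(s, action)][s2] * belief[s] for s in states)
        unnormalized[s2] = O[(s2, action)][observation] * predicted
    z = sum(unnormalized.values())
    return {s: v / z for s, v in unnormalized.items()}

# Hypothetical model: listening never changes the hidden state, and the
# observation names the correct side with probability 0.85.
T = {
    ("left", "listen"): {"left": 1.0, "right": 0.0},
    ("right", "listen"): {"left": 0.0, "right": 1.0},
}
O = {
    ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear_left", T, O)
b2 = belief_update(b1, "listen", "hear_left", T, O)
print(round(b1["left"], 3))   # 0.85
print(round(b2["left"], 3))   # 0.97
```

A reward that looks history-dependent in terms of raw observations can become a simple function of the belief, which is one reading of the module's point that a complicated reward may signal an inadequate model.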

Included

7 videos, 7 readings, 1 assignment

7 videos (total 46 minutes)

Module Introduction (4 minutes)
When the Agent Does Not See the Full State (8 minutes)
Information States For Reward Programming (7 minutes)
Reward Structure for Composed Tasks (8 minutes)
Objectives Under Strategic Interaction (7 minutes)
Reward Programming Beyond Discrete Time (9 minutes)
Summary and Further Readings (3 minutes)

7 readings (total 49 minutes)

Partial Observability (12 minutes)
Histories, Beliefs, and Memory (8 minutes)
Hierarchy and Recursion (12 minutes)
Multi-Agent Reward Programming (5 minutes)
Continuous-Time and Dynamical Systems (5 minutes)
Choosing the Right Abstraction (5 minutes)
Summary (2 minutes)

1 assignment (total 45 minutes)

Partial Observability (45 minutes)

Instructor

Ashutosh Trivedi

University of Colorado Boulder

3 courses, 60 learners

Offered by

University of Colorado Boulder

Frequently Asked Questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.