University of Colorado Boulder

Reward Programming: Optimizing RL Efficiency and Safety

Gain insight into a topic and learn the fundamentals.
Intermediate level

1 week to complete
under 10 hours per week
Flexible schedule
Learn at your own pace

What you'll learn

  • Identify limitations of standard scalar reward formulations, including reward hacking, specification gaming, and brittle proxies.

  • Express structured learning objectives using formal tools such as temporal logic, automata, and reward machines.

  • Construct and analyze reward mechanisms based on temporal logic, automata, product MDPs, reward machines, and reward shaping.

  • Model reward-programming problems under hidden state, memory, hierarchy, multiagent interaction, and continuous-time dynamics.

Skills you'll gain

  • Agentic Systems
  • Machine Learning Methods
  • Safety and Security
  • Machine Learning
  • Model Evaluation
  • Theoretical Computer Science
  • Functional Specification
  • Verification and Validation
  • Computational Logic
  • Reinforcement Learning
  • Continuous Monitoring
  • Model Optimization
  • Markov Model
  • Responsible AI

Tools you'll learn

  • AI Workflows

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

July 2026

Assessments

6 assignments

Taught in English


Build your subject-matter expertise

This course is part of the "Foundations of Reinforcement Learning" Specialization.
When you enroll in this course, you are also enrolled in this Specialization.
  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate

There are 5 modules in this course

This module introduces reward engineering as the problem of translating designer intent into an objective signal that a reinforcement learning agent can optimize. Classical reinforcement learning often assumes that the reward function is already given. In practice, however, the reward must be designed, specified, shaped, inferred, or audited. We begin with the idea of programming by rewards: instead of programming each action directly, the designer specifies an objective signal and the agent learns behavior by optimizing that signal. This is powerful, but fragile. We then study reward hacking and specification gaming, where high reward does not imply intended behavior. The module then explains why many objectives cannot be faithfully expressed as simple one-step scalar rewards. This motivates Markovian and non-Markovian rewards, and finally a formal-methods perspective based on specifications, monitors, and product MDPs. The goal is to understand why reward design is not merely a numerical tuning problem. It is a specification problem.
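The gap between high reward and intended behavior can be made concrete with a small Python sketch. The cleaning scenario, the event names `collect` and `dump`, and the function `run` are illustrative assumptions, not material taken from the course. The proxy reward pays +1 for every unit of dirt collected, while the designer's actual intent is that the room ends up clean.

```python
# Toy illustration of specification gaming (hypothetical scenario).
# Proxy reward: +1 for each unit of dirt collected.
# Intended objective: the room is clean when the episode ends.

def run(events, initial_dirt=2):
    """Replay a list of events and return (proxy_return, room_is_clean)."""
    dirt = initial_dirt
    proxy_return = 0
    for event in events:
        if event == "collect" and dirt > 0:
            dirt -= 1
            proxy_return += 1      # the proxy only sees collection events
        elif event == "dump":
            dirt += 1              # dumping is never penalized by the proxy
    return proxy_return, dirt == 0

honest = ["collect", "collect"]
gaming = ["collect", "dump"] * 3

print(run(honest))  # (2, True): lower return, intended behavior
print(run(gaming))  # (3, False): higher return, room still dirty
```

The gaming trajectory earns more proxy reward than the honest one while failing the intended objective. Whether the room ends clean is a property of the whole trajectory, which a per-step bonus on collection does not capture.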

What's included

7 videos, 9 readings, 2 assignments

This module introduces temporal logic as a specification language for reinforcement learning. In earlier modules, rewards were treated as numerical signals used to guide behavior. But many intended objectives are not simply properties of one transition. They are properties of whole trajectories: whether something eventually happens, whether something bad is always avoided, whether every request is eventually served, or whether some desirable condition recurs over time. We begin by translating reinforcement-learning trajectories into labeled traces using atomic propositions. We then introduce the syntax and semantics of linear temporal logic. After that, we study common temporal specification patterns, including reachability, safety, response, recurrence, and persistence. The module then explains ω-regular objectives, Büchi automata, product MDPs, and why the choice of automaton matters for reinforcement learning. The goal is not to become logicians. The goal is to gain a precise language for describing trajectory-level behavior in reinforcement learning.
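The five specification patterns named above are commonly written in linear temporal logic as F p (reachability), G ¬b (safety), G (r → F s) (response), G F p (recurrence), and F G p (persistence). The following Python sketch checks them on an ultimately periodic trace, that is, a finite prefix followed by a cycle that repeats forever. The function names and the proposition names `request`, `serve`, and `crash` are assumptions chosen for illustration; the course may use different notation or tooling.

```python
# Minimal sketch: evaluating common LTL patterns on a "lasso" trace.
# A trace is prefix + cycle + cycle + ..., where each position is a set of
# atomic propositions that hold there. The cycle must be non-empty.

def eventually(p, prefix, cycle):            # F p (reachability)
    return any(p in label for label in prefix + cycle)

def always_not(b, prefix, cycle):            # G not b (safety)
    return all(b not in label for label in prefix + cycle)

def recurrence(p, prefix, cycle):            # G F p: p holds infinitely often
    return any(p in label for label in cycle)

def persistence(p, prefix, cycle):           # F G p: p holds forever from some point
    return all(p in label for label in cycle)

def response(r, s, prefix, cycle):           # G (r -> F s)
    if any(s in label for label in cycle):
        return True                          # s recurs, so every r is answered
    if any(r in label for label in cycle):
        return False                         # r recurs but s never does
    # r and s can only occur in the prefix: each r needs a later (or same-step) s
    return all(
        any(s in later for later in prefix[i:])
        for i, label in enumerate(prefix)
        if r in label
    )

prefix = [{"request"}, set()]
cycle = [{"serve"}, {"request"}]

print(eventually("serve", prefix, cycle))           # True
print(always_not("crash", prefix, cycle))           # True
print(recurrence("serve", prefix, cycle))           # True
print(persistence("serve", prefix, cycle))          # False
print(response("request", "serve", prefix, cycle))  # True
```

Restricting attention to lasso-shaped traces keeps the checks finite even though the properties speak about infinite behavior: anything that must happen infinitely often has to happen inside the cycle.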

What's included

7 videos, 8 readings, 1 assignment

This module develops reward machines as finite-state reward specifications for reinforcement learning. Module 2 introduced temporal logic, ω-regular objectives, Büchi automata, product MDPs, and the limit-reachability bridge from temporal specifications to model-free RL. We now focus on a closely related, but more reward-centered, object: the reward machine. A reward machine is an automaton-like structure that stores task progress and emits rewards. It is useful when the reward depends not only on the current state and action, but also on what has happened before. For example, whether “deliver” is rewarding may depend on whether the package has already been picked up. Whether entering a room is good may depend on whether the key has already been collected. Whether a request has been handled may depend on which request is currently pending. The key idea is simple: environment state + reward-machine state = state with task progress. The environment state records where the agent is. The reward-machine state records what the reward specification remembers. This module treats reward machines as structured reward programs. We define reward machines, explain how they represent non-Markovian rewards, construct product MDPs with reward-machine state, and study examples for common task patterns. We then explain how reward-machine structure supports decomposition, counterfactual experience, interpretability, and auditing. The final lesson surveys richer reward-machine models, including counting, timed, continuous-time, and physics-informed reward machines. The goal is not to repeat the automata theory from Module 2. The goal is to understand reward machines as a practical specification device: a finite-state way to write rewards that depend on task progress.
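The pickup-then-deliver example can be sketched as a small reward machine in Python. The state names (`start`, `carrying`, `done`), the transition table, the environment state names, and the helper functions are hypothetical choices made for illustration, not the course's own definitions.

```python
# Minimal sketch of a reward machine for "pick up the package, then deliver it".
# Keys are (machine state, proposition); values are (next machine state, reward).
TRANSITIONS = {
    ("start", "pickup"): ("carrying", 0.0),
    ("carrying", "deliver"): ("done", 1.0),
}

def rm_step(u, labels):
    """Advance the reward machine on the propositions true at this step."""
    for prop in sorted(labels):
        if (u, prop) in TRANSITIONS:
            return TRANSITIONS[(u, prop)]
    return u, 0.0  # no matching transition: stay put, emit zero reward

def run(trace, u="start"):
    """Return the product states (env state, machine state) and the rewards."""
    product_states, rewards = [], []
    for env_state, labels in trace:
        u, r = rm_step(u, labels)
        product_states.append((env_state, u))
        rewards.append(r)
    return product_states, rewards

trace = [
    ("depot", {"deliver"}),   # delivering before pickup earns nothing
    ("shelf", {"pickup"}),
    ("hall", set()),
    ("depot", {"deliver"}),   # same environment state, now rewarding
]
product_states, rewards = run(trace)
print(rewards)             # [0.0, 0.0, 0.0, 1.0]
print(product_states[-1])  # ('depot', 'done')
```

The agent visits the environment state `depot` twice and receives different rewards, so the reward is non-Markovian in the environment state alone. Over the product state, which pairs the environment state with the machine state, the reward depends only on the current product state and the observed labels.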

What's included

6 videos, 7 readings, 1 assignment

This module studies how rewards can be made more informative, inferred from evidence, and audited for alignment with intended behavior. So far, we have studied how to specify structured objectives using temporal logic, automata, product MDPs, and reward machines. We now ask a complementary question: how can rewards be shaped, inferred, or learned? A reward can be correct but still difficult to learn from. For example, a sparse reward that gives +1 only when a task is completed may faithfully describe success, but it may provide almost no guidance during early learning. Reward shaping adds additional feedback to make learning easier. However, shaping must be designed carefully: a poorly chosen shaping reward can change the incentives and create new specification-gaming opportunities. The module then turns to methods for learning rewards from evidence. Inverse reinforcement learning infers rewards from demonstrations. Preference-based reinforcement learning infers rewards from comparisons or human feedback. Automata-learning methods can infer structured reward machines from interaction. The final lesson studies how learned rewards should be audited for faithfulness, robustness, and reward hacking. The goal is to understand reward programming as both a design problem and an inference problem.
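One well-known way to add shaping feedback without changing which policies are optimal is potential-based shaping, where the extra term has the form γ·Φ(s') − Φ(s) for some potential function Φ over states. The module description does not say which shaping schemes the course covers, so the Python sketch below is an assumption-laden illustration: the five-state chain, the goal at state 4, the discount factor, and the distance-based potential are all hypothetical.

```python
# Minimal sketch of potential-based reward shaping on a toy chain 0..4.
GAMMA = 0.9
GOAL = 4

def sparse_reward(s, s_next):
    """+1 only on reaching the goal: faithful to success, but uninformative."""
    return 1.0 if s_next == GOAL else 0.0

def potential(s):
    """Hypothetical potential: negative distance to the goal."""
    return -abs(GOAL - s)

def shaped_reward(s, s_next):
    """Sparse reward plus the potential-based term gamma * phi(s') - phi(s)."""
    return sparse_reward(s, s_next) + GAMMA * potential(s_next) - potential(s)

def discounted_return(states, reward_fn):
    steps = zip(states, states[1:])
    return sum(GAMMA ** t * reward_fn(s, s2) for t, (s, s2) in enumerate(steps))

path = [0, 1, 2, 3, 4]
gap = discounted_return(path, shaped_reward) - discounted_return(path, sparse_reward)

# The shaping terms telescope: the gap depends only on the endpoints and length.
horizon = len(path) - 1
expected_gap = GAMMA ** horizon * potential(path[-1]) - potential(path[0])
print(abs(gap - expected_gap) < 1e-9)  # True
```

Because the shaping terms telescope, the return is shifted by an amount that depends only on where the trajectory starts and ends and on its length, not on how the agent got there. A shaping bonus that is not of this form, such as a flat bonus for moving toward the goal, can instead reward loops and so create the specification-gaming opportunities the paragraph warns about.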

What's included

7 videos, 7 readings, 1 assignment

This module studies reward programming when the standard fully observable MDP abstraction is not rich enough. In a standard MDP, the agent observes the state, chooses an action, receives reward, and transitions to a new state. This abstraction is powerful, but many reward-programming problems require more structure. Sometimes reward design is hard because the environment model is missing the right structure. The agent may not observe the full state. The task may require memory, hierarchy, or recursion. Other agents may affect the outcome. Or the system may evolve in continuous time according to physical dynamics. In such cases, a reward that looks complicated may be a symptom of an inadequate model. This module studies richer abstractions for reward programming: partially observable MDPs, beliefs and memory, hierarchical and recursive task structure, multi-agent interaction, and continuous-time dynamical systems. The goal is not to always use the most expressive model. The goal is to choose the simplest model that faithfully captures the task, supports learning, and remains transparent enough to audit.
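When the agent cannot observe the full state, a standard construction is to maintain a belief, a probability distribution over hidden states that is updated after each action and observation. The Python sketch below shows one Bayes-filter update for a two-state problem. The hidden states `left` and `right`, the static transition model, and the 0.8 sensor accuracy are hypothetical values chosen for illustration.

```python
# Minimal sketch of a discrete belief update (Bayes filter) for a POMDP.
# b'(s') is proportional to O(o | s') * sum_s T(s' | s) * b(s).

def belief_update(belief, transition, obs_likelihood):
    """belief: {s: prob}; transition: {s: {s_next: prob}} for the chosen action;
    obs_likelihood: {s_next: P(observation | s_next)} for the received observation."""
    predicted = {
        s2: sum(transition[s][s2] * belief[s] for s in belief)
        for s2 in belief
    }
    unnormalized = {s: obs_likelihood[s] * predicted[s] for s in predicted}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical setup: a key is hidden on the left or right and does not move.
stay = {"left": {"left": 1.0, "right": 0.0}, "right": {"left": 0.0, "right": 1.0}}
# The sensor reports the correct side with probability 0.8; it just said "left".
saw_left = {"left": 0.8, "right": 0.2}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, stay, saw_left)
b2 = belief_update(b1, stay, saw_left)
print(round(b1["left"], 3), round(b2["left"], 3))  # 0.8 0.941
```

A reward that looks history-dependent in terms of raw observations can often be written as a simple function of the belief, which is one sense in which a complicated reward may be a symptom of an inadequate model.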

What's included

7 videos, 7 readings, 1 assignment


Instructor

Ashutosh Trivedi
University of Colorado Boulder
2 courses, 47 learners
