When we learned to walk, swim, or play tennis, we solved sequential decision-making problems, and in all these cases direct experience played a fundamental role. In fact, we face sequential decision-making problems every day: an average adult is estimated to make more than 35,000 decisions a day.

Consider a child who learns to walk. Most of us learned to walk at about one year of age. At that age, no one has explicit knowledge of physics, and the ability to understand parental suggestions is limited: we learn to walk primarily through trial and error. Driven by the desire to reach some object, we try to get up on two feet and take our first steps. In the beginning this leads to falls, which teach us which movements are wrong. By trying again and again, our brain learns which commands it needs to send to our muscles to reach the goal most effectively.

This is the same principle behind reinforcement learning. Reinforcement learning is the branch of machine learning that aims at solving sequential decision-making problems. Unlike traditional methods, reinforcement learning techniques do not require prior knowledge of the dynamics of the problem: they learn to solve the decision-making problem through direct interaction between the agent and the environment in which it operates.

The reinforcement learning process can be modeled as an iterative loop that works as follows:
1. The agent observes the environment to understand what state it is in.
2. It decides what to do and executes an action.
3. As a result of this action, the environment changes and the agent receives a reward signal: a value that specifies the immediate utility of the executed action in that specific state.
4. This completes one iteration; the next one starts from the observations gathered in the new state of the environment.
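The loop above can be sketched in a few lines of Python. This is a minimal illustration under assumptions of my own: the one-dimensional GridEnvironment and the random_agent policy are hypothetical names invented for this example, not part of any specific library or of the lesson itself.

```python
import random

class GridEnvironment:
    """A toy 1-D world: the agent starts at position 0 and wants to
    reach position 4. Hypothetical example, not a real library API."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); reaching the
        # goal at position 4 yields reward 1.0, every other step 0.0.
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def random_agent(state):
    # A placeholder policy: it ignores the state and acts at random.
    return random.choice([-1, 1])

# The interaction loop: observe, act, receive a reward, repeat.
env = GridEnvironment()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random_agent(state)            # decide what to do
    state, reward, done = env.step(action)  # the environment changes
    total_reward += reward                  # collect the reward signal
print(total_reward)  # prints 1.0 once the goal is reached
```

A learning agent would replace random_agent with a policy that improves from the observed rewards, which is exactly what the algorithms discussed later aim to do.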
The agent’s goal is not to select the action that attains the maximum immediate reward, but to collect as much reward as possible over a given time horizon. We often need to sacrifice immediate rewards to achieve a long-term goal. When rewards are temporally delayed, we face a credit assignment problem, which slows down the learning process: we need to estimate which of the actions executed by the agent allowed it to obtain a certain reward.

Recently, thanks also to the combination with deep learning techniques, reinforcement learning algorithms have achieved some of the most important milestones in the history of Artificial Intelligence. In 2016, the AlphaGo algorithm, developed by Google DeepMind, defeated the 18-time world champion Lee Sedol at the board game Go with a score of 4 to 1. This was an important and unexpected result: most AI experts had forecast such a result no earlier than 2050, given the complexity of the game, which has more than 10^170 board configurations. AlphaGo achieved this exceptional result by combining deep reinforcement learning and planning techniques. Since 2016, researchers have developed new algorithms, such as AlphaZero and MuZero, able to defeat AlphaGo at much lower computational cost.

More recently, other successes have been achieved through reinforcement learning. In 2019, OpenAI Five, developed by OpenAI, defeated a team of world champions in the videogame Dota 2. In the same year, AlphaStar, developed by Google DeepMind, reached the Grandmaster level in the videogame StarCraft II, ranking among the top 0.2 percent of human players. These are two very complex strategy games that present several challenges for Artificial Intelligence algorithms, such as real-time decisions, partial observability, and high-dimensional action spaces.
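The trade-off between immediate and delayed reward can be made concrete with a small calculation. A common way to score a sequence of rewards is the discounted return, where a discount factor gamma weights later rewards less; the factor gamma and the two reward sequences below are illustrative assumptions, not values from the lesson.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards weighted by gamma**t. With gamma < 1, earlier
    rewards count slightly more, but a large delayed reward can still
    dominate a small immediate one."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A greedy choice: a small immediate reward, nothing afterwards.
greedy = [1.0, 0.0, 0.0, 0.0]
# A patient choice: sacrifice now for a larger delayed reward.
patient = [0.0, 0.0, 0.0, 5.0]

print(discounted_return(greedy))   # prints 1.0
print(discounted_return(patient))  # prints about 4.85
```

An agent maximizing the return therefore prefers the patient sequence, even though its immediate reward is zero; estimating which early actions made the delayed 5.0 possible is precisely the credit assignment problem.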
These characteristics are present in many real-world decision-making problems, and these recent successes show that reinforcement learning algorithms could help us make complex decisions in the near future. Nonetheless, it is worth noting that these results were obtained at the cost of very expensive simulations that generated far more experience than humans require. For instance, AlphaStar was trained for more than six weeks, generating the equivalent of over 100,000 years of StarCraft playtime. Such an amount of experience is only feasible in simulated problems and with high-performance computing infrastructures. The challenge now is to develop reinforcement learning algorithms that can learn from the amount of experience typically used by humans. To achieve this goal, the inductive approach that characterizes machine learning needs to be combined with deductive Artificial Intelligence approaches based on knowledge of the model.

To determine which reinforcement learning techniques should be considered, it is important to correctly classify the decision-making problem and formalize it. To classify a problem, we need to know: whether the observations and the possible actions are finite or continuous; whether the agent can observe the full state of the problem or only partial observations; whether the state transitions are deterministic or stochastic; whether the problem is stationary or nonstationary; and whether the environment contains a single agent or other learning agents.

For instance, the Rubik's cube is a sequential decision-making problem with finite actions and finite states, which are fully observable. Moreover, the effect of the actions on the state transitions is deterministic, the game is stationary since it doesn't change over time, and there is a single agent.
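The classification dimensions above can be captured as a simple checklist. This is a hypothetical sketch of my own (the ProblemProfile class and its field names are invented for illustration), filled in for the Rubik's cube as classified in the text; the same checklist can be filled in for any other decision-making problem.

```python
from dataclasses import dataclass

@dataclass
class ProblemProfile:
    """Hypothetical checklist for classifying a sequential
    decision-making problem along the dimensions discussed above."""
    finite_states: bool      # finite vs. continuous state space
    finite_actions: bool     # finite vs. continuous action space
    fully_observable: bool   # full state vs. partial observations
    deterministic: bool      # deterministic vs. stochastic transitions
    stationary: bool         # stationary vs. nonstationary dynamics
    single_agent: bool       # single agent vs. other learning agents

# The Rubik's cube: finite, fully observable, deterministic,
# stationary, single-agent.
rubiks_cube = ProblemProfile(
    finite_states=True, finite_actions=True, fully_observable=True,
    deterministic=True, stationary=True, single_agent=True)
```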
If we consider a poker game instead, we have finite actions and states, but the states are partially observable, as we cannot observe the private cards of our opponents. The state transitions are stochastic, since the cards are drawn randomly from the deck; the problem is stationary; and it involves multiple agents that are learning concurrently. In the next lesson, we will analyze Markov decision processes and discuss how they can be used to formalize sequential decision-making problems and how it is possible to solve them.