Hello, I'm Juan Vergara. Welcome back. In this lesson, we will dive deep into training brains with data.

So far, we have seen the different types of simulations available. In machine teaching, the simulation is the key piece for training brains. In some cases, though, the need to optimize a given control problem arises before a simulation for that problem exists. When an environment doesn't have a simulation, the first question to ask ourselves is: how much time would it take to build a trustworthy simulation? If the answer is one to two months, then opt for building the simulation. First-principles simulations are better for training since they allow the brain to explore freely.

Unfortunately, building an accurate simulation is sometimes too difficult or time-consuming; uncertain disturbances or overly complex dynamics can make such a simulation hard to build. The question to ask then is: do we have historical data available? Are we recording the operators' control actions and saving them somewhere that can be accessed? If we have enough data, then we can build a data-driven model. These models are useful when the subject matter experts do not have a full sense of the dynamics of the system. To build a first-principles model, you need an understanding of the environment deep enough to code the transition into a new state based on the previous state-action pair.

One of the greatest Bonsai success stories is the collaboration with PepsiCo to control the production of Cheetos. Our goal was to build a brain that would control the machine to optimize for Cheetos puffiness, size, and curvature. The effect of temperature and time on the final puffiness, size, and curvature is only understood by subject matter experts, and these dynamics are not easily translated into an equation-based model. For the Cheetos use case, the decision was clear: we needed to build a data-driven model and train the brain around the experts' region of operation.

Before starting any new data-driven model, we need to verify that a model can be built. With a couple of quick checks, we can validate this upfront without much time investment. First, do we have enough historical data points at whichever control frequency we expect the brain to make decisions? Fewer than 50,000 rows is considered insufficient data. Second, have we recorded all the variables that operators rely on to control the environment? Any unrecorded input can make your teaching framework for the brain unusable; every input the operators use must be recorded, both for brain accountability during training and for deployment. Say the operators rely on their vision to control the environment: then you will most likely need to embed a computer vision model that extracts whichever metrics the operators rely on. Third, is the historical control static, or is there control variability across operators? Since the brain relies on the data-driven model, we shouldn't trust any recommendation that is too far from the operating region present in the historical data. Thus, the more variability we have in the data, the more regions in which we can be confident the data-driven model is accurate in its predictions. Lastly, do we have all scenarios of interest for brain control in our historical dataset? Our data-driven model will not be able to predict scenarios that are not present during model building, so brains will only be trainable on existing scenarios. (These four checks are sketched in code at the end of this lesson.)

These quick checks will help you perform a preliminary assessment of a project which has data but no first-principles model. There are further guidelines that we have identified over time for validating a project's data, but we will not dive any deeper here, since assessing the feasibility of a project based on the available data is a topic of its own.

Once you have vetted that the modeling time is worth investing given the available data, you can move ahead and build the model. Starting simple often works best: a decision tree or a random forest is a good starting point (see the model sketch below). Note that the Bonsai production empowerment team put together an accelerator to ease the initial steps of data-driven modeling; coding experience is required, since it is all Python-based. Feel free to use it for both data validation and model building.

Once you have built your model, you will want to assess it. There are two assessment metrics to take into consideration. First, iteration predictability, which tells you how your model's predictions differ from the available historical data. This is a necessary preliminary step to ensure your model predicts the next state accurately for any given state-action pair. Second, you want to assess episode predictability. Once your per-iteration predictability is good, you want to look at sequential predictability: how much drift does your model incur when its own predictions are used as input for the next iteration? Given the historical actions for a given set of initial conditions, your model should arrive at the same final state after several concatenated sequential predictions. You want your brain to be able to explore around the historical state-action pairs, not be limited to those only; that is what helps you find where there is room for improvement in the process. Iteration predictability is important since it is the first stage of model understanding. Nonetheless, to be effective at training a brain, you need to check the sequential predictions as well: you must ensure that your model does not drift too far when relying on its own predictions to retrieve the next state. (Both assessments are sketched below.)

Once your data is modeled and your sequential predictions are good, you're ready to train your brain. You want to stop training if the control drifts too far from the original expert control. This ensures that you do not rely on regions where you are unable to assess your data-driven model's predictability, that is, regions that are not part of your historical dataset. Alternatively, you can perform unconstrained brain training and check the deviation from the benchmark control after training is done. The closer the brain's control stays to the experts' historical control, the more we can trust it.
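To make the preliminary checks concrete, here is a minimal sketch of what they might look like in Python with pandas. The file name, column names, and the recipe tag are hypothetical placeholders for your own historical logs; this is not the accelerator's code.

```python
import pandas as pd

# Hypothetical historical log: one row per control iteration,
# sampled at the frequency the brain will control at.
df = pd.read_csv("historical_operations.csv", parse_dates=["timestamp"])

# Check 1: enough rows at the brain's control frequency.
# Fewer than ~50,000 rows is considered insufficient.
print(f"rows: {len(df)} (need >= 50,000)")

# Check 2: every input the operators rely on must be recorded.
required = ["temperature", "speed", "moisture", "puffiness", "size", "curvature"]
missing = [c for c in required if c not in df.columns]
print(f"missing variables: {missing or 'none'}")

# Check 3: control variability across operators. A (near-)constant
# action column means the model only ever sees one operating region.
for action in ["temperature", "speed"]:
    cv = df[action].std() / abs(df[action].mean())
    print(f"{action}: coefficient of variation = {cv:.3f}")

# Check 4: scenario coverage. Every scenario the brain must handle
# should appear in the data, e.g. per product or recipe code.
print(df["recipe"].value_counts())
```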
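With the data vetted, a simple next-state model can be fit. The sketch below frames the history as (state, action) to next-state pairs and fits a scikit-learn random forest, which is one reasonable reading of "start with a tree or a forest"; the column names are the same hypothetical placeholders as above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("historical_operations.csv")

states = ["puffiness", "size", "curvature"]
actions = ["temperature", "speed"]

# Supervised framing: (state_t, action_t) -> state_{t+1}.
X = df[states + actions].iloc[:-1]
y = df[states].shift(-1).iloc[:-1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # keep time order
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```

Keeping `shuffle=False` makes the split chronological, so the held-out data mimics how the model will actually be used: predicting forward in time.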
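Iteration predictability then reduces to one-step-ahead error on the held-out history. Continuing from the previous sketch:

```python
import numpy as np

# One-step-ahead predictions on held-out state-action pairs.
pred = model.predict(X_test)

# Per-state mean absolute error against the recorded next states.
mae = np.mean(np.abs(pred - y_test.to_numpy()), axis=0)
for name, err in zip(states, mae):
    print(f"{name}: one-step MAE = {err:.4f}")
```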
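Episode predictability closes the loop: starting from a recorded initial condition, replay the historical actions while feeding the model its own predictions, then measure how far the final predicted state drifts from the recorded one. A sketch, again continuing from the model above and assuming, purely for illustration, that the first 200 rows form one episode:

```python
import numpy as np

def rollout(model, initial_state, historical_actions):
    """Feed the model its own predictions while replaying
    the recorded operator actions."""
    state = np.asarray(initial_state, dtype=float)
    for action in historical_actions:
        features = np.concatenate([state, action]).reshape(1, -1)
        state = model.predict(features)[0]
    return state

episode = df.iloc[:200]  # one hypothetical episode
final_pred = rollout(
    model,
    episode[states].iloc[0],
    episode[actions].to_numpy()[:-1],
)
drift = np.abs(final_pred - episode[states].iloc[-1].to_numpy())
print(f"end-of-episode drift per state: {drift}")
```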
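Finally, to keep brain training anchored to the region where the model is validated, you can compare the brain's recommendations against the envelope of expert control in the historical data. The `within_expert_region` helper and the 5th/95th-percentile band below are illustrative choices, not a prescribed rule:

```python
import numpy as np

# Historical envelope of expert control, per action variable.
low = df[actions].quantile(0.05)
high = df[actions].quantile(0.95)

def within_expert_region(action_values, tolerance=0.10):
    """Flag brain actions that drift beyond the historical
    operating region (plus a small tolerance band)."""
    span = (high - low).to_numpy()
    lo = low.to_numpy() - tolerance * span
    hi = high.to_numpy() + tolerance * span
    return bool(np.all((action_values >= lo) & (action_values <= hi)))

# Example: check one hypothetical brain recommendation.
recommendation = np.array([180.0, 2.5])  # temperature, speed
print(within_expert_region(recommendation))
```

The same check works for both styles mentioned above: as a stopping condition during training, or as a post-training audit of an unconstrained brain against the benchmark control.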
In the next video, we will dive deep into first-principles simulations. These are always easier to work with, since the underlying equations can help guide your machine teaching experimentation. See you there.