We've been talking about algorithms and how they are increasingly complex, and now we've arrived at the moment where the program can program itself: the type of algorithms that are the most complex and are used very often in machine learning today. Our goals for this lesson are to first define predictive modeling and how it relates to the algorithms we've been talking about, and then to identify the goals of model accuracy. How do we define these models, and then build something that accurately reflects the data we're putting into it? So let's get started by talking about algorithmic complexity. The basic learning we explored in the previous lessons gave the computer a set series of mechanisms, and in the case of sorting, we wanted to find which one works best for the given input. Explicit instructions mean that if something goes wrong with the algorithm, we can go in and see exactly what happened. It looks like it picked this sorting mechanism, which wasn't the best fit, so let's switch to this one. Now, when we get to a more complex algorithm, we start to pull away from our actual understanding of how the model is making decisions. We tell the model, "Hey, invent the best mechanism based on the data and the goal." We can tweak things around, but the end result is that we don't really know how the algorithm decides things. This is how algorithms can become so complex that we lose sight of how they make their decisions. In a basic model, we can view the inputs, the outputs, and the algorithm itself. In the previous example, we could take those 10 sorting mechanisms, dive in, remove the two that keep causing problems, and then go back and figure out how to make a better algorithm. In a complex model, however, we can view the inputs and the outputs, but we can't actually view the algorithm. It is complex in nature. The end result is that we don't exactly know how something is being decided.
The 3D representation here, even though it can be a little confusing, makes sense if you think about it. A basic graph of, let's say, 500 points in space is a lot easier to understand in 2D than in 3D. If you try to draw a line through that graph, all of a sudden in 3D it becomes very hard to understand. That's a good analogy for what researchers experience when they're building algorithms this complex. They don't exactly know what decisions are being made to get that output, but they do have control over the inputs and how the algorithm is built. This idea of an algorithm that derives itself is called modeling. It's a complex learning algorithm, and it has a couple of different purposes. But at the heart of a model is an algorithm that's automatically derived from data. We focus on the data we give this model and then adjust what the output looks like. This is a great way of thinking about the concept as it relates to the first sets of algorithms we talked about. With the output itself being an algorithm, we have less control over it; we can't explicitly program it. So let's talk about the difference between developing an algorithm and developing a model. Developing an algorithm is, again, writing those explicit instructions: step 1, step 2, step 3, then repeat steps 1 through 3 with this variable changed. Then we evaluate the performance. That algorithm is doing a pretty great job sorting those numbers, and then we adjust the instructions to see how that affects the output. With developing a model, we go back to that 3D analogy: we write the instructions on how to model the data, we feed the data to this algorithm, and then it does its thing. We evaluate its performance and adjust the modeling to see how that affects the output after the fact. But we don't actually know how it produced that performance.
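The contrast between the two approaches can be sketched in a few lines of code. This is just an illustration, not anything from the lesson: the sorting routine stands in for an explicitly programmed algorithm, and the line-fitting routine stands in for a model, where the "rule" (a slope and intercept) is derived from data rather than written by hand.

```python
# Developing an algorithm: the rule is written explicitly by the programmer.
def sort_numbers(values):
    # Explicit instructions: repeatedly take the smallest remaining value.
    remaining = list(values)
    result = []
    while remaining:
        smallest = min(remaining)
        remaining.remove(smallest)
        result.append(smallest)
    return result

# Developing a model: we only supply data; the "rule" (slope and intercept)
# is derived from the data by least squares, not written by hand.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(sort_numbers([3, 1, 2]))                 # -> [1, 2, 3]
print(fit_line([1, 2, 3], [2, 4, 6]))          # -> (2.0, 0.0)
```

In the first function we can point to the exact line responsible for any behavior; in the second, the behavior lives in numbers the data produced, which is the loss of visibility the lesson describes.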
We'll talk later on about how we can figure out what exactly the model is doing and how that relates to ethics. But for now, remember that the model is actually developing itself. It is a form of self-programming. We tell the computer how to handle the data, and then we watch it make its own program. So why is this so useful, and why do we put these complex models into production? It's really all about making predictions. Let's look at a scenario that you might see in a research lab at a university. The question posed by a real estate friend is, "Where are the next new homes going to be built in our town?" How can we predict that? You can imagine how useful that prediction would be for planning new roads and new sewer systems; you need to predict where the new homes in this town are going to be built. So when it comes to building a model, we're given data that we need to figure out how to model into something that will create a prediction. The first dataset we're given in this example is new home construction, by square feet and address, for the past 30 years. That's a pretty big dataset, depending on the size of the town. Dataset 2 is the list of potential developed lots by address, that is, the potential areas where homes could be built. The input for this model will eventually be the potential size, in square feet, of a home to be developed, and the output will be the model's prediction of where that home is going to be built. This is a pretty complex thing to think through, so let's break down how we'd actually build this model. We start in the develop phase. We want to specify the type of modeling we would use to build this. Don't worry about the actual types of modeling for now; backpropagation and supervised learning are examples, and we'll talk about those and their trade-offs later on.
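The model's interface described above, square footage in, predicted address out, can be sketched as follows. All of the records, addresses, and the trivial nearest-neighbor rule are invented for illustration; the point is only the shape of the interface, not a real learned model.

```python
# Hypothetical dataset 1: historical new-home construction records,
# as (square_feet, address) pairs. Addresses are made up.
history = [
    (1800, "12 Oak St"),
    (2400, "7 Elm Ave"),
    (2100, "3 Birch Rd"),
]

def build_model(records):
    # Derive a predictor from data: for a requested home size, return
    # the address of the most similar historical home. A stand-in for
    # whatever rule a real learning algorithm would derive.
    def predict(square_feet):
        closest = min(records, key=lambda r: abs(r[0] - square_feet))
        return closest[1]
    return predict

model = build_model(history)
print(model(2000))  # closest historical size is 2100 -> "3 Birch Rd"
```

Notice that `build_model` returns a function: the output of the development process is itself an algorithm, which matches the lesson's point that we control the inputs and the modeling approach but not the derived rule directly.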
But for now, just note that step 1 is essentially us telling the model, "Hey, we're going to hand you this type of data, and we'd like you to approach it in this way." This is how, as researchers and programmers, we have influence over how the model is built. Then, of course, we need to feed it the data, so we need to clean and format that data. You'd be surprised how much time it can take to clean, format, and scrub the data to make sure it's accurate before we give it to the model, and that can be a big source of inaccuracy. Then we plug the first 10 years of data into the model and let it build itself. Now that the model is developed, we get to the next phase, which is the training phase. We've developed it; now we train it. This is that level of abstraction we were talking about earlier. We don't actually know how the program works or what it's doing, so instead we train it much like you'd train a pet: you reinforce behavior you'd like to see and discourage behavior you don't want to see. So how do we do this? Step 1 is that the model will attempt to predict. We take a dataset and withhold the actual end result. We feed it some square footage and say, where do you think this home would go? Then we compute an error value: the predicted address versus the actual address. This is where we can see, okay, is the model actually getting things right? Is it coming close? Is it two blocks away or 30 blocks away from where the actual home ended up? As we repeat that process, we start to train the model. We want it to find what's called a better curve. For now, you can just think of that as the model figuring out a better way to predict things so that the error rate is as low as possible. It's really helpful in the training phase to plug the next 10 years of data into the model and see how low it can get that error rate.
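The "two blocks away or 30 blocks away" error measurement above can be sketched concretely. Everything here is hypothetical: locations are invented (x, y) block coordinates, and the "model" is a deliberately naive predictor, but the shape of the loop (predict, compare against the withheld answer, average the error) is the training idea the lesson describes.

```python
def block_distance(a, b):
    # Error function: how many blocks off was the prediction?
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Withheld training examples: square footage -> where the home was
# actually built (hypothetical block coordinates).
training = [
    (1900, (4, 2)),
    (2600, (9, 7)),
]

def evaluate(model, examples):
    # Average error across the examples; a lower number means the model
    # is predicting closer to where homes were really built.
    errors = [block_distance(model(sqft), actual) for sqft, actual in examples]
    return sum(errors) / len(errors)

# A deliberately naive "model" that always predicts the town center.
def always_center(sqft):
    return (5, 5)

print(evaluate(always_center, training))  # -> 5.0 blocks off on average
```

Training then amounts to adjusting the model so that this average error drops on each pass through the data.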
Once a model is sufficiently trained, we get to the deployment phase. The model returns an error rate that's low enough to predict, okay, these homes are going to be built in this area, and then we can start to actually use it in production, in this case in our university lab setting, where we plug in potential home sizes and the model returns possible lot locations. So all you really need to take away here is that we develop, train, and deploy a predictive model, with the goal in a lab setting of making useful predictions, however complex the underlying algorithm may be. The final goal of building a predictive model in this lab setting is making sure it reflects actual data. We've deployed it, and now we evaluate its performance. At this stage, we would take the final 10 years of data, plug it into the model, and compare the predicted lot locations with the actual locations where the homes ended up. As researchers, we'd then go figure out better training methods and better tests to write, and ultimately look at the error rates. The error rate is pretty much everything here: did we get it from a five percent error rate down to a one percent error rate? That's our barometer of success. So that's a little bit about predictive models. At this stage, we're going to start transitioning into using models outside of the lab setting, but for now, that will do it, and we'll see you in the comments section.
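The final evaluation step, comparing predictions against the held-out final years of data and reading off an error rate, can be sketched like this. The addresses and results are invented for illustration; the only thing being shown is how a single percentage becomes the barometer of success.

```python
def error_rate(predictions, actuals):
    # Fraction of predictions that missed the actual lot.
    misses = sum(1 for p, a in zip(predictions, actuals) if p != a)
    return misses / len(actuals)

# Hypothetical held-out data: model predictions vs. where homes were
# actually built in the final 10 years.
predicted = ["5 Maple Ct", "9 Cedar Ln", "14 Pine Dr", "9 Cedar Ln"]
actual    = ["5 Maple Ct", "14 Pine Dr", "14 Pine Dr", "9 Cedar Ln"]

rate = error_rate(predicted, actual)
print(f"{rate:.0%} error rate")  # one miss out of four -> 25% error rate
```

If that number is too high, we go back to the develop and train phases, adjust the modeling or the data, and measure again; the held-out data never changes, so the error rate stays an honest yardstick.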