In the last lesson, we discussed artificial neurons. We talked about how they're structured and how we can train them using gradient descent, iteratively changing the weights until we find the weights that minimize the cost. Artificial neurons are powerful, but their power is limited because they can only handle problems with linear decision boundaries. Researchers knew as early as the 1950s that adding more neurons to form a network would allow us to perform more complex calculations with non-linear decision boundaries. However, it wasn't until the 1980s, when the backpropagation method was popularized, that we really had a good way to train these neural network models with multiple layers.

Let's look at what happens when we stack multiple perceptrons together. We could stack them in a couple of ways. We could take two perceptrons and put them side by side, but we could also take the outputs of those two perceptrons and feed them into yet another perceptron, which goes through its own calculation and generates a final output. It turns out that when we stack perceptrons together like this, we can perform much more complex calculations than we could using only a single artificial neuron.

Let's look at an example to illustrate this. Say we're trying to build a model for a binary classification task, where the output is either plus 1 or minus 1, and the input has two features, X_1 and X_2. The decision boundary between the plus 1 class and the minus 1 class looks like the one on the slide. To approach this problem, we could start by taking two individual perceptrons and training each one so that it creates a linear decision boundary between the minus 1 and the plus 1 class. We can then take the output of each of those individual perceptrons and feed it into a third perceptron. That third perceptron is capable of combining the outputs of the first and second perceptrons and creating a non-linear decision boundary. In this way, our simple model, consisting of three perceptrons organized in two layers, can now approximate the function we're trying to model.

The exercise we just looked at was a simple example using a binary classification task. But what happens if we have more than two possible output classes? Say we're classifying animals or flowers with many different classes. Rather than using a single unit in the output layer, we can use multiple units. We again combine our perceptrons into layers: an input layer consisting of our input features, which feed into a set of perceptrons in what we call a hidden layer, and the outputs of those hidden-layer perceptrons feed into another layer of perceptrons, our output layer. The output from each perceptron in the output layer represents a score for one of the classes in our problem. We then look at which class has the highest score and assign that class label to the input data point.
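To make the stacking idea concrete, here is a minimal sketch in Python of two threshold perceptrons whose outputs feed a third. The hand-picked weights are hypothetical, chosen only to show that the combined two-layer model produces a non-linear decision boundary (an XOR-like pattern) that no single perceptron could produce on its own:

```python
import numpy as np

def perceptron(x, w, b):
    # Simple threshold unit: +1 if w.x + b > 0, else -1
    return 1 if np.dot(w, x) + b > 0 else -1

def two_layer_model(x):
    # Hidden layer: two perceptrons, each drawing its own linear boundary
    # (weights are hypothetical, picked by hand for illustration)
    h1 = perceptron(x, np.array([ 1.0,  1.0]), -0.5)  # fires when x1 + x2 > 0.5
    h2 = perceptron(x, np.array([-1.0, -1.0]),  1.5)  # fires when x1 + x2 < 1.5
    # Output perceptron combines the two hidden outputs, which carves out
    # a non-linear decision region in the original (x1, x2) space
    return perceptron(np.array([h1, h2]), np.array([1.0, 1.0]), -1.0)

for x in [np.array([0, 0]), np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]:
    print(x, "->", two_layer_model(x))
```

Running this prints -1 for (0, 0) and (1, 1) but +1 for (0, 1) and (1, 0), a pattern no single linear boundary can separate.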
Additionally, when we combine perceptrons or artificial neurons together, rather than using a perceptron as the node in our network, which has a very simple threshold function, we can choose to use a unit that includes an activation function, such as the sigmoid function we saw in logistic regression. But we can also use other functions, such as the hyperbolic tangent or the ReLU function, which is now very commonly used as an activation function. Each node in our network takes a linear combination of its inputs times the weights to calculate Z, passes that Z through a non-linear activation function, and provides the output to the next layer of the network. The use of these non-linear activation functions in each layer, rather than the simple threshold of the perceptron, enables us to better model non-linear relationships.

Let's take a look at a typical neural network architecture and how it works. The network begins with an input layer, which consists of each of the features in our input data. Each feature is then multiplied by a weight and fed into each of the nodes in our first layer of units, which we call the hidden layer. Again, we take a linear combination of the input features times the weights, calculate Z, and pass Z through phi of Z, our activation function. Again, we get to choose the activation function; it could be a sigmoid or it could be a ReLU function. We take the output of that activation function and feed it into the next layer. In this simple example, we have a three-layer neural network: we take a set of inputs, multiply them by our weights, combine them in our hidden layer, pass them through our activation function, and feed that into our output layer. There, the outputs from the hidden layer are multiplied by the output layer's weights, combined, and passed through an activation function to calculate y hat, our prediction from this simple neural network.
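As a rough sketch of that forward pass, here is what the computation might look like in Python with NumPy. The layer sizes, the random weights, and the choice of ReLU in the hidden layer and sigmoid in the output layer are all hypothetical, just to show the flow from the input features, through the hidden layer, to the class scores:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    # Hidden layer: linear combination z = W x + b, then activation phi(z)
    z_hidden = W_hidden @ x + b_hidden
    a_hidden = relu(z_hidden)
    # Output layer: combine hidden activations with another set of weights
    z_out = W_out @ a_hidden + b_out
    y_hat = sigmoid(z_out)  # one score per output class
    return y_hat

# Hypothetical dimensions: 3 input features, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(4)
W_out, b_out = rng.normal(size=(2, 4)), np.zeros(2)

y_hat = forward(x, W_hidden, b_hidden, W_out, b_out)
print("class scores:", y_hat, "predicted class:", int(np.argmax(y_hat)))
```

The final argmax over the output scores is the step described earlier: whichever class has the highest score becomes the label assigned to the input data point.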