We've talked about neural networks in previous courses and modules, but now let's learn some of the science behind them. We recently saw that feature crosses did very well in a problem like this. If x1 is the horizontal dimension and x2 is the vertical dimension, there was no linear combination of the two features that could describe this distribution. It wasn't until we did some feature engineering and crossed x1 and x2 to get a new feature, x3, equal to x1 times x2, that we were able to describe our data distribution. So, manual, handcrafted feature engineering can easily solve all of our nonlinear problems, right? Unfortunately, the real world almost never has such easily described distributions. Feature engineering, even after years of the brightest people working on it, can only get so far. For instance, what feature crosses would you need to model this distribution? It looks like two circles on top of each other, or maybe two spirals, but whatever it is, it's very messy. This example sets up the usefulness of neural networks: they can algorithmically create very complex feature crosses and transformations. You can imagine much more complicated spaces than even this spiral that really necessitate the use of neural networks.

Neural networks can help as an alternative to feature crossing by combining features. When we design our neural network architecture, we want to structure the model in such a way that features are combined. Then we want to add another layer to combine those combinations, and then another layer to combine those combinations, and so on. How do we choose the right combinations of our features, and the combinations of those, and so on? We get the model to learn them through training, of course. This is the basic intuition behind neural networks. This approach isn't necessarily better than feature crosses, but it is a flexible alternative that works well in many cases.

Here is a graphical representation of a linear model. We have three inputs, x1, x2, and x3, shown by the blue circles. They're combined with some weight given on each edge to produce an output. There often is an extra bias term, but for simplicity it isn't shown here. This is a linear model, since it is of the form y equals w1 times x1, plus w2 times x2, plus w3 times x3.

Now, let's add a hidden layer to our network of nodes and edges. Our input layer has three nodes and our hidden layer also has three, but now they're hidden nodes. Since this is a fully connected layer, there are three times three edges, or nine weights. Surely this is a nonlinear model now that we can use to solve our nonlinear problems, right? Unfortunately not. Let's break it down. The input to the first hidden node is the weighted sum w1 times x1, plus w4 times x2, plus w7 times x3. The input to the second hidden node is the weighted sum w2 times x1, plus w5 times x2, plus w8 times x3. The input to the third hidden node is the weighted sum w3 times x1, plus w6 times x2, plus w9 times x3. Combining it all together at the output node, we have w10 times h1, plus w11 times h2, plus w12 times h3. Remember, though, that h1, h2, and h3 are just linear combinations of the input features. Therefore, expanding it all out, we're left with a complex set of weight constants multiplied by each input value x1, x2, and x3. We can substitute each group of weights with a new weight. Look familiar? This is exactly the same linear model as before, despite adding a hidden layer of neurons. So, what happened? What if we added another hidden layer?
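To make that collapse concrete, here is a minimal NumPy sketch of the three-input, three-hidden-node network just described. The weight values are made up for illustration, but the algebra is the same: with no activation function, the two weight matrices multiply out into a single weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 3))   # the nine weights between the input and hidden layers
w_out = rng.normal(size=(3, 1))      # w10, w11, w12 from the hidden layer to the output

x = np.array([[1.0, 2.0, 3.0]])      # one example with features x1, x2, x3

# Forward pass with no activation: h = x @ W_hidden, then y = h @ w_out.
h = x @ W_hidden
y_with_hidden_layer = h @ w_out

# Collapse the chain: a single "linear model" weight vector gives the same answer.
w_collapsed = W_hidden @ w_out
y_linear_model = x @ w_collapsed

print(np.allclose(y_with_hidden_layer, y_linear_model))  # True: the hidden layer added nothing
```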
Unfortunately, this once again collapses all the way back down into a single weight matrix multiplied by each of the three inputs. It is the same linear model. We could continue this process ad infinitum and it would still be the same result, albeit a lot more costly computationally, for training or prediction, with a much, much more complicated architecture than needed.

Thinking about this from a linear algebra perspective, you're multiplying multiple matrices together in a chain. In this small example, I first multiply a three by three matrix, the transpose of the weight matrix between the input layer and hidden layer one, by the three by one input vector, resulting in a three by one vector whose entries are the values at each neuron in hidden layer one. To find the second hidden layer's neuron values, I multiply the transpose of its three by three weight matrix, the one connecting hidden layer one with hidden layer two, by my resultant vector at hidden layer one. As you can guess, the two three by three weight matrices can be combined into one three by three matrix by calculating the matrix product first, whether from the left side or from the right. This still gives the same shape for h2, the vector of the second hidden layer's neuron values. Adding in the final layer between hidden layer two and the output layer, I need to multiply the preceding steps by the transpose of the weight matrix between the last two layers. Even though, when feeding forward through a neural network, you perform the matrix multiplication from right to left, by applying it from left to right you can see that our large chain of matrix multiplications collapses down into just a three-valued vector. If you train this model side by side with a simple linear regression of three weights, and they both fall into the same minimum on the loss surface, then even though I did a ton of computation to calculate all 21 weights, my matrix product chain condenses down into the lower equation, and the weights will exactly match the simple linear regression's weights. All of that work for the same result.

You're probably thinking now, "Hey, I thought neural networks were all about adding layers upon layers of neurons. How can I do deep learning when all of my layers collapse into just one?" I've got good news for you: there is an easy solution. The solution is adding a non-linear transformation layer, which is facilitated by a nonlinear activation function such as sigmoid, tanh, or ReLU. Thinking in terms of the graph, such as the ones you're making in TensorFlow, you can imagine each neuron actually having two nodes: the first node being the result of the weighted sum wx plus b, and the second node being the result of passing that through the activation function. In other words, there are the inputs to the activation function followed by the outputs of the activation function, so the activation function acts as the transition point between them.

Adding in this non-linear transformation is the only way to stop the neural network from condensing back into a shallow network. Even if you have a layer with nonlinear activation functions in your network, if elsewhere in the network you have two or more consecutive layers with linear activation functions, those can still be collapsed into just one layer. Usually, neural networks have all layers nonlinear for the first n minus one layers, and then have the final layer's transformation be linear for regression, or sigmoid or softmax, which we'll talk about soon, for classification.
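Here is a small NumPy sketch of that two-nodes-per-neuron idea; the weights are again arbitrary, but it shows each layer computing a weighted sum z and then an activation a, and why the chain no longer collapses once ReLU sits between the layers.

```python
import numpy as np

def relu(z):
    # Nonlinear activation: zero in the negative domain, identity in the positive domain.
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 3))   # input layer -> hidden layer one
W2 = rng.normal(size=(3, 3))   # hidden layer one -> hidden layer two
W3 = rng.normal(size=(3, 1))   # hidden layer two -> output

x = np.array([[1.0, -2.0, 0.5]])

# Each neuron is really two nodes: the weighted sum z, then the activation a.
z1 = x @ W1
a1 = relu(z1)
z2 = a1 @ W2
a2 = relu(z2)
y = a2 @ W3

# Without the activations, the whole chain collapses into one matrix product.
y_collapsed = x @ (W1 @ W2 @ W3)

print(y, y_collapsed)          # generally different once ReLU is in the chain
```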
It all depends on what you want the output to be. Thinking about this again from a linear algebra perspective, when we apply a linear transformation to a matrix or vector, we are multiplying it by a matrix to get our desired shape and result. For example, when I want to scale a matrix, I can multiply it by a constant. But what I am really doing is multiplying it by an identity matrix multiplied by that constant; in other words, by a diagonal matrix with that constant along the diagonal. This can be collapsed into just a matrix product. However, if I add a non-linearity, what I am doing can no longer be represented by a matrix, since I am element-wise applying a function to my input. For instance, if I have a nonlinear activation function between my first and second hidden layers, I'm applying a function to the product of the transpose of my first hidden layer's weight matrix and my input vector. The lower equation is my activation function, in this case a ReLU. Since I cannot represent the transformation in terms of linear algebra, I can no longer collapse that portion of my transformation chain, so the complexity of my model remains; it doesn't collapse into just one linear combination of the inputs. Note that I can still collapse the second hidden layer's weight matrix and the output layer's weight matrix, since there is no nonlinear function being applied there. This means that whenever there are two or more consecutive linear layers, they can always be collapsed back into one layer, no matter how many there are. Therefore, to get the most complex functions created by your network, it's best to have your entire network use nonlinear activation functions, except at the last layer, in case you want a different type of output at the end.

Why is it important to add non-linear activation functions to neural networks? The correct answer is: because it stops the layers from collapsing back into just a linear model. Not only do nonlinear activation functions help create interesting transformations of our data's feature space, they also allow for deep compositional functions. As we explained, if there are any two or more consecutive layers with linear activation functions, their product of matrices can be summarized by just one matrix times the input feature vector, so you end up with a slower model with more computation but with all of your functional complexity reduced. Non-linearities do not add regularization to the loss function, and they do not invoke early stopping. Also, even though nonlinear activation functions do create complex transformations in the vector space, the dimension does not change; it remains the same vector space, albeit stretched, squished, or rotated.

As mentioned in one of our previous courses, there are many nonlinear activation functions, with sigmoid, and its scaled and shifted cousin the hyperbolic tangent, being some of the earliest. However, as mentioned before, these can saturate, which leads to the vanishing gradient problem: with zero gradients, the model's weights don't update and training halts. The Rectified Linear Unit, or ReLU for short, is one of our favorites because it's simple and works well. In the positive domain it is linear, so we don't have saturation, whereas in the negative domain the function is zero. Networks with ReLU hidden activations often train around ten times faster than networks with sigmoid hidden activations. However, because the function is always zero in the negative domain, we can end up with ReLU layers dying.
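To see the saturation problem numerically, here is a small, illustrative NumPy sketch comparing the derivatives of sigmoid, tanh, and ReLU at a few arbitrary input values: the sigmoid and tanh gradients vanish at the extremes, while ReLU keeps a gradient of one anywhere in the positive domain.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # near zero for large |z|: saturation

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # also near zero for large |z|

def d_relu(z):
    return (z > 0).astype(float)  # one in the positive domain, zero in the negative

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(d_sigmoid(z))   # ~[0.00005, 0.20, 0.24, 0.00005]
print(d_tanh(z))      # ~[0.0, 0.42, 0.79, 0.0]
print(d_relu(z))      # [0., 0., 1., 1.]
```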
What I mean by this is that when you start getting inputs in the negative domain, the output of the activation will be zero, which doesn't help the next layer get inputs in the positive domain. This compounds and creates a lot of zero activations. During backpropagation, when updating the weights, since we have to multiply our error's derivative by the activation, we end up with a gradient of zero, thus a weight update of zero; the weights don't change and training fails for that layer. Fortunately, a lot of clever methods have been developed that slightly modify the ReLU to ensure training doesn't stall, while still keeping many of the benefits of the vanilla ReLU.

Here again is the vanilla ReLU. The maximum operator can also be represented by a piecewise linear equation where, for inputs less than zero, the function is zero, and for inputs greater than or equal to zero, the function is x. A smooth approximation of the ReLU is the analytic function of the natural log of one plus the exponential of x. This is called the Softplus function. Interestingly, the derivative of the Softplus function is the logistic function. The pro of using the Softplus function is that it's continuous and differentiable at zero, unlike the ReLU function. However, due to the natural log and exponential, there's added computation compared to ReLUs, and ReLUs still get results that are just as good in practice. Therefore, Softplus is usually discouraged in deep learning.

To try and solve our issue of dying ReLUs due to zero activations, the Leaky ReLU was developed. Just like ReLUs, Leaky ReLUs have a piecewise linear function. However, in the negative domain, rather than zero, they have a small non-zero slope, specifically 0.01. This way, when the unit is not activated, Leaky ReLUs still allow a small non-zero gradient to pass through, which hopefully will allow weight updating and training to continue. Taking this leaky idea one step further is the parametric ReLU, or PReLU for short. Here, rather than arbitrarily allowing one hundredth of x through in the negative domain, it lets alpha times x through. But what is the parameter alpha supposed to be? In the graph, I set alpha to 0.5 for visualization purposes, but in practice it is actually a parameter learned during training along with the other neural network parameters. This way, rather than us setting this value, it will be determined during training from the data, and should end up more optimal than anything we could set a priori. Notice that when alpha is less than one, the formula can be rewritten back into a compact form using the maximum: specifically, the max of x and alpha times x. There are also randomized Leaky ReLUs where, instead of alpha being trained, it is sampled randomly from a uniform distribution. This can have an effect similar to dropout, since you technically have a different network for each value of alpha, and therefore it makes something similar to an ensemble. At test time, all the values of alpha are averaged into a deterministic value to use for predictions.

There is also the ReLU6 variant. This is another piecewise linear function, with three segments. Like a normal ReLU, it is zero in the negative domain; however, in the positive domain, the ReLU6 is capped at six. You're probably thinking, "Why is it capped at six?" You can imagine one of these ReLU units as being made of only six replicated, bias-shifted Bernoulli units, rather than an infinite amount, due to the hard cap.
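Here are small NumPy sketches of the variants described above, written as stand-alone functions; the alpha passed to the PReLU is just an illustrative constant, since in a real network it would be a trained parameter.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)                    # max(0, x)

def softplus(x):
    return np.log1p(np.exp(x))                   # ln(1 + e^x), smooth approximation of ReLU

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)        # small fixed slope in the negative domain

def prelu(x, alpha):
    return np.where(x >= 0, x, alpha * x)        # alpha is learned during training

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)   # hard cap at six in the positive domain

x = np.array([-2.0, -0.5, 0.0, 3.0, 8.0])
for fn in (relu, softplus, leaky_relu, relu6):
    print(fn.__name__, fn(x))
print("prelu", prelu(x, alpha=0.5))              # alpha = 0.5, as in the visualization
```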
In general, these are called ReLU-n units, where n is the cap value; in testing, six was found to be the most optimal value. ReLU6 units can help models learn sparse features sooner. They were first used in convolutional deep belief networks on the CIFAR-10 image dataset. They also have the useful property of preparing the network for fixed-point precision at inference time. If the upper limit is unbounded, you lose too many bits to the integer part of the fixed-point number, whereas with an upper limit of six, enough bits are left for the fractional part to represent the number well enough for good inference.

Lastly, there is the exponential linear unit, or ELU. It is approximately linear in the non-negative portion of the input space, and it is smooth, monotonic and, most importantly, non-zero in the negative portion of the input. ELUs are also better zero-centered than vanilla ReLUs, which can speed up learning. The main drawback of ELUs is that they are more computationally expensive than ReLUs, due to having to calculate the exponential.

Neural networks can be arbitrarily complex: there can be many layers, many neurons per layer, multiple outputs and inputs, different types of activation functions, et cetera. What is the purpose of multiple layers? Each layer I add adds to the complexity of the functions I can create; each subsequent layer is a composition of the previous functions. Since we are using nonlinear activation functions in my hidden layers, I'm creating a stack of data transformations that rotate, stretch, and squeeze my data. Remember, the purpose of doing all of this is to transform my data in such a way that I can nicely fit a hyperplane to it for regression, or separate my data with hyperplanes for classification. We are mapping from the original feature space to some new, convoluted feature space.

What does adding additional neurons to a layer do? Each neuron I add adds a new dimension to my vector space. If I begin with three input neurons, I start in an R3 vector space, but if my next layer has four neurons, then I move to an R4 vector space. Back when we talked about kernel methods in our previous course, we had a dataset that couldn't be easily separated with a hyperplane in the original input vector space. But by adding a dimension and then transforming the data to fill that new dimension in just the right way, we were easily able to make a clean slice between the classes of data. The same applies here with neural networks.

What might having multiple output nodes do? Having multiple output nodes allows you to compare against multiple labels and then propagate the corresponding errors backwards. You can imagine doing image classification where there are multiple entities or classes within each image. We can't just predict one class, because there may be many, so having this flexibility is great.

Neural networks can be arbitrarily complex. To increase hidden dimensions, I can add blank. To increase function composition, I can add blank. If I have multiple labels per example, I can add blank. The correct answer is: neurons, layers, outputs. To change hidden dimensions, I can change a layer's number of neurons, since that determines the dimension of the vector space the intermediate vector lives in. If a layer has four neurons, then it is in an R4 vector space, and if a layer has 500 neurons, it is in an R500 vector space, meaning it has 500 real dimensions.
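As a quick sketch of both of those ideas, here is an illustrative NumPy version of the ELU plus a tiny forward pass with made-up layer widths, showing how the number of neurons in each layer sets the dimension of the intermediate vector space:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Linear for non-negative inputs; smooth, non-zero alpha * (e^x - 1) for negatives.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

rng = np.random.default_rng(2)
W1 = rng.normal(size=(3, 4))      # 3 input features -> hidden layer of 4 neurons (R3 to R4)
W2 = rng.normal(size=(4, 500))    # 4 neurons -> hidden layer of 500 neurons (R4 to R500)

x = np.array([[0.2, -1.3, 0.7]])  # one example in R3
h1 = elu(x @ W1)                  # shape (1, 4): the example now lives in R4
h2 = elu(h1 @ W2)                 # shape (1, 500): now in R500
print(h1.shape, h2.shape)
```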
Adding a layer doesn't change the dimension of the previous layer, and it might not even change the dimension of its own layer, unless it has a different number of neurons than the previous layer. What additional layers do add is a greater composition of functions. Remember, g of f of x is the composition of the function g with the function f on the input x. Therefore, I first transform x by f, and then transform that result by g. The more layers I have, the deeper the nested functions go. This is great for combining nonlinear functions together to make very convoluted feature maps that are hard for humans to construct but great for computers, and they allow us to get our data into a shape that we can learn from and gain insights from. Speaking of insights, we receive those through our output layer, where, during inference, the outputs will be the answers to our ML-formulated problem. If you only want to know the probability of an image being a dog, then you can get by with only one output node. But if you wanted to know the probability of an image being a cat, dog, bird, or moose, then you would need to have a node for each one. The other three answers are all wrong since they get two or more of the words wrong.
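As an illustration of that last point, here is a minimal, hypothetical tf.keras sketch; the layer sizes and the 128-dimensional input are made up, but it contrasts a single sigmoid output node for the dog-or-not question with a four-node output layer covering cat, dog, bird, and moose.

```python
import tensorflow as tf

# One output node: the probability that the image contains a dog.
dog_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Four output nodes, one per class. Sigmoid on each node lets several
# classes be present in the same image (multi-label classification).
multi_label_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="sigmoid"),
])

dog_model.compile(optimizer="adam", loss="binary_crossentropy")
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
```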