Another form of regularization that helps us build more generalizable models is adding dropout layers to our neural networks. To use dropout, I add a wrapper to one or more of my layers. In TensorFlow, the parameter is called dropout, and it's the probability of temporarily dropping a neuron from the network, rather than the probability of keeping it. You want to be very careful when setting this number, because some other functions with a dropout mechanism take a keep probability instead, which is the complement of the drop probability. You don't want to intend only a 10% chance of dropping and end up randomly keeping only 10% of your nodes; that's an unintentionally sparse model.

So let's talk a little bit about how dropout works under the hood. Say we set a dropout probability of 20%. That means on each forward pass through the network, the algorithm rolls the dice for each neuron in the dropout-wrapped layer. If the roll is greater than 20, assuming a 100-sided die, the neuron stays active in the network. If it's 20 or below, the neuron is dropped out and outputs a value of zero regardless of its inputs, contributing nothing to the network. Since adding zero changes nothing, it's as if the neuron doesn't even exist.

To make up for the fact that each node is only kept some percentage of the time, the surviving activations are scaled during training by one over one minus the dropout probability, or in other words, one over the keep probability. That keeps the expected value of each activation the same as it would be without dropout. When not training, without having to change any of the code, the wrapper effectively disappears: the neurons in the former dropout layer are always on and use whatever weights were trained by the model.

Now, the awesome idea about dropout is that it essentially creates an ensemble model, because on each forward pass there's effectively a different network that the mini-batch of data sees as it goes through. When all of this is added together in expectation, it's as if you trained 2^n neural networks, where n is the number of neurons that can be dropped, and had them work together in an ensemble, similar to a bunch of decision trees working together in a random forest.

There's also the added effect of spreading the signal out over the entire network, rather than having a majority of it favor just one branch of the network, because any of those neurons could get dropped out. I usually imagine this as diverting water in a stream or river with multiple channels, dams, or rocks, to ensure all waterways eventually get some water and don't dry up. This way your network uses more of its capacity, since the signal flows more evenly across the entire network, and you get better training and generalization without large dependencies on certain neurons developing along particular paths.

Okay, we mentioned 20% before, so what's a good dropout percentage for your neural network? Typical values for dropout are anywhere between 20% and 50%. If you go much lower than that, there's not much of an effect on the network, since you're rarely dropping out any nodes. But if you go much higher, training doesn't go as well, since the network becomes too sparse to have the capacity to learn the data distribution; more than half the network is going away on each forward pass.
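To make that concrete, here's a minimal sketch of the wrapper idea, assuming the Keras API (tf.keras.layers.Dropout, whose rate argument is the drop probability); the input size and layer widths are made up for illustration.

```python
import tensorflow as tf

# A small stack of layers with a dropout wrapper after each hidden layer.
# The input size (20 features) and layer widths are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    # In tf.keras.layers.Dropout, rate is the probability of DROPPING a unit,
    # not of keeping it.
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])

# Keras uses "inverted" dropout: during training, surviving activations are
# scaled by 1 / (1 - rate) so their expected value stays the same. At
# inference time (training=False, the default), the layer is a no-op.
drop = tf.keras.layers.Dropout(0.2)
x = tf.ones((1, 10))
print(drop(x, training=True))   # a mix of zeros and 1 / (1 - 0.2) = 1.25
print(drop(x, training=False))  # all ones: the wrapper effectively disappears
```

Printing the same input in training and inference mode shows both halves of the story: random zeroing with the compensating 1 / (1 - rate) scale while training, and a pass-through layer once training is done.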
You'll also want to use dropout on larger networks, because there's more capacity for the model to learn independent representations; in other words, there are more possible paths for the network to try. Now, the more you drop out, the less you keep, and the stronger the regularization. If you set your dropout probability to 1, then you keep nothing: every neuron wrapped in a dropout layer is effectively removed from the network and outputs a zero during activation. During backprop, that means the weights won't update and the layer will learn nothing. That's what happens if you set the dropout probability to 1. On the other side of the spectrum, if you set the probability to 0, then all the neurons are kept active and there's no dropout regularization at all; it's pretty much just a more computationally costly way to not have a dropout wrapper. So somewhere between 0 and 1 is where you want to be, typically a dropout probability between 10% and 50%, where a good baseline is to start around 20% and add more as needed. You can adjust that hyperparameter and see what works well for your models. Keep in mind there's no one-size-fits-all dropout probability for all models and all data distributions; that's where your expertise, and trial and error, come into play.
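As a rough illustration of that tradeoff, here's a small sketch, again assuming the Keras Dropout layer, showing how the drop rate controls sparsity and the compensating scale during training; exact counts will vary from run to run because dropping is random.

```python
import tensorflow as tf

# Illustrative only: how the drop rate changes sparsity during training.
x = tf.ones((1, 1000))
for rate in (0.0, 0.2, 0.5, 0.9):
    out = tf.keras.layers.Dropout(rate)(x, training=True)
    kept = int(tf.math.count_nonzero(out))       # roughly (1 - rate) * 1000
    scale = 1.0 / (1.0 - rate)                    # inverted-dropout scale factor
    print(f"rate={rate}: ~{kept}/1000 units active, survivors scaled by {scale:.2f}")
```

At a rate of 0.0 everything passes through unchanged, which is the no-regularization case, while at 0.9 only around a tenth of the units survive each pass, scaled up by 10, which is why very high rates leave too little of the network to actually learn.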