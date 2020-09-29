As you've seen previously, BCE Loss is traditionally used to train GANs. However, it has many problems due to the form of the function it's approximated by. So in this video I'll introduce you to an alternative loss function called Wasserstein Loss, or W-Loss for short, that approximates the Earth Mover's Distance that you saw in the previous video. To that end, first you'll see an alternative way to look at the BCE Loss function that's simpler and more compact, then I'll show you how W-Loss is calculated, and finally I'll compare this loss with BCE Loss.
So, BCE Loss is computed by a long equation that essentially measures how badly, on average, observations are being classified by the discriminator as fake or real. The generator in GANs wants to maximize this cost, because that means the discriminator is saying that its fake values seem really real, while the discriminator wants to minimize that cost. And so, this is often referred to as a minimax game.
This very long equation for BCE Loss can be simplified as follows. The sum and division over the m examples is nothing but a mean, or expected value. The first part inside the sum measures how badly the discriminator classifies real observations, where y equals 1, and 1 means real. And the second part measures how badly it classifies fake observations produced by the generator, where y equals 0, so the 1 minus y term is the active one, and 0 means fake.
W-Loss, on the other hand, approximates the Earth Mover's Distance between the real and generated distributions, and it has nicer properties than BCE. It does look very similar to the simplified form of the BCE Loss, though: in this case the function calculates the difference between the expected values of the discriminator's predictions. Here the discriminator is called the critic, and I'll go over why later, so I'm going to represent it with a c here.
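Since the slide equations aren't part of this transcript, here is a sketch of how the two cost functions are usually written. The symbol names are assumptions chosen to match the narration (h for the discriminator's prediction in the long form, d for the discriminator, g for the generator, c for the critic, z for the noise vector, m for the number of examples), so treat this as a reconstruction rather than the exact slide.

```latex
% Long form of BCE cost: mean over m examples of the per-example log loss
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h\big(x^{(i)}, \theta\big)
       + \big(1 - y^{(i)}\big) \log\Big(1 - h\big(x^{(i)}, \theta\big)\Big) \right]

% Simplified minimax form: the first expectation covers the reals (y = 1),
% the second covers the fakes produced by the generator (y = 0)
\min_{d} \max_{g} \; -\Big[ \mathbb{E}\big(\log d(x)\big)
                          + \mathbb{E}\big(\log(1 - d(g(z)))\big) \Big]

% W-Loss: difference of the critic's expected outputs, with no logarithms
\min_{g} \max_{c} \; \mathbb{E}\big(c(x)\big) - \mathbb{E}\big(c(g(z))\big)
```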
And this is c of a real example x, versus c of a fake example g of z. The generator takes in a noise vector z to produce a fake image g of z, or perhaps you can call it x-hat. So the discriminator looks at these two things, and it wants to maximize the distance between its thoughts on the reals versus its thoughts on the fakes. So it's trying to push these two distributions as far apart as possible. Meanwhile, the generator wants to minimize this difference, because it wants the discriminator to think that its fake images are as close as possible to the reals. Note that, in contrast with BCE, there are no logs in this function, since the critic's outputs are no longer bounded to be between 0 and 1.
So, for the BCE Loss to make sense, the output of the discriminator needs to be a prediction between 0 and 1. And so the discriminator's neural network for GANs trained with BCE Loss has a sigmoid activation function in the output layer to squash the values between 0 and 1. W-Loss, however, doesn't have that requirement at all, so you can actually have a linear layer at the end of the discriminator's neural network, and that could produce any real-valued output. And you can interpret that output as how real an image is considered by the critic, which, by the way, is what we're now calling the discriminator, because its output is no longer bounded between 0 and 1, where 0 means fake and 1 means real. It's no longer classifying into these two classes, or discriminating between them. As a result, it wouldn't make much sense to call that neural network a discriminator. And so, for W-Loss, the equivalent of a discriminator is called a critic, and what it tries to do is maximize the distance between its evaluation of a fake and its evaluation of a real.
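To make the calculation concrete, here is a minimal sketch in plain Python of the quantity the critic and generator are fighting over. The function names (`critic_loss`, `generator_loss`) and the example scores are made up for illustration; in a real setup the scores would come from the critic network's linear output layer, and the losses are negated here only because optimizers usually minimize.

```python
def mean(values):
    """Average of a list of numbers, standing in for the expected value."""
    return sum(values) / len(values)


def critic_loss(real_scores, fake_scores):
    """The critic wants to MAXIMIZE mean(c(x)) - mean(c(g(z))).

    Written as a loss to minimize, that is the negated difference.
    """
    return -(mean(real_scores) - mean(fake_scores))


def generator_loss(fake_scores):
    """The generator wants to MINIMIZE mean(c(x)) - mean(c(g(z))).

    Only the second term depends on the generator, so minimizing the
    difference is the same as minimizing -mean(c(g(z))).
    """
    return -mean(fake_scores)


# Hypothetical critic outputs: unbounded real numbers, not probabilities.
real_scores = [2.0, 3.0, 4.0]    # c(x) on real examples
fake_scores = [-2.0, -1.0, 0.0]  # c(g(z)) on generated examples

# Difference of the two means, the quantity W-Loss is built on.
score_gap = mean(real_scores) - mean(fake_scores)

print("gap between real and fake scores:", score_gap)
print("critic loss:", critic_loss(real_scores, fake_scores))
print("generator loss:", generator_loss(fake_scores))
```

Notice that there is no sigmoid and no logarithm anywhere: the scores are used as they are, which is exactly why the critic's output layer can be linear.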
So, some of the main differences between W-Loss and BCE Loss are that the discriminator under BCE Loss outputs a value between 0 and 1, while the critic in W-Loss can output any number. Additionally, the forms of the cost functions are very similar, but W-Loss doesn't have any logarithms in it, and that's because it's a measure of how far the critic's prediction on the reals is from its prediction on the fakes. Meanwhile, BCE Loss measures how far each prediction, on a fake or on a real, is from a ground truth of 0 or 1. And so what's important to take away here is largely that the discriminator is bounded between 0 and 1, whereas the critic is no longer bounded, and is just trying to separate the two distributions as much as possible. As a result, because it's not bounded, the critic is allowed to improve without degrading its feedback to the generator. This is because it doesn't have a vanishing gradient problem, and this will mitigate mode collapse, because the generator will always get useful feedback.
So, in summary, W-Loss looks very similar to BCE Loss, but it isn't as complex a mathematical expression. Under the hood, what it does is approximate the Earth Mover's Distance, so it helps prevent mode collapse and vanishing gradient problems. However, there is an additional condition on this cost function for it to work well and for it to be valid, and you'll see that in the following videos.
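One way to see the vanishing gradient point is to compare how much feedback the generator gets through a sigmoid output versus a linear one. The sketch below is plain Python under stated assumptions: the BCE side uses the generator term log(1 - sigmoid(s)), where s is the discriminator's pre-sigmoid score on a fake, and the W-Loss side uses the generator term -s, where s is the critic's score on a fake. The function names are hypothetical, and gradients are taken only with respect to that score, not through a full network.

```python
import math


def sigmoid(s):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def bce_generator_grad(score):
    """Derivative of log(1 - sigmoid(score)) with respect to score.

    Working through the chain rule, this simplifies to -sigmoid(score),
    which shrinks toward 0 as the discriminator grows confident
    that the sample is fake (score very negative).
    """
    return -sigmoid(score)


def w_generator_grad(score):
    """Derivative of -score with respect to score.

    The critic's output is linear, so the slope is constant no matter
    how confidently the critic rates the sample as fake.
    """
    return -1.0


# As the score on a fake gets more negative, the BCE feedback fades
# while the W-Loss feedback stays the same size.
for score in (0.0, -5.0, -10.0):
    print(score, bce_generator_grad(score), w_generator_grad(score))
```

This only illustrates the saturation effect at the output layer; it is not a claim about training behavior as a whole, which also depends on the extra condition mentioned at the end of the video.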