In this video, you'll learn about Frechet inception distance, or FID, the most popular metric for measuring the feature distance between real and generated images. First you'll learn about Frechet distance, then see how you can apply it to real and fake embeddings as Frechet inception distance, or FID, to see how far apart they are. Finally, you'll learn about a few of FID's limitations. Evaluation is very much an open area in generative models research.

Frechet distance, named after the mathematician Maurice Frechet, is a distance metric used to measure the distance between curves, and it can be extended to comparing distributions as well. The dog walker is a classic example used to illustrate Frechet distance, where the dog is on one curve and the walker is on the other. So the dog here is on the blue curve and the walker here is on the orange one. Each can go at their own speed, but neither of them can go backwards, so keep that in mind. To calculate the Frechet distance between these curves, you need to figure out the minimum leash length needed to walk the curves from beginning to end. That is, what is the least amount of leash you can give your dog without ever having to give them more slack during the walk? That's the intuition behind the Frechet distance between two curves.

You can also calculate the Frechet distance between distributions. For many families of distributions, the Frechet distance between two distributions can be solved analytically. For example, there's a simple formula to calculate the Frechet distance between two univariate normal distributions. Let's look at both distributions' means, represented by mu here, as well as their standard deviations, represented by sigma. One distribution is X and one is Y, so you can imagine X is for the dog walker and Y is for the dog. The mean gives you a sense of their center, and the standard deviation gives you a sense of their spread. Notationally, you take the difference between the means and the difference between the standard deviations, and then square each of these differences to penalize values that are further away from each other and to get a notion of distance.

Keeping this in mind, let's take a quick detour to talk about multivariate normal distributions. Multivariate normal distributions generalize the idea of a normal distribution to higher dimensions. These extra dimensions allow you to model much more complex distributions than just one parameterized by a single mean and a single standard deviation. This is important because the features of your real data probably aren't distributed that nicely in just a single normal distribution. There are probably peaks in different areas, modes around certain features like a black nose for dogs or a tongue sticking out. A multivariate distribution is one way to represent that, and a multivariate normal distribution has particularly nice properties, so it's often used. So far, the distributions you've seen are univariate, meaning single-dimensional. The easiest way to visualize a multivariate normal distribution is to have a univariate normal distribution for each dimension, here and here. Together these construct this dark circle-looking shape, which is actually pointing straight out at you in three dimensions, where the darker the values, the higher its altitude. This resulting distribution illustrates how much each dimension varies based on the other, how much they affect each other.
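To make that concrete, here's a small sketch, using numpy with purely illustrative numbers, of sampling from a two-dimensional normal distribution, once with independent dimensions and once with the dimensions co-varying; the covariance matrix idea used here is exactly what's discussed next:

```python
import numpy as np

# A two-dimensional normal distribution is described by a mean vector
# (the center) and a covariance matrix (the spread and how the
# dimensions affect each other). The numbers here are just for illustration.
mean = np.array([0.0, 0.0])

# Zeros off the diagonal: the two dimensions are independent.
independent_cov = np.array([[1.0, 0.0],
                            [0.0, 1.0]])

# Non-zero off-diagonal values: the dimensions co-vary, so the
# distribution looks lopsided along a diagonal.
covarying_cov = np.array([[ 1.0, -0.5],
                          [-0.5,  1.0]])

rng = np.random.default_rng(0)
independent_samples = rng.multivariate_normal(mean, independent_cov, size=10_000)
covarying_samples = rng.multivariate_normal(mean, covarying_cov, size=10_000)

# The sample covariance approximately recovers the matrix we used.
print(np.cov(covarying_samples, rowvar=False))
```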
Here, both dimensions contribute equally to this center and they don't actually affect each other in doing so, so you have this nice round normal distribution peak coming out at you. But we can allow the dimensions to co-vary, meaning affect each other. This means that certain values in one dimension will cause values in another dimension to become more or less likely, so the resulting distribution could look more lopsided, like this. Note that the individual normals here still look the same, even though the dimensions are affecting each other differently.

To express that, you use a covariance matrix, which generalizes the idea of variance, which you've seen as sigma squared, where sigma on its own is the standard deviation and sigma squared is the variance. Remember that the variance quantifies the spread of the normal distribution. Covariance is aptly named because it measures the variance between two dimensions. Along the diagonal of this covariance matrix, you see the variance within a dimension: for example, dimension one with dimension one here, and dimension two with dimension two there. Dimension one is over here, dimension two is over there. On the off-diagonal you see these 0 values. This 0 value relates dimension one to dimension two, and this one relates dimension two to dimension one. So the ones on the diagonal are telling you what the variance is within each dimension, and it's just one. If those values were two, for example, then you would see a much wider circle with much more spread and a much lower peak, because the values would be spread out much more.

A zero in an off-diagonal cell indicates no covariance between those dimensions, so this covariance matrix, which has zeros in all of the off-diagonal elements, means that the two dimensions are independent. Again, if the diagonal values were different, say two, or even 0.5, the peak and spread would change, with 0.5 giving a higher peak and less spread, but the two dimensions would still be independent. Technically, assuming independence, you don't need these off-diagonal covariance values to fully describe the spread of this distribution. However, a covariance matrix with non-zero off-diagonal elements indicates covariance between the two dimensions, and a negative value indicates a negative correlation; you can think of it as the direction of that correlation, here along the diagonal y equals negative x. A positive value in this covariance matrix would mean the opposite, like this, trending outwards towards you along y equals x instead. A value of 0.5 would concentrate the spread more around the center, with less magnitude in the covariance, while negative 0.5 would do the same in the opposite direction.

Now that you have some intuition behind multivariate normal distributions, let's recall the formula for Frechet distance between univariate normal distributions. That was in single-dimensional space: the difference between the means squared plus the difference between the standard deviations squared, to get a notion of distance between the two distributions. You can generalize this to the multivariate case, which just means comparing two of the multivariate normal distributions you were just looking at, and this is done by using their covariance matrices, capital Sigma here. There are lots of parallels between the two formulas. This might look daunting at first, so let's break it down.
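Written out, the two formulas being compared are the univariate one (top) and its multivariate generalization (bottom), where Tr is the trace and the square root is a matrix square root:

```latex
d(X, Y) = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2

d(X, Y) = \lVert \mu_X - \mu_Y \rVert^2
          + \mathrm{Tr}\!\left( \Sigma_X + \Sigma_Y - 2\,(\Sigma_X \Sigma_Y)^{1/2} \right)
```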
First, both formulas include the square of the distance between the means. The multivariate one simply takes the magnitude of the difference vector, since it's no longer a single value, and that's this norm here. In the single-dimensional case you can actually still write it like this, and it evaluates to exactly this up here. You can also expand the formula for the difference between the standard deviations in the univariate case: if you expand this square here, you get this term, and you can see that you end up with something very similar to the multivariate formula, where Tr refers to the trace of a matrix, which is just the sum of its diagonal elements. For example, for the matrix with two, negative one, negative one, two that you saw before, the trace would only look at the diagonal elements and take their sum, so the trace would be four. Note that the sum of the diagonal elements of, say, Sigma X is just the sum of the variances, i.e., the covariance that each dimension has with itself, not with another dimension. The covariance between one dimension and another isn't really considered when the trace is taken of these matrices.

Also, everything matches up in terms of squaring. In the univariate case you have this square here, and it is squared relative to this square root, because sigma is the standard deviation and sigma squared is the variance, while this big Sigma down here is the covariance matrix, a generalization of sigma squared, the variance, so they do match. Finally, to be extra clear, this matrix square root down here takes the square root of the matrix, not of each individual element. For this multivariate normal Frechet distance, what you need to take away is that it's very much a generalization of the univariate case: it looks at differences between both the centers of the distributions, the means, as well as the spreads of the distributions, the variances, or covariances in this case.

Now, how is this useful? The multivariate normal distribution can approximately model the many modes in your image features. From your real image feature embeddings, you can construct a multivariate normal distribution. This will be based on many, say, 50,000 real embedding vectors, looking at how they're distributed and fitting a multivariate normal distribution to them, which just means finding a mean and covariance matrix across all of these examples. Now you do the same for your fakes: you get their feature embeddings, you fit a multivariate normal distribution to them, and then you can compare the two multivariate normal distributions, the reals, which I'll call X, and the fakes, which I'll call Y. That just corresponds to these up here, where Mu X is the mean of the real embeddings, Mu Y is the mean of the fake embeddings, Sigma X is the covariance matrix of the real embeddings, and Sigma Y is the covariance matrix of the fake embeddings, which are then used here. So between the reals and the fakes, you first get embeddings for those images, fit a multivariate normal distribution to each, and then compute the Frechet distance between them. This is called Frechet Inception Distance, or FID, and it's currently the most widely used GAN evaluation metric. You've already seen the Frechet distance part, and inception just refers to using the Inception-v3 network to extract those features, those embeddings, for the reals and for the fakes.
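Putting those pieces together, here's a rough sketch of how this might look in code. Caveats: it assumes a recent torchvision with the `weights=` API, it uses torchvision's ImageNet-pretrained Inception-v3 as a stand-in for the exact feature extractor used in standard FID implementations, and the function and variable names are just illustrative.

```python
import numpy as np
import torch
from scipy import linalg
from torchvision import models

# --- Feature extraction ------------------------------------------------------
# ImageNet-pretrained Inception-v3 with its classifier head removed, so it
# outputs the 2048-dimensional pooled features for each image.
inception = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
inception.fc = torch.nn.Identity()   # expose the pooled features instead of logits
inception.eval()

@torch.no_grad()
def get_embeddings(image_batch):
    """image_batch: float tensor of shape (N, 3, 299, 299), ImageNet-normalized.
    Returns a (N, 2048) numpy array of embeddings."""
    return inception(image_batch).cpu().numpy()

# --- Frechet distance between two sets of embeddings -------------------------
def frechet_inception_distance(real_embeddings, fake_embeddings):
    """Fit a multivariate normal (mean + covariance) to each set of embeddings
    and compute the Frechet distance between the two fitted distributions."""
    mu_x, mu_y = real_embeddings.mean(axis=0), fake_embeddings.mean(axis=0)
    sigma_x = np.cov(real_embeddings, rowvar=False)
    sigma_y = np.cov(fake_embeddings, rowvar=False)

    # Squared distance between the means.
    mean_term = np.sum((mu_x - mu_y) ** 2)

    # Matrix square root of the product of the covariances (not element-wise).
    covmean = linalg.sqrtm(sigma_x @ sigma_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    # Trace term comparing the spreads (covariances) of the two distributions.
    return mean_term + np.trace(sigma_x + sigma_y - 2.0 * covmean)
```

In practice you would run many batches of reals and many batches of fakes, say 50,000 images each, through `get_embeddings`, stack the results into two arrays, and pass them to `frechet_inception_distance`; popular FID libraries package essentially this pipeline for you.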
Stepping back, FID looks at the statistics of the real multivariate normal distribution and the statistics of the fake multivariate normal distribution, where the statistics are just the means and covariance matrices, and calculates how far apart those statistics are from each other. The closer those statistics are to each other, the closer the fake embeddings model the real embeddings. Now, FID does assume that the embeddings take on a multivariate normal distribution, not because that's always perfectly accurate, but mainly because it's a reasonable approximation and easy to compute. When all of this is computed, you're left with a number that represents the difference between the real and fake distributions. If the fakes are close to the reals, meaning your model is generating fake outputs close to the reals, the lower this difference, this number, will be. A smaller value means that the features of the reals and fakes are more similar to each other, so the lower the FID, the better. This can sometimes trip people up, because other evaluation metrics lean towards higher values being better, but for FID, lower is better because it measures how far apart the fakes are from the reals.

Unfortunately, there isn't a super interpretable range for FID values. Wouldn't it be great if the values were between zero and one, and 0.5 meant half were fake? That's not the case for FID, but generally, the closer you are to zero, the better, because at zero your fakes are indistinguishable from the reals based on those statistics computed from many of their embeddings. To give a sense of the scale of this computation, recall that there are typically 2048 dimensions in a feature vector, or embedding, from the Inception network. FID is typically calculated over a large number of samples, say at least 50,000 reals and 50,000 fakes. Using a large number of samples reduces the noise and selection bias you'd have if you compared, say, just a couple of samples.

Additionally, the ImageNet-pretrained Inception-v3 model doesn't always extract the features you want, or they don't always make sense for your generator's task. For example, if your GAN is trained on MNIST to generate handwritten digits, the Inception-v3 model might not be able to detect meaningful features, since ImageNet is composed of natural photos like dogs and humans, and quite frankly, a lot of dogs. Also, as stated before, the sample size needs to be large for sufficient coverage of samples, and on top of that, the FID score is biased. That means your FID score will change based on the number of samples you use, which isn't great, because your GAN isn't changing and neither are the real samples, so the number of samples shouldn't impact the score. But lo and behold, FID is typically lower for larger sample sizes. The more samples you use, the better your GAN will seem to be, even if it actually isn't better, because you're using the same model to compare. This isn't a good property to have in a metric, but there's certainly research trying to overcome this, creating an unbiased FID, for example, though that method isn't widely used. The large sample size also leads to FID being slow to run. Lastly, there are a limited number of statistics gathered on those distributions: only the mean and covariance, which are the first two moments of a distribution, or properties of that distribution. There are many other moments of a distribution, like skew and kurtosis, that you might be familiar with.
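If you're curious what those higher moments look like on a set of embeddings, here's a tiny sketch with scipy; the shapes and the random stand-in data are purely illustrative:

```python
import numpy as np
from scipy import stats

# Per-dimension skew and kurtosis of a set of embeddings: higher moments
# of the distribution that FID never looks at, since it only uses the
# mean and covariance. Random data stands in for real embeddings here.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50_000, 2048))

per_dim_skew = stats.skew(embeddings, axis=0)
per_dim_kurtosis = stats.kurtosis(embeddings, axis=0)
print(per_dim_skew.shape, per_dim_kurtosis.shape)  # (2048,) (2048,)
```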
Just remember that mean and covariance don't cover all aspects of a distribution. We're largely using just the mean and covariance because we're assuming the distribution is multivariate normal, but the distribution of your samples is likely not exactly multivariate normal; that assumption is made to make computing and comparing the statistics much easier. These shortcomings unfortunately mean that you still need to qualitatively look over your samples to find the model you want and to debug your model. The problem is that existing evaluation metrics are not exactly the best indicators of progress right now. Nevertheless, FID is still one of the most popular ways to evaluate your GAN, because it's easy to implement and use.

All right, that was a beefy video. In summary, you learned how FID calculates the difference between the real and fake embeddings by using Frechet distance. There are a few shortcomings to FID that mean you should still be babysitting your model and checking out its samples, instead of just comparing FID scores between saved model checkpoints during training to decide which one is best.