The next component of Pix2Pix is the upgraded generator, which is based on U-Net. So first let's take a look at what the U-Net architecture framework looks like. It's an encoder-decoder model that uses skip connections in between as well, and then you'll see how this plays in as the Pix2Pix generator. U-Net in general has been a very successful computer vision model for image segmentation. What segmentation is is taking a real image and getting a segmentation mask, or labels on every single pixel of that image in terms of what object it is. So you're labeling cars, you're labeling crosswalks, you're labeling trees, you're labeling the road, you're labeling pedestrians, here for, say, a self-driving car application. The segmentation task is very much an image-to-image translation task, but there is a correct answer in terms of what each pixel is and which class each pixel belongs to. So a pixel on a car is definitely going to be a car, and you can't segment it any other way.

But when it comes to generating things, or what Pix2Pix is really good at, it's something without a correct answer. So for this image of this car here, there's no correct answer in terms of what that car could be. I mean, it could be this car back here in that image, but since it's just a segmentation mask, realistically it could be a different car, any of these other cars. All of these cars are technically correct if you were to look at them as a human, right? So this is also an image-to-image translation task, and what's important to note is that Pix2Pix uses the U-Net architecture for its generator, because it's good at taking in an input image and mapping it to an output image. Typically it's used for image segmentation, but Pix2Pix wants to use it for this generation task as well, and what Pix2Pix is largely able to do is go back and forth between these two. And remember that the traditional generator architecture takes in a tiny noise vector; the U-Net generator actually takes in an entire image, and that's what this x is here. This means that the generator has to be a lot beefier, with convolutions to handle that image input.

So the architecture of the Pix2Pix generator is this encoder-decoder structure, where first you encode things. You can imagine an encoder being close to a classification model, because you take in an image and then you output maybe a value here of what class it is, a cat or a dog. You can also think of this middle layer as being a bottleneck, where it is able to embed the information in this image. So all the important information in that image is compressed into this little bit of space, just so you can get those high-level features, those important features, and then decode it into an output y, another image. This might remind you of an autoencoder, except for an autoencoder you want y to be as close as possible to x, and here you don't want that. You want y to be a different style conditioned on x. However, since it's easy to overfit these networks to your training image pairs, U-Net also introduces skip connections from the encoder to the decoder, and this is also really useful for getting information that might have been lost during the encoding stage. So what happens is that during the encoding stage, every single block that is the same resolution as its corresponding block in the decoding stage gets this extra connection to concatenate with that value.
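To make that concatenation concrete, here's a minimal sketch in PyTorch (the framework and the exact shapes here are just assumptions for illustration, not taken from the lecture): an encoder feature map and a decoder feature map at the same resolution are concatenated along the channel dimension before going into the next decoder block.

```python
import torch

# Hypothetical feature maps at the same 128 x 128 resolution
# (batch, channels, height, width); channel counts are illustrative.
enc_feat = torch.randn(1, 64, 128, 128)  # from an encoder block (the skip)
dec_feat = torch.randn(1, 64, 128, 128)  # from the previous decoder block

# The skip connection: concatenate along the channel dimension, so the next
# decoder block sees both the decoder activations and the encoder details.
skip_input = torch.cat([dec_feat, enc_feat], dim=1)
print(skip_input.shape)  # torch.Size([1, 128, 128, 128])
```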
That way, information that might have been compressed too much can still trickle through and still get to some of these later layers. These skip connections are concatenated from the encoder before going into each convolutional block in the decoder. Skip connections are pretty standard in convolutional neural networks, CNNs, and they largely just allow information to flow from earlier layers to later layers. This information could be added or concatenated, but somehow included in the later layers; here it's concatenated. So this makes it easy to get certain details that the encoder may have lost in the process of downsampling over to the decoder, and that means those finer-grained details. It's largely about information flow, and this is in the forward pass, of course. In the backward pass, skip connections can also improve gradient flow as you go backwards. Skip connections were by and large introduced to help with the vanishing gradient problem when you stack too many layers together: the gradient gets so tiny when it's multiplied in backprop that it limits our networks from going deeper and having more layers. And what's cool in U-Net is that, in the backward pass, this very much does improve gradient flow to the encoder, so those layers can learn from information that might have been in the decoder here, too.

So first, you have your encoder, which takes in an image x of size, let's say, 256 by 256 in height and width, with three channels of color, RGB. In this example, that's the segmented image that is input as essentially the conditioning image, and it goes through all of these blocks. It goes through eight encoder blocks to compress that input, and each block downsamples the spatial size by a factor of two. So the output size of the encoder is 256 divided by 2 to the 8th, or a 1 x 1 spatial size, by the very end, with 512 channels to encode that information. So at the very end it's just 1 x 1 here in height and width, and each of these encoder blocks contains a convolutional layer, a BatchNorm layer, and a LeakyReLU activation. This might not come as a huge surprise to you, as you've seen multiple times that it's a stack of these blocks, similar to StyleGAN. The convolutions make your input smaller, halving the height and width with a stride of 2; that's specifically what these have. So note that a lot of the convolutions you've been seeing have a stride of 1, while these have a stride of two.

Meanwhile, on the decoder side, you have this input size of 1 x 1, this tiny, tiny input, and you have eight blocks again, but these are decoder blocks. Because you want to generate an output image with the same size as the encoder input, the decoder actually contains the same number of blocks as the encoder. Then you get y as output, which is your generated image, and it's the same size as your input: 256 by 256, with 3 channels for color. Each decoder block, which might not come as a surprise either, is composed of a transposed convolution, which takes your input and makes the output bigger, followed by a BatchNorm, and then a ReLU activation function. Dropout is added to this network, but it's actually just added to the first three blocks of this decoder. Dropout randomly disables different neurons at each iteration of training to allow different neurons to learn, and this is so that the same neurons aren't always learning the same thing. And this adds stochasticity, this noise, to the network.
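As a rough sketch of those two kinds of blocks (again assuming PyTorch, with kernel size, padding, and dropout rate chosen for illustration rather than quoted from the paper): an encoder block is a stride-2 convolution plus BatchNorm plus LeakyReLU, and a decoder block is a transposed convolution plus BatchNorm plus ReLU, with optional Dropout for the first three decoder blocks.

```python
import torch
import torch.nn as nn

def encoder_block(in_ch, out_ch):
    # Stride-2 convolution halves the height and width.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

def decoder_block(in_ch, out_ch, use_dropout=False):
    # Transposed convolution doubles the height and width.
    layers = [
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    ]
    if use_dropout:  # only used in the first three decoder blocks
        layers.append(nn.Dropout(0.5))
    return nn.Sequential(*layers)

# Quick shape check: one encoder block takes 256 x 256 down to 128 x 128.
x = torch.randn(2, 3, 256, 256)
print(encoder_block(3, 64)(x).shape)  # torch.Size([2, 64, 128, 128])
```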
Please note that this dropout noise is only present during training, and as with all uses of Dropout, it is typically turned off during inference or test time. At inference, the activations are also scaled to compensate, to maintain the same kind of distribution that the next layers expect. That's not super important to know, but just know that Dropout does add some kind of noise to this model. Remember that we're taking away the noise vector as input right now, so this is where stochasticity does seep into this model architecture, but only during training. During inference, you're not going to see that type of stochasticity, of course, nor be able to inject that type of noise like that.

Putting the two halves together, you can get this full encoder-decoder structure here, where the network takes in an input of 256 by 256 and outputs a generated image of the same size. The information about the input is passed through this small bottleneck of size 1 by 1, and this small spatial size can be understood as summarizing the encoding of that input image. You can think of the decoder as performing the inverse operations of the encoder, which is why they contain the same number of blocks, or eight blocks. So remember that the U-Net framework is a variation of the encoder-decoder framework, where the encodings from every single encoder block, or level, are passed, or forwarded, to the corresponding block or level in the decoder at the same resolution. U-Net integrates this information from the encoder into the decoder by using concatenation: concatenating the encoder outputs to each decoder input at each block or level where it's the same resolution, so it's easy to concatenate them.

So in summary, Pix2Pix uses a U-Net for its generator, and a U-Net is an encoder-decoder framework that uses skip connections that concatenate same-resolution (same block or same level) feature maps to each other from the encoder to the decoder. This helps the decoder learn more details from the encoder directly, in case there are finer details that are lost during the encoding stage. And the skip connections also help in the backward pass, of course, to help more gradient flow from the decoder back to the encoder.
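Putting all of that together, here's a simplified sketch of how such a U-Net generator could be wired up, reusing the hypothetical encoder_block and decoder_block helpers from the sketch above: eight downsampling blocks, eight upsampling steps, and skip connections that concatenate same-resolution feature maps from the encoder into the decoder. The channel progression and the final Tanh are assumptions for illustration, not the exact Pix2Pix hyperparameters.

```python
class UNetGeneratorSketch(nn.Module):
    def __init__(self, in_channels=3, out_channels=3, base=64):
        super().__init__()
        # Encoder channel progression (illustrative): 3 -> 64 -> 128 -> 256 -> 512 (x5)
        enc_chs = [in_channels, base, base * 2, base * 4] + [base * 8] * 5
        self.encoders = nn.ModuleList(
            [encoder_block(enc_chs[i], enc_chs[i + 1]) for i in range(8)]
        )
        # The decoder mirrors the encoder; after the first decoder block, each
        # block's input channels are doubled by the concatenated skip connection.
        self.decoders = nn.ModuleList()
        in_ch = enc_chs[-1]  # 512-channel bottleneck
        for i, out_ch in enumerate(reversed(enc_chs[1:-1])):
            self.decoders.append(decoder_block(in_ch, out_ch, use_dropout=(i < 3)))
            in_ch = out_ch * 2  # decoder output + same-size skip
        self.final = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_channels, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        # Drop the bottleneck itself and reverse, so skips line up with decoder blocks.
        skips = skips[:-1][::-1]
        for dec, skip in zip(self.decoders, skips):
            x = torch.cat([dec(x), skip], dim=1)  # skip connection by concatenation
        return self.final(x)

# Shape check: a 256 x 256, 3-channel input maps to a 256 x 256, 3-channel output.
# (Batch of 2 so BatchNorm can compute statistics at the 1 x 1 bottleneck.)
gen = UNetGeneratorSketch()
print(gen(torch.randn(2, 3, 256, 256)).shape)  # torch.Size([2, 3, 256, 256])
```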