In this video, we're going to talk about how deep learning and convolutional neural networks can be adapted to solve semantic segmentation tasks in computer vision. Semantic segmentation with convolutional neural networks effectively means classifying each pixel in the image. The idea is to produce a map of the fully detected object areas in the image: what we want is an output image, like the one on the slide, where every pixel has a label associated with it. In this chapter, we're going to learn how convolutional neural networks can do that job for us.

The naive approach is to reduce the segmentation task to a classification one. It is based on the observation that the activation maps induced by the hidden layers when passing an image through a CNN can give us useful information about which pixels have more activation for which class. Our plan is to convert a normal CNN used for classification into a fully convolutional network used for segmentation. First, we take a pre-trained convolutional neural network, such as one pre-trained for classification on ImageNet; you can choose your favorite model, like AlexNet, VGG, or ResNet. Then we convert the last fully connected layers into convolutional layers with a one-by-one receptive field. When we do this, we gain some form of localization if we look at where we have more activation. An optional step is to fine-tune the fully convolutional network on the segmentation task itself.

An important point to note here is that the loss function we use in this image segmentation scenario is still the usual loss function for classification, multi-class cross-entropy, and not something like the L2 loss, which we would normally use when the output is an image. This is because, despite what you might think, we're actually just assigning a class to each output pixel, so this is a classification problem. The problem with this approach is that we lose some resolution, because the activation maps are downscaled at many steps. An example with a cyclist is on the slide.

A different approach to solving semantic segmentation via deep learning is based on a downsampling-upsampling architecture, where the left and right parts have the same size in terms of the number of trainable parameters. This approach is also called the encoder-decoder architecture. The main idea is to take the input image of size n times m, compress it with a sequence of convolutions, and then decompress it to get an output of the original size n times m. How can we do that? To preserve the information, we can use skip connections, or reverse all the convolution and pooling layers by applying unpooling and transposed convolution operations in the decoder part of the network, at the same places where max pooling and convolution are applied in the encoder part.

A working example of such an architecture is the SegNet model, featuring an encoder, or downsampling part, identical to VGG, and a corresponding decoder, or upsampling part. While possessing many learnable parameters, the model performed well for road scene segmentation on the CamVid dataset, while slightly underperforming on the segmentation of medical images.

Let's look at the details of the transposed convolution employed in the SegNet model. Basically, the idea is to scale the data back up, undoing the downscaling effect of all the previous layers. In fact, the forward propagation of an upsampling, or transposed, convolution is the back propagation of a regular convolution, and its back propagation is a regular convolution's forward propagation.
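To make this relationship concrete, here is a minimal sketch of how a strided convolution and a transposed convolution with matching parameters undo each other's change in spatial resolution. The lecture doesn't name a framework, so the choice of PyTorch and all the layer sizes below are illustrative assumptions, not part of the original material:

```python
import torch
import torch.nn as nn

# A stride-2 convolution halves the spatial resolution of the feature map...
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
# ...and a transposed convolution with the same stride restores it.
up = nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2,
                        padding=1, output_padding=1)

x = torch.randn(1, 64, 32, 32)   # batch x channels x height x width
y = down(x)
z = up(y)
print(y.shape)  # torch.Size([1, 128, 16, 16])
print(z.shape)  # torch.Size([1, 64, 32, 32]) -- original resolution recovered
```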
The easiest way to obtain the result of a transposed convolution is to apply an equivalent direct convolution: the kernel and stride sizes remain the same, but now we should use zero padding of the appropriate size.

For a better understanding of the downsampling-upsampling architecture, we also need to study the mechanism of unpooling. The max pooling operation is not invertible, so one has to consider approximations to its inverse. The easiest way is to use resampling and interpolation: we take the input image, rescale it to the desired size, and calculate the pixel value at each point using an interpolation method such as bilinear interpolation. Another idea is the "bed of nails" approach, where we either duplicate the entry over the whole empty block, or place the entry in the top-left corner of the block and fill the rest with zeros. Yet another, effective, mechanism is the following: we record the positions, called max location switches, where we located the biggest values during normal max pooling, and then use these positions to reconstruct the data from the layer above.

U-Net, yet another model, is a downsampling-upsampling architecture, illustrated on the slide. The downsampling part follows the typical architecture of a convolutional network. It consists of the repeated application of two three-by-three unpadded convolutions, each followed by a rectified linear unit, and a two-by-two max pooling operation with stride two for downsampling. At each downsampling step, we double the number of feature channels. Every step in the upsampling part consists of a transposed convolution of the feature map, a two-by-two up-convolution that upsamples the data and halves the number of feature channels, and a concatenation with the correspondingly cropped feature map from the downsampling part; this is implemented via a skip connection. The result is convolved by two three-by-three convolutional layers, each followed by a rectified linear unit. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer, a one-by-one convolution is used to map each 64-component feature vector to the desired number of classes. In total, the network has 23 convolutional layers. U-Net performs well on medical image segmentation tasks.

To summarize, you can view semantic segmentation as pixel-wise classification. You could just directly apply a pre-trained convolutional neural network; however, encoder-decoder style architectures seem to be more effective at these tasks. The decoder network, which has to upsample the internal representation of the data, uses specialized layers such as transposed convolutions and unpooling to increase the spatial resolution of the produced representation, ending up with the same dimensionality as the input image. Also, what people use a lot is skip connections, which help propagate gradients back and forth along the network.
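To tie the summary together, here is a small sketch of the two decoder mechanisms we discussed: SegNet-style unpooling driven by max location switches, and a U-Net-style skip connection that concatenates encoder and decoder feature maps. Again, PyTorch and the tensor sizes are assumptions made purely for illustration:

```python
import torch
import torch.nn as nn

# SegNet-style unpooling: max pooling records the "max location switches"
# (indices of the maxima), and unpooling puts values back at exactly those
# positions, filling the rest of each block with zeros.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)           # batch x channels x height x width
pooled, switches = pool(x)               # (1, 64, 16, 16) plus the switches
restored = unpool(pooled, switches)      # (1, 64, 32, 32), zeros off the maxima

# U-Net-style skip connection: upsample the decoder features with a
# transposed convolution, then concatenate them with the matching encoder
# feature map along the channel dimension.
decoder_features = torch.randn(1, 128, 16, 16)
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # halves the channels
upsampled = up(decoder_features)         # (1, 64, 32, 32)
encoder_features = restored              # stand-in for the encoder's skip output
combined = torch.cat([upsampled, encoder_features], dim=1)
print(combined.shape)                    # torch.Size([1, 128, 32, 32])
```

Note the design difference: unpooling reuses positions remembered from the encoder, so it needs no extra parameters, while the skip connection copies the encoder's features themselves, which is what lets U-Net recover fine spatial detail.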