Let’s discuss how these layers work to summarize and extract meaning from pixel data. Convolutional layers use filters to scan over local regions of the input space, which are just collections of pixels near one another, and capture information about the features in those regions. Each filter is a small square matrix with learnable parameters, usually much smaller than the image itself, such as 3x3 or 5x5. The filter is tiled across the input space, and at each position the cross-correlation between the filter and the same-sized region of the input space beneath it is computed. These values form a compact representation of the input space: each application of the filter produces a single number, so the filter shrinks the input space.

When the input data has three color channels, the filter has a slice for each channel. So instead of a 3x3 or 5x5 filter, we end up with a 3x3x3 or 5x5x3 filter. In the following discussion, we focus on how the convolutional layer processes data from one channel, but this process occurs for each channel.

The tiling of the filter across the input space creates the feature map. Each entry is computed as the cross-correlation between the filter and a same-sized region of the input space. This cross-correlation is simply a dot product between the filter and that region: we take products between corresponding cells and then sum the results. The numbers in the filter are the learned parameters we train, and a bias is included as another learned parameter. The number computed at each position is saved in the feature map at the location corresponding to where the filter sits on the original image. The stride is the number of steps we take each time we move the filter to a new location. In this case, a stride of 1 means we move the filter one pixel to the right on the input space, generating a value for the feature map at each position.
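The sliding dot-product-plus-bias computation described above can be sketched in a few lines of NumPy. This is a minimal single-channel illustration, not a production implementation; the function name, the toy 6x6 input, and the averaging filter are all choices made for the example.

```python
import numpy as np

def cross_correlate(image, kernel, bias=0.0, stride=1):
    """Slide the filter across the input, taking a dot product at each position."""
    H, W = image.shape
    f = kernel.shape[0]                      # square filter of size f x f
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i * stride : i * stride + f,
                           j * stride : j * stride + f]
            # products between corresponding cells, summed, plus the learned bias
            feature_map[i, j] = np.sum(region * kernel) + bias
    return feature_map

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 single-channel input
kernel = np.ones((3, 3)) / 9.0                     # illustrative 3x3 filter
fmap = cross_correlate(image, kernel)
print(fmap.shape)  # (4, 4)
```

Each filter position yields one number, which is why the 6x6 input collapses to a 4x4 feature map here.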
This process is repeated, moving the filter across the entire input space to generate the feature map. The stride determines both the horizontal and vertical movement of the filter. A stride of 2 means moving the filter two pixels at a time and creates a smaller feature map than a stride of 1. In this case, a stride of 2 would be invalid because an extra column of the input space would be ignored by the filter at the end.

The size of the input space can be modified slightly by using padding dimensions, which are rows and columns of zeros added to the edges of the input space. The filter can then be tiled onto the edges, multiplying some of its weights by the zero padding. The idea is to preserve more information about the boundary in the feature map than would survive without the padding dimensions. Because the padding is all zeros, its values do not affect the values in the feature map, but the filter can be applied at the edge of the input space a greater number of times.

The choices of filter dimension, padding dimension, and stride depend on both the size of the input and the desired size of the feature map, which must have an integer dimension. The formula relating the feature map dimension to the input dimension, filter dimension, padding dimension, and stride is

feature map dimension = (input dimension − filter dimension + 2 × padding dimension) / stride + 1,

where the padding dimension counts the rows or columns of zeros added to each side. This formula is really a constraint that says this function of the hyperparameters must be an integer. The selection of the stride and the filter dimension thus depends on the size we want for the feature map. A larger feature map encodes more information from the original image. In the example we went through, we take a 6x6 input space and, using a 3x3 filter with a stride of 1, create a 4x4 feature map, more than a 50% decrease in overall data size (36 values down to 16).
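The integer constraint above is easy to check directly. Here is a small sketch, with a hypothetical helper name, that evaluates the feature-map dimension formula and rejects configurations that do not come out to a whole number:

```python
def output_dim(input_dim, filter_dim, padding, stride):
    """Feature-map size: (I - F + 2P) / S + 1, which must be an integer."""
    size = (input_dim - filter_dim + 2 * padding) / stride + 1
    if not size.is_integer():
        raise ValueError(f"invalid configuration: non-integer output size {size}")
    return int(size)

print(output_dim(6, 3, 0, 1))  # 4: the 6x6 input, 3x3 filter, stride 1 example
print(output_dim(6, 3, 1, 1))  # 6: one ring of zero padding preserves the size
# output_dim(6, 3, 0, 2) raises: (6 - 3)/2 + 1 = 2.5 is not an integer
```

The stride-2 case fails exactly as described in the text: the filter cannot land cleanly on the final column, so that column would be ignored.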