The 2D Convolution block represents a layer that can be used to detect spatial features in an image, either working directly on the image data or on the output of previous convolution blocks.
Each layer is composed by a configurable number of filters, where each filter is a HxWxC matrix of trainable weights; a convolution operation is performed between the image and each filter, producing as output a new image (tensor in DL) with height and weight determined by the input image, stride and padding (the output height and weight are inversely proportional to the stride) and as many channels as the number of filters. Every value in the tensor is then fed through an activation function to introduce a nonlinearity.
Each pixel in the image represents how strongly the corresponding feature (which could be an edge, a color gradient in the original image, or a certain configuration of edges in a deeper layer of the network) is present in the HxW area centered on that pixel.
This is a nice example of a filter that can detect edges; while historically computer vision and image processing relied on fixed shape feature detectors, convolutional neural networks learn the best filter shapes for the task at hand.
Every layer has (filter width) x (filter height) x (number of channels in the layer input) x (number of filters) weights. This is typically much smaller than the size of the input image, unlike what would happen for dense layers (where each pixel would get a weight for every neuron).
The default is to move filters by 1 pixel at a time when performing convolutions; this is called stride and it can be altered by the user. The bigger the stride, the smaller the output image will be along the corresponding axis. This can be used to reduce the number of parameters and memory used, but leads to a loss of resolution.
Convolutional layers need many fewer parameters than dense layers to process an image, because the same feature is checked for in every part of the input data (this is what the convolution operation does) We can do this because look the same irrespective of the position in the image.
Number of filters: The number of convolutional filters to include in the layer
Width of filter: The width of the weight matrix of a single filter. Default: 3
Height of filter: The width of the weight matrix of a single filter. Default: 3
Horizontal stride: The number of pixels to move while performing the convolution along the horizontal axis. Default: 1
Vertical stride: Default: The number of pixels to move while performing the convolution along the vertical axis.
Activation: The function that will applied to each element of the output. Default: ReLu
Padding: Same (output (height) x (width) is the same as the input) or valid (output (height) x (width) is smaller than the input)
Trainable: whether we want the training algorithm to change the value of the weights during training. In some cases one will want to keep parts of the network static, for instance when using the encoder part of an autoencoder as preprocessing for another model.