2D Convolution

The 2D Convolution block represents a layer that can be used to detect spatial features in an image, either working directly on the image data or on the output of previous convolution blocks.

A filter can, for example, detect edges. Historically, computer vision and image processing relied on fixed-shape feature detectors; convolutional neural networks instead learn the best filter shapes for the task at hand.

How does 2D convolution work?

Figure 1. 2D Convolution. 1 filter (channel), height 3, width 3, stride 1, and 0 padding.

Each block is composed of a number of filters, where each filter is a Height x Width x Channels tensor of trainable weights.

A convolution operation is performed between the image and each filter, producing as output a new image, called the output tensor. Its height and width are determined by:

  • Size of the input image.

  • Height and width of the filter.

  • Stride. The output tensor’s height and width are roughly inversely proportional to the stride.

  • Padding.

The output tensor has as many channels as the number of filters.

Every value in the tensor is then fed through an Activation function to introduce a nonlinearity.
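The steps above can be sketched in plain Python. This is only an illustrative sketch, not the block's actual implementation: the function names (relu, zero_pad, conv2d) and the example image are made up for this page, the bias term is omitted, and the filter is applied without flipping (cross-correlation), which is the usual convention in deep learning libraries.

```python
def relu(x):
    """Rectified linear unit: negative values become 0."""
    return max(0.0, x)


def zero_pad(image, pad):
    """Return a copy of an H x W x C image with `pad` pixels of zeros on every side."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    padded = [[[0.0] * channels for _ in range(width + 2 * pad)]
              for _ in range(height + 2 * pad)]
    for y in range(height):
        for x in range(width):
            padded[y + pad][x + pad] = list(image[y][x])
    return padded


def conv2d(image, filters, stride=1, pad=0, activation=relu):
    """Hypothetical helper: convolve an H x W x C image with a list of kH x kW x C filters.

    Returns an out_h x out_w x len(filters) nested list.
    """
    image = zero_pad(image, pad)
    height, width = len(image), len(image[0])
    k_h, k_w = len(filters[0]), len(filters[0][0])
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1
    output = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            pixel = []
            for f in filters:  # one output channel per filter
                total = 0.0
                for dy in range(k_h):
                    for dx in range(k_w):
                        for c, weight in enumerate(f[dy][dx]):
                            total += weight * image[oy * stride + dy][ox * stride + dx][c]
                pixel.append(activation(total))
            row.append(pixel)
        output.append(row)
    return output


# A 4 x 5 single-channel image that is dark on the left and bright on the right.
image = [[[v] for v in (0.0, 0.0, 0.0, 1.0, 1.0)] for _ in range(4)]
# Two hand-written 3 x 3 x 1 filters; in a real network these weights are learned.
dark_to_bright = [[[-1.0], [0.0], [1.0]] for _ in range(3)]
bright_to_dark = [[[1.0], [0.0], [-1.0]] for _ in range(3)]

out = conv2d(image, [dark_to_bright, bright_to_dark])
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 2
print(out[0])  # [[0.0, 0.0], [3.0, 0.0], [3.0, 0.0]]
```

The first output channel responds where the image goes from dark to bright, while the second filter produces negative values there, which the ReLU activation sets to 0.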

Each pixel in the output tensor represents how strongly the corresponding feature is present in the HxW area centered on that pixel. A feature can be:

  • An edge.

  • A color gradient in the original image.

  • A certain configuration of edges in a deeper layer of the network.


Stride is the number of pixels by which a filter moves at each step of the convolution; the default is 1 pixel at a time. The bigger the stride, the smaller the output image will be along the corresponding axis. This can be used to reduce the memory and computation needed by the layers that follow, but it leads to a loss of resolution.
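The effect of the stride on the output size can be checked with the standard output-size formula for convolutions. The sketch below assumes that formula applies here; the function name conv_output_size and the input size of 28 pixels are chosen only for illustration.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output length along one axis; `padding` is the number of zeros added on each side."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# A 3-pixel-wide filter sliding over a 28-pixel-wide input with different strides.
for stride in (1, 2, 3):
    print(stride, conv_output_size(28, 3, stride=stride))
# 1 26
# 2 13
# 3 9
```

The same function applies independently to the horizontal and the vertical axis, so the two strides can differ.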


Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size.

By default, a convolutional layer returns a smaller image than its input. If many convolutional layers are strung together, the output image is progressively reduced in size until, eventually, it might become unusable. Padding an image (i.e., "increasing" its size) before a convolutional layer mitigates this effect.
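To make the shrinking effect concrete, the sketch below tracks the image size along one axis through a stack of layers with stride 1. The helper name stack_sizes, the 11-pixel input, and the choice of five layers are illustrative assumptions.

```python
def stack_sizes(input_size, num_layers, filter_size=3, padding=0):
    """Sizes along one axis after each of `num_layers` stride-1 convolutions."""
    sizes = [input_size]
    for _ in range(num_layers):
        # With stride 1, each layer removes (filter_size - 1) pixels and
        # padding adds back `padding` pixels on each side.
        sizes.append(sizes[-1] + 2 * padding - filter_size + 1)
    return sizes


print(stack_sizes(11, 5))             # [11, 9, 7, 5, 3, 1]
print(stack_sizes(11, 5, padding=1))  # [11, 11, 11, 11, 11, 11]
```

With a 3 x 3 filter, one pixel of zero padding on each side is exactly enough to keep the size constant from layer to layer.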


Number of filters: The number of convolutional filters to include in the layer

Width of filter: The width of the weight matrix of a single filter. Default: 3

Height of filter: The height of the weight matrix of a single filter. Default: 3

Horizontal stride: The number of pixels to move while performing the convolution along the horizontal axis. Default: 1

Vertical stride: The number of pixels to move while performing the convolution along the vertical axis. Default: 1

Activation: The function that will be applied to each element of the output. Default: ReLU

Padding: "Same" pads the input so that the output has the same height and width as the original input (for a stride of 1). "Valid" means no padding.

Trainable: Whether we want the training algorithm to change the value of the weights during training. In some cases one will want to keep parts of the network static, for instance when using the encoder part of an autoencoder as preprocessing for another model.
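The difference between the Same and Valid padding options can be illustrated along a single axis. The sketch below assumes the convention used by TensorFlow, where Same produces an output of size ceil(input / stride) and puts any odd leftover pixel of padding at the end; whether this block splits the padding in exactly the same way is an assumption. The function name padding_mode_sizes is hypothetical.

```python
import math


def padding_mode_sizes(input_size, filter_size, stride=1, mode="valid"):
    """Return (output_size, pad_before, pad_after) along one axis."""
    if mode == "valid":
        # No padding: the filter must fit entirely inside the input.
        return (input_size - filter_size) // stride + 1, 0, 0
    if mode == "same":
        output_size = math.ceil(input_size / stride)
        total = max((output_size - 1) * stride + filter_size - input_size, 0)
        return output_size, total // 2, total - total // 2
    raise ValueError("mode must be 'valid' or 'same'")


print(padding_mode_sizes(28, 3, stride=1, mode="valid"))  # (26, 0, 0)
print(padding_mode_sizes(28, 3, stride=1, mode="same"))   # (28, 1, 1)
print(padding_mode_sizes(28, 3, stride=2, mode="same"))   # (14, 0, 1)
```

With a stride of 1, Same keeps the output the same size as the input; with a larger stride, the output still shrinks by roughly the stride factor.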
