A batch normalization layer dynamically normalizes its inputs depending on the mini-batch statistics. This has several positive effects:
The covariate shift over batches during training will be reduced.
The learning rate can be increased since there is a much smaller risk of getting very large input values.
Overfitting is reduced, since the statistics for each batch will be slightly different, adding a bit of noise to the training. This often leads to that the dropout rate can be lowered.
During training, the batch normalization layer is also keeping and updating population statistics, which is later used in inference time (when using batch statistics might not be suitable).
The operation is implemented as:
Trainable: whether we want the training algorithm to change the value of the weights during training. In some cases, one will want to keep parts of the network static, e.g., when using the encoder part of an autoencoder as preprocessing for another model.