A batch normalization layer normalizes its inputs using statistics computed over each mini-batch. This has several positive effects:
The covariate shift over batches during training will be reduced.
The learning rate can be increased since there is a much smaller risk of getting very large input values.
Overfitting is reduced, since the statistics for each batch will be slightly different, adding a bit of noise to the training. This often means that the dropout rate can be lowered.
During training, the batch normalization layer also maintains running estimates of the population statistics, which are later used at inference time (when batch statistics might not be suitable).
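The two modes described above can be sketched as follows. This is a minimal illustration, not a production implementation: the function name, the momentum value, and the exponential-moving-average update rule are illustrative assumptions, and the input is assumed to be a 2-D array of shape (batch, features).

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.9, eps=1e-5):
    """Illustrative batch normalization for a 2-D input (batch, features).

    In training mode, normalize with the mini-batch statistics and update
    the running (population) estimates; in inference mode, normalize with
    the running estimates instead.
    """
    if training:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Exponential moving average of the population statistics
        # (momentum value chosen for illustration).
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta, running_mean, running_var
```

After training-mode calls, each feature of the output has approximately zero mean and unit variance over the batch, while the learned gamma and beta let the layer recover any scale and shift it needs.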
The operation is implemented as: