The basic idea is that you first create and train an experiment containing a pretrained snippet, whose blocks are set to non-trainable, and some Dense blocks, which are trainable. Then, if you wish to fine-tune the network further, you duplicate this experiment, set all blocks to trainable, and train the new experiment with a very low learning rate.
If you have images as input, this is the way to do it.
Create a new experiment and click on a pretrained snippet in the Inspector to add it to the modeling canvas.
Choose weights pretrained on a dataset that closely resembles the one you want to use for this model, e.g., ImageNet if your input is images.
Set Weights trainable to No. Why? See below.
Set the input feature.
Add one or two Dense blocks. Use two if you have large images and one if your images are close to 32x32.
Add a Target block.
Make sure the number of target nodes matches the number of classes you need to predict, set loss to Categorical crossentropy and the activation in the last layer to Softmax.
Define a suitable batch size to make sure your model fits in memory, then click Run!
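Outside the platform, the setup above corresponds roughly to the following Keras sketch. The input shape, the Dense sizes, and the class count are illustrative assumptions, not the platform's exact configuration; `weights=None` is used only so the sketch runs without downloading the pretrained weights (a real experiment would use `weights="imagenet"`).

```python
# Rough Keras equivalent of the steps above (sizes/shapes are assumptions).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pretrained snippet. In a real run use weights="imagenet";
# weights=None here only avoids the download in this sketch.
base = VGG16(weights=None, include_top=False, input_shape=(64, 64, 3))
base.trainable = False  # "Weights trainable" set to No

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # one or two Dense blocks
    layers.Dense(10, activation="softmax"),  # Target: 10 classes assumed
])

# Categorical crossentropy loss, softmax already set on the last layer.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32)  # batch size that fits memory
```

Only the two Dense layers are updated during training; the frozen VGG16 weights are left untouched.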
Select Duplicate in the Experiment option menu.
Set Copy with weights to the best epoch.
Expand the VGG16 feature extractor.
Select all trainable blocks, i.e., 2D Convolution and Dense, by holding down Shift and clicking on each block.
Set them all to trainable.
Set the learning rate to 0.00001, that is, 1% of the default learning rate.
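In Keras terms, this fine-tuning stage amounts to unfreezing the layers and re-compiling with the much lower learning rate. The small stand-in model below is an assumption to keep the sketch self-contained; on the platform you would be working with the duplicated experiment instead:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

# Stand-in for the duplicated experiment (weights copied from best epoch).
model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(32,)),
    layers.Dense(3, activation="softmax"),
])

# Set all blocks to trainable...
for layer in model.layers:
    layer.trainable = True

# ...and re-compile with 1% of the default learning rate (0.001 -> 0.00001).
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="categorical_crossentropy")
```

Re-compiling after changing `trainable` matters: the trainable/frozen state is baked in at compile time.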
This method is used to avoid catastrophic forgetting, which can easily happen when the weights learned by an old model are changed to meet the objectives of a new model. Read more about catastrophic forgetting in this paper: Overcoming catastrophic forgetting in neural networks.
The nice part about pretrained snippets is that they have already learned useful representations from the dataset they were trained on. This stored knowledge can be reused in new experiments.
But you lose this knowledge if you set the weights of the pretrained snippet to trainable and then run the experiment. There is a great risk that all stored knowledge will be lost after a batch or two. We don’t want that to happen!
Why? Because the weights in the last blocks of the model, the ones you add, are initialized at random. That is, they are totally off from the start. These are the weights you want to train.
When the first batch reaches the Target block, the loss function is calculated. The loss indicates the magnitude of the error your model made in its prediction; if your predictions are totally off, your loss will be high.
The loss is then used to determine how to change the weights in the last block to achieve a lower loss. This information on how to achieve a lower loss is then, in turn, propagated backward through the network and used in the same way for all earlier trainable blocks.
The loss for the first batch will be high due to the random weights in the last blocks. If you have then set the weights in the pretrained snippet to trainable, there is a great risk that precisely the representations we wanted to transfer from the previous task are lost.
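The size of each weight update is the gradient scaled by the learning rate, which is why the 1% learning rate protects the pretrained weights during those first high-loss batches. A toy calculation, with all numbers invented purely for illustration:

```python
# Toy numbers, for illustration only: one gradient-descent step on a single
# pretrained weight while the loss (and hence the gradient) is still large.
w = 0.5                # a useful pretrained weight we want to keep
grad = 4.0             # large gradient caused by the random last blocks

default_lr = 0.001
finetune_lr = 0.00001  # 1% of the default

# update = learning_rate * gradient
print(w - default_lr * grad)    # the weight drifts noticeably
print(w - finetune_lr * grad)   # the weight is almost unchanged
```

After many such batches, the drift under the default learning rate compounds and the pretrained representations are overwritten; with the tiny learning rate, they survive while the random last blocks still get corrected.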