Estimated time: 120 min

Classifying fruit

How to do an off-platform analysis of a classifier model

In this tutorial, you will apply transfer learning to solve a classification problem where the dataset has more than one hundred classes. The evaluation metrics on the Platform will indicate perfect validation metrics. But, as is often the case in machine learning, this does not necessarily mean that the model will always perform well on new data. For this reason, this tutorial will focus on methods that you can use outside the Platform to perform a more thorough analysis and further tune the performance.

Target audience: Data scientists and Developers


This tutorial requires that you are familiar with Python and Jupyter notebooks.

The problem - Improving a “perfect” model 

In deep learning projects, datasets are often split into three different subsets: training, validation, and test.

Training subset

A training subset is used to fit the weights of a model to minimize the training loss. A low training loss value means that the model can make accurate predictions on the training subset, but it does not mean that it will generalize well to new data.

Validation subset

In contrast to the training dataset, the validation subset does not have any impact on the weights. When this data is passed through the model, you will get calculated loss and other metrics that will help you to tune the architecture of your model, for instance, the number and type of blocks and the settings that are available within these blocks. Since the validation subset is not used directly to fit the weights, the resulting loss from using this data is a better measurement of how well the model will generalize.

Test subset

Since the validation data still plays a part in the tuning of the model and helping you to ensure that it is not overfitting, the evaluation metrics may give an overly optimistic view of how the model will generalize on real data. For this reason, it is considered good practice to reserve a part of the dataset for final testing only, i.e., the test subset. 
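The three-way split described above can be sketched with scikit-learn. This is a generic illustration, not part of the tutorial's workflow; the 70/15/15 proportions and the placeholder data are assumptions.

```python
# Hypothetical three-way split: carve off a test subset first, then split
# the remainder into training and validation. Fractions are illustrative.
from sklearn.model_selection import train_test_split

samples = list(range(1000))          # stand-in for image paths
labels = [i % 103 for i in samples]  # stand-in for the 103 fruit classes

# 15% of all samples become the test subset.
trainval, test, y_trainval, y_test = train_test_split(
    samples, labels, test_size=0.15, random_state=42)

# 150 of the remaining samples become the validation subset.
train, val, y_train, y_val = train_test_split(
    trainval, y_trainval, test_size=150, random_state=42)
```

The test subset is split off first so that no tuning decision, not even the train/validation ratio, can be influenced by it.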

100% accurate?

The model that you build in this tutorial will result in validation metrics that indicate that the model is 100% accurate. However, evaluation of the deployed model, using the test dataset, reveals that further improvements can still be achieved.

By using a Jupyter notebook, you will identify the worst misclassified samples and the classes with the lowest recall. With this information, you can then amend the training dataset to improve the performance further.

The fruit-360 dataset

The fruit-360 dataset contains 71,125 labeled RGB images, each depicting one of 103 types of fruit. The raw data images are stored in folders that are named according to class, i.e., the name of the fruit.

The images have the dimensions 100 x 100 pixels. You will keep the original dimensions, but some preprocessing of the dataset will still be needed before the dataset can be uploaded to the Platform. Most importantly, you will need to create a zip file that is structured in a supported format.

Create a dataset zip before uploading it

Before you can upload the dataset to the Platform, you first need to preprocess it. 

  1. Download and extract this zip file containing the Fruit-360 dataset.
  2. Download this Jupyter notebook and start it. Note that you must either clone or download the entire repository, or save the file in raw format with the extension .ipynb.
    $ jupyter notebook fruit_preprocessing.ipynb
  3. Update the input and output paths in the notebook.
  4. Run the notebook.

The notebook will perform the following tasks:

  • Load the raw image paths into the image column of a Pandas dataframe. 
  • Create a column that holds the class of each image. This information is retrieved from the folder name that holds the image.
  • Check that the images have the same number of channels. A mix of, e.g., grayscale and RGB images will cause errors in the Platform during training.
  • Create a second dataframe with a reduced number of images in one of the classes (“Apple Granny Smith”).
  • Ensure that all preprocessed images have the same dimensions (100 x 100 pixels).
  • Create datasets based on the two dataframes that can be uploaded to the Platform.
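The first few tasks can be sketched roughly as follows. The folder layout and the column names (image, fruit_class) follow the tutorial, but the function names and this exact code are illustrative, not the notebook's actual implementation.

```python
# Hedged sketch of the preprocessing: build a dataframe from the folder
# structure and fail fast on images with unexpected modes or dimensions.
from pathlib import Path

import pandas as pd
from PIL import Image

def build_dataframe(raw_dir: Path) -> pd.DataFrame:
    """Collect image paths; each class is the name of its parent folder."""
    rows = [{"image": str(p), "fruit_class": p.parent.name}
            for p in sorted(raw_dir.glob("*/*.jpg"))]
    return pd.DataFrame(rows, columns=["image", "fruit_class"])

def check_images(df: pd.DataFrame, size=(100, 100)) -> None:
    """Raise if any image is not RGB or does not match the expected size."""
    for p in df["image"]:
        with Image.open(p) as im:
            if im.mode != "RGB":
                raise ValueError(f"{p} is {im.mode}, expected RGB")
            if im.size != size:
                raise ValueError(f"{p} is {im.size}, expected {size}")
```

Catching a stray grayscale image here, before upload, is much cheaper than debugging a failed training run on the Platform.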

Upload the reduced dataset to the Platform

  1. Create a project and give it a name that tells you what kind of project it is.
  2. Navigate to the Datasets view and click New dataset.
  3. Locate the zip file that contains the reduced dataset.
  4. Drag and drop the file to the area under Upload data.
  5. Click Next.
  6. Name the dataset.
  7. Click Done.

Datasets view

Subsets of the Fruits-360 dataset

In the top right corner, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.

Why use the default random split?

In cases where you want to perform an off-platform analysis of a model based on the validation subset, it can be useful to create a split based on a conditional filter. For instance, you may want to create a column in the dataset that indicates if a sample should belong to the training or the validation subset. Using the same rule both on and off the Platform ensures that the training subset samples are not included in your analysis.
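One way to implement such a rule is to make the subset a deterministic function of each sample's path, so the same image always lands in the same subset wherever the rule is evaluated. The function below is a sketch; the subset names and the 20% fraction mirror the Platform's default split but are otherwise assumptions.

```python
# Deterministic rule-based split: the subset is a pure function of the path,
# so it can be reproduced identically on and off the Platform.
import hashlib

def assign_subset(image_path: str, val_fraction: float = 0.2) -> str:
    digest = hashlib.md5(image_path.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable pseudo-random bucket, 0-99
    return "validation" if bucket < val_fraction * 100 else "training"
```

Hashing the path instead of drawing random numbers means the assignment survives re-runs, re-uploads, and different environments.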

The raw dataset for this tutorial has a separate test subset, which you will use for the off-platform analysis; it should not be uploaded to the Platform. Consequently, the random split will work just as well as one based on a specific condition.

Save the dataset

You’ve now created a dataset ready to be used on the Platform. Click Save version and navigate to the Modeling view.

Design a pretrained model

The ImageNet dataset contains a large number of images of fruits. For this reason, you are likely to quickly get very good results using a pretrained snippet with weights from ImageNet.

Adding a pretrained snippet

  1. In the Modeling view, click New experiment. Name the model and click Create.
  2. Select VGG16 in the Inspector. A dialog will open where you can choose the weights and whether the blocks in the snippet should be trainable.
  3. Keep the default settings and click Create. The VGG16 blocks will be added as two collapsed groups to the canvas, VGG16 and Head. You can expand and collapse the group at any time by clicking +/-. An Input block and a Target block will also be added.
  4. Set Feature to image in the Input block.

VGG16 snippet

Replacing the Head

You cannot replace or add blocks within a group, but you can replace an entire group with individual blocks. You will do this next, since the Flatten block within the Head group should be replaced with a 2D Global average pooling block for flattening the data. Using this block works better in most cases, particularly when the inputs do not have the dimensions 224 x 224 pixels (the size of the images in the ImageNet dataset).

  1. Select the Head group and delete it.
  2. Add a 2D Global average pooling block.
  3. Add three Dense blocks.  
  4. Add a Target block.
  5. Set Nodes to 103 (the number of classes in the dataset) and Activation to Softmax in the last Dense block.
  6. Set Feature to fruit_class and Loss to Categorical crossentropy in the Target block.
  7. Before you run the experiment, make a copy of it: click Duplicate in the Experiment options menu. You will run a near-identical experiment with a different dataset later in the tutorial, and the pretrained weights will be included in the copy.
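For reference, a rough Keras equivalent of this graph looks like the sketch below. The sizes of the first two Dense blocks are assumptions (the tutorial only fixes the last one), and weights=None is used here to avoid downloading the ImageNet weights that the Platform snippet actually uses.

```python
import tensorflow as tf

# Frozen VGG16 backbone; on the Platform this carries ImageNet weights.
base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                   input_shape=(100, 100, 3))
base.trainable = False  # only the new head is trained in the first experiment

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),          # replaces Flatten
    tf.keras.layers.Dense(256, activation="relu"),     # illustrative size
    tf.keras.layers.Dense(128, activation="relu"),     # illustrative size
    tf.keras.layers.Dense(103, activation="softmax"),  # one node per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Global average pooling reduces each of the backbone's feature maps to a single value, so the head works regardless of the spatial size of the input.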

The model

Running the first experiment

Click the original version of the experiment in the Experiment options menu.

Train the Dense blocks of the model for ten epochs with the Adam optimizer. Everything should already be set up correctly, so just click Run.

Analyzing the first experiment

Platform evaluation

Go to the Evaluation view. Since the model solves a classification problem, a confusion matrix is displayed. The top-left to bottom-right diagonal shows correct predictions; everything outside this diagonal represents errors. This may be surprising, but the metrics and confusion matrix will indicate that the model is 100% accurate!

Model evaluation - Confusion matrix

Model evaluation - Error graph

Does this mean that the model performs just as well on the test data? To find out, you will need to compare the predictions of the deployed model with ground truth, i.e., the labels in the test data. You can make this comparison off-platform using a Jupyter notebook.

Note that when you get results that are as perfect as this, it is usually a good reason to be suspicious since it may indicate that there is a problem with the dataset:

  • Identical images may appear in both the training and validation subsets. This will give an incorrect view of how well the model generalizes to new data.
  • The weights in the VGG16 snippet were pretrained on ImageNet. In the case of this tutorial, some of the fruit images may actually be included in ImageNet. Again, this will skew the results to make the model appear to perform better than it would in reality.
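The first suspicion is easy to test off-platform: hash the file contents and look for byte-for-byte duplicates that appear in both subsets. This sketch only catches exact copies; near-duplicates would require perceptual hashing instead.

```python
# Find validation images whose exact byte content also appears in training.
# The subset path lists are placeholders you would fill in yourself.
import hashlib

def file_hash(path) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def find_duplicates(train_paths, val_paths):
    train_hashes = {file_hash(p) for p in train_paths}
    return [p for p in val_paths if file_hash(p) in train_hashes]
```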

Off-platform evaluation with Sidekick

  1. Go to the Deployment view and create a new deployment based on the best epoch of the last experiment. Make sure that the deployment is enabled.
  2. Download this notebook and start it.
    $ jupyter notebook fruit_analysis.ipynb
  3. Install Sidekick and any other required Python packages.
  4. Update the path to your dataset, and the URL and token for your deployment.
  5. Run the notebook.

The notebook will demonstrate how to:

  • Use Sidekick to get predictions.
  • List the worst misclassified samples, i.e., wrong predictions with a high level of confidence.
  • Calculate the precision, recall, F1-score, and contribution of each individual class.
  • Calculate the overall accuracy.

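The core of that analysis can be sketched without Sidekick: given an array of predicted class probabilities and the ground-truth labels, rank the errors by confidence and compute per-class metrics. The array shapes and the helper name below are assumptions, not the notebook's actual code.

```python
import numpy as np
from sklearn.metrics import classification_report

def worst_misclassified(probs: np.ndarray, y_true: np.ndarray, top_k: int = 5):
    """Return the wrong predictions with the highest confidence."""
    y_pred = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    wrong = np.flatnonzero(y_pred != y_true)
    order = wrong[np.argsort(-confidence[wrong])]  # most confident errors first
    return [(int(i), int(y_true[i]), int(y_pred[i]), float(confidence[i]))
            for i in order[:top_k]]
```

Per-class precision, recall, and F1-score then come from `classification_report(y_true, probs.argmax(axis=1))`.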
Why did it go wrong?

The notebook reveals that the model, although it is performing well on the test data, is not perfect. Among the worst misclassified samples you will find several instances of Apple Granny Smith. 

The exact number may vary between different experiments.

Misclassified samples

According to the classification report, Apple Granny Smith is also one of the classes with the lowest recall.

Classification report

The main reason for the low recall is obvious in this case. The dataset was manipulated to contain fewer images in the Apple Granny Smith class!

The notebook reveals two typical misclassifications: Apple Granny Smith samples are often predicted to be Pear or Apple Golden 3.

Left: Pear, center: Apple Granny Smith, right: Apple Golden 3

In a real-life production scenario, the solution to the problem would be to add more images, resembling those that you have just identified. For this tutorial, we simply need to train our model again, but with the complete dataset that contains all the images.

Before proceeding, make a note of the overall accuracy that is calculated in the notebook.

Time to improve - Run a second experiment

To confirm that the missing images in the reduced dataset caused the lower recall in the analysis, run a second experiment on the complete dataset:

  1. Locate the zip file that contains the complete dataset.
  2. Upload the dataset to the Platform and save it.
  3. Go to the Modeling view and select the copy of your experiment.
  4. Click Settings and change the dataset to the one that you have just created. 
  5. Set Feature to image in the Input block.
  6. Set Feature to fruit_class in the Target block.
  7. Click Run.

Analyze the second experiment

Platform evaluation

Go to the Evaluation view. As could be expected, the model is still 100% accurate on the validation data. Now let’s see if it also performs better on the test data, and the class Apple Granny Smith in particular.

Off-platform evaluation

Update the URL and token for your new deployment and run the analysis notebook again.

You will notice that the recall for Apple Granny Smith has improved. However, the change in the dataset is too small to have a significant impact on the overall accuracy of the model. 

Tutorial recap

You have prepared a dataset for the Peltarion Platform and used it to train a model to the point where the validation metrics suggest that further tuning of the architecture is not worthwhile. 

By deploying the model and comparing the predictions of the model with the ground truth in a separate test dataset, you have done two things:

  • identified the class that would benefit most from more training data 
  • improved model performance metrics for the identified class