Working with multiple dataset versions

Easy experimentation is a key feature of the Peltarion Platform. You can experiment with and optimize your AI models in many ways. One of them is to use dataset versions to test different ways of configuring your dataset.

When you create a dataset, the first version comes with default settings. For example, the default subset split is 80% training and 20% validation. Other ways of splitting may work better.

In general, you want to train your model on as many samples as possible, because the more samples the model sees, the better it tends to become. If you have a relatively large dataset, it might make sense to split it into a 95% training subset and a 5% validation subset. If the dataset isn’t large enough, however, the validation subset will be small and might not give you a reliable estimate of the generalization error.
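To make the trade-off concrete, here is a minimal Python sketch, run outside the platform, that computes subset sizes for different splits and a rough standard error for an accuracy measured on the validation subset. The function names, the dataset sizes, and the assumed accuracy of 0.9 are illustrative assumptions, and the binomial formula is only a simple approximation of how noisy the estimate is.

```python
import math

def split_sizes(n_samples, train_fraction):
    """Return (training, validation) sample counts for a training fraction."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train

def accuracy_std_error(n_validation, accuracy=0.9):
    """Rough binomial standard error of an accuracy measured on n_validation samples.

    The default accuracy of 0.9 is an assumption made for illustration only.
    """
    return math.sqrt(accuracy * (1 - accuracy) / n_validation)

# Compare an 80/20 and a 95/5 split on a small and a large dataset.
for n_samples in (2_000, 1_000_000):
    for train_fraction in (0.80, 0.95):
        n_train, n_val = split_sizes(n_samples, train_fraction)
        error = accuracy_std_error(n_val)
        print(f"{n_samples} samples, {train_fraction:.0%} training: "
              f"{n_train} training, {n_val} validation, std error ~{error:.4f}")
```

With 2,000 samples, a 95/5 split leaves only 100 validation samples, and the standard error under these assumptions is about 0.03. With 1,000,000 samples, the same split leaves 50,000 validation samples and the standard error drops to about 0.0013.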

If your dataset is very large and you want to experiment with your model, one idea is to train and validate on smaller subsets, e.g., an 8% training and a 2% validation subset. Smaller subsets mean shorter training times, so you can test new model ideas faster. This is what we do in our tutorial Predicting mood from raw audio data.
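The idea of working on small subsets can be sketched in Python as follows. This is not how the platform implements subsets; it only illustrates drawing two non-overlapping random subsets from a large dataset. The function name, the seed, and the dataset size of 100,000 are assumptions.

```python
import random

def sample_subsets(indices, train_fraction=0.08, val_fraction=0.02, seed=0):
    """Draw disjoint random training and validation subsets from the indices."""
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_val = round(len(shuffled) * val_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    return train, validation

# A hypothetical dataset with 100,000 samples, identified by their row index.
train_idx, val_idx = sample_subsets(range(100_000))
print(len(train_idx), "training samples,", len(val_idx), "validation samples")
```

Because both subsets are cut from the same shuffled list, no sample ends up in both of them, and the remaining 90% of the data is simply left unused during the experiment.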

Another case where you may want to test different splits is when you want to train or validate only on data with a specific feature value.

Example: You have a dataset with sales data from different cities in Sweden and you want to train and validate on only the data from Stockholm. Then it is easy to create subsets with only Stockholm sales data.
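Conceptually, a subset like this is a filter on one feature. The Python sketch below shows the idea on a handful of made-up rows; the column names city and sales and all values are hypothetical, and on the platform you configure the subset in the dataset settings rather than in code.

```python
# Made-up rows for illustration; "city" and "sales" are hypothetical column names.
rows = [
    {"city": "Stockholm", "sales": 120},
    {"city": "Gothenburg", "sales": 95},
    {"city": "Stockholm", "sales": 140},
    {"city": "Malmö", "sales": 80},
    {"city": "Stockholm", "sales": 110},
]

def rows_with_value(rows, feature, value):
    """Keep only the rows where the given feature equals the given value."""
    return [row for row in rows if row[feature] == value]

stockholm_rows = rows_with_value(rows, "city", "Stockholm")
print(len(stockholm_rows), "of", len(rows), "rows are from Stockholm")
```

The filtered rows can then be split into training and validation subsets in the same way as the full dataset.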

Another way to experiment with the dataset settings is by changing the encoding setting on a feature.

Example: In some cases, it is not easy to know whether you should normalize inputs (e.g. images) by standardization, min-max scaling, or not at all. If a few images differ strongly from the rest, they skew the mean and standard deviation, so the standardization of all the other images is skewed as well. In that case, it can be a good idea to create two dataset versions, one where you standardize the images and one where you do not. With this setup, you can test which version gives the best results.
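The effect of an outlier on the two encodings can be seen in a small Python sketch. The numbers are made up and stand in for per-image values such as mean pixel intensity. The functions are a plain textbook version of standardization and min-max scaling, not the platform's implementation.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale the values linearly so that they span the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values: four typical samples and one sample that is far off.
typical = [100.0, 110.0, 120.0, 130.0]
with_outlier = typical + [1000.0]

def spread(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)

# Spread of the four typical samples after each encoding, with and without the outlier.
print("standardized, no outlier:  ", spread(standardize(typical)))
print("standardized, with outlier:", spread(standardize(with_outlier)[:4]))
print("min-max, no outlier:       ", spread(min_max_scale(typical)))
print("min-max, with outlier:     ", spread(min_max_scale(with_outlier)[:4]))
```

In both encodings, a single extreme sample squeezes the typical samples into a much narrower range, which is why it is worth comparing dataset versions with different encoding settings instead of guessing.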

Naming of dataset versions

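Remember to use a good naming strategy for your versions. Names like “Version 1”, “Version 2”, “Version 3” become hard to decipher after a while. A name that describes what is different about the version, for example “95-5 split” or “Stockholm only”, is easier to recognize later.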

How to check subset settings of a saved dataset version

To check the subset settings of a saved dataset version, hover over the subset to see a tooltip that shows its configuration.

How to edit a dataset version

You can only edit unsaved versions of a dataset. As soon as you save a version, it is locked for editing.

Create a new version of a dataset

To create a new version of a dataset you have three options:

  • Click Duplicate in the top right corner of the screen.

  • Click the three dots next to the dataset version you want to duplicate and select Duplicate. This menu is called the Dataset options menu.

  • Click Duplicate latest version in the Datasets navigation section.
