Subset of a dataset

Create subsets for training and validating

A subset enables you to define a set of examples in the dataset to be used in a specific phase of the modeling process, i.e., training or validation. One example is one row in the dataset matrix.

When you import a dataset to the platform, we create a default 80% training subset and a 20% validation subset for you by shuffling and partitioning the examples randomly.

Narrow down a large dataset

You can create subsets if you want to narrow down a dataset. This is useful when your dataset is very large and you want to use a smaller dataset to train your model. The tutorial Predicting mood from raw audio data shows how.

Note
When you partition a dataset you want each example to belong to one and only one subset.

Random % filter

With a random % filter, a set percentage of randomly chosen rows of the dataset is included in the subset.
Each row in the dataset gets a random number from 0 to 1 based on the Seed. Rows are then selected based on the Size %, counted from the Start or the End.

  • Start/End: Toggle whether you want to take the examples from the start, i.e., Less than (<), or the end, i.e., Greater than (>), of the dataset.

  • Size: Share of the dataset to include in the subset, e.g., 20% for a validation subset.

  • Seed: Any number you like, used to initialize the shuffling of the data. Using the same seed always generates a subset with the same random examples.

Example:
To create a 70% subset, you can use the basic conditions Seed: 1111 and Size: 70%. Just keep in mind that you want each example to belong to one and only one subset.
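
The mechanism described above can be sketched in Python. This is an illustration only, not the platform's implementation: the function names are hypothetical, and Python's random.Random stands in for whatever random generator the platform uses, so the same seed will not select the same rows as on the platform.

```python
import random

def random_scores(n_rows, seed):
    """Give every row a pseudo-random number in [0, 1) derived from the seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_rows)]

def random_percent_filter(rows, seed, size, from_start=True):
    """Keep the rows whose score falls in the first or last `size` share of [0, 1)."""
    scores = random_scores(len(rows), seed)
    if from_start:
        # Start: "Less than" the size
        return [row for row, score in zip(rows, scores) if score < size]
    # End: at or above the complementary threshold
    return [row for row, score in zip(rows, scores) if score >= 1.0 - size]

rows = list(range(1000))  # stand-in for 1000 examples
subset_a = random_percent_filter(rows, seed=1111, size=0.7)
subset_b = random_percent_filter(rows, seed=1111, size=0.7)

print(len(subset_a))         # roughly 700; the exact count depends on the seed
print(subset_a == subset_b)  # True: the same seed gives the same subset
```

Because each row is kept or dropped based on its own random number, the subset size is close to the requested percentage rather than exactly equal to it.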

Conditional filter

With the conditional filter, examples are included in relation to the value of a feature.

  • Feature: Feature in the dataset.

  • Operator: e.g., Is equal to, Text contains, and Less than.

  • Value: Value that the feature is compared with, using the operator, to decide which examples are included in the subset.

Example:
You have a dataset with data on Swedish cities. To see how well your trained model generalizes to unseen data, you can use these conditions:

  • Feature: City, Operator: Is not equal to, Value: Stockholm for the training data subset.

  • Feature: City, Operator: Is equal to, Value: Stockholm for the validation data subset.
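
As a rough sketch of how such conditions select examples, the two filters above could be written in Python as follows. The function name, the operator table, and the sample rows are made up for illustration; they are not the platform's API or data.

```python
import operator

# Hypothetical mapping from operator names to comparison functions
OPERATORS = {
    "Is equal to": operator.eq,
    "Is not equal to": operator.ne,
    "Less than": operator.lt,
    "Text contains": lambda text, part: part in text,
}

def conditional_filter(rows, feature, op, value):
    """Keep the rows where `feature` compared with `value` satisfies `op`."""
    compare = OPERATORS[op]
    return [row for row in rows if compare(row[feature], value)]

# Made-up example rows
cities = [
    {"Id": 1, "City": "Stockholm"},
    {"Id": 2, "City": "Göteborg"},
    {"Id": 3, "City": "Malmö"},
    {"Id": 4, "City": "Stockholm"},
    {"Id": 5, "City": "Uppsala"},
]

training = conditional_filter(cities, "City", "Is not equal to", "Stockholm")
validation = conditional_filter(cities, "City", "Is equal to", "Stockholm")

print([row["Id"] for row in training])    # [2, 3, 5]
print([row["Id"] for row in validation])  # [1, 4]
```

Since the two operators are exact opposites, every example ends up in one and only one subset.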

Combination subset with random % and conditional filters

You can create a subset of examples by combining several different filters.

Each example must belong to one and only one subset. Be careful with the filter definitions.

Example:
You create non-overlapping training and validation datasets with nested filters by setting different random % filters.

Order of filters may be important for small datasets.

When your dataset is small, the order of random % filters is important. The subset is created sequentially: the first filter is applied, then the second, then the third, and so on. If you are unlucky, the first filters may remove all examples of one category, thus creating an unbalanced subset.
For conditional filters, the order doesn’t matter.

Example:
Your small dataset consists of images of bananas and pears. It includes many more images of bananas than pears. If you apply several random % filters, there is a high chance that the resulting subset consists only of bananas. Not ideal!

Figure 1. By chance the seed 1111 results in a subset with only bananas. With seed 5555 it consists of both.
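
One way to spot this problem is to count the categories in a candidate subset for a few seeds before you train. The sketch below assumes a made-up list of 18 banana labels and 2 pear labels and uses Python's random.Random as a stand-in generator, so seeds 1111 and 5555 will not reproduce the platform's result shown in the figure. The helper names are hypothetical.

```python
import random
from collections import Counter

def random_subset(labels, seed, size):
    """Keep each label whose pseudo-random score is below `size`."""
    rng = random.Random(seed)
    return [label for label in labels if rng.random() < size]

def missing_classes(labels, subset):
    """Return the categories present in the dataset but absent from the subset."""
    return set(labels) - set(subset)

# Made-up, strongly unbalanced dataset
labels = ["banana"] * 18 + ["pear"] * 2

for seed in (1111, 5555):
    subset = random_subset(labels, seed, size=0.2)
    # Print the category counts and any category that was lost
    print(seed, dict(Counter(subset)), "missing:", missing_classes(labels, subset))
```

If a category shows up as missing, try another seed or a larger subset size.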

Move filter in a subset

You can use the arrows to rearrange your subset filters.

Figure 2. Move filter in a subset

Example of non-overlapping filters

When you split a dataset into a training and a validation subset, each example must belong to one and only one subset.

In this example, we select from a larger dataset everyone who moved to the city Stockholm on 20170101 or later. These examples are then split into a training (70%) and a validation (30%) subset.

Training (70%) subset

Your training subset could be defined with these filters:

  • Random %:

    Seed: 1111, Operator: Less than, Value: 0.7. (70% of the examples)

  • Condition:

    Feature: City, Operator: Is equal to, Value: Stockholm
    AND

  • Condition:

    Feature: Date, Operator: Greater than or equal to, Value: 20170101

Validation (30%) subset

For the validation subset you want to make sure you don’t include the same examples. Use this exclusive filter combination:

  • Random %:

    Seed: 1111, Operator: Greater than or equal to, Value: 0.7. (30% of the examples)

  • Condition:

    Feature: City, Operator: Is equal to, Value: Stockholm
    AND

  • Condition:

    Feature: Date, Operator: Greater than or equal to, Value: 20170101
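
The two filter combinations above can be checked for overlap with a short Python sketch. It rests on several assumptions: the rows are generated for illustration, the helper names are hypothetical, and Python's random.Random replaces the platform's random generator. The key point it shows is that both subsets use the same seed, so every row gets the same random number in both, and the thresholds Less than 0.7 and Greater than or equal to 0.7 cannot select the same row.

```python
import random

def with_scores(rows, seed):
    """Pair every row with a pseudo-random number in [0, 1) derived from the seed."""
    rng = random.Random(seed)
    return [(rng.random(), row) for row in rows]

def build_subset(rows, seed, keep_score):
    """Apply the random % filter AND both conditions from the example."""
    return [
        row
        for score, row in with_scores(rows, seed)
        if keep_score(score)
        and row["City"] == "Stockholm"
        and row["Date"] >= 20170101
    ]

# Made-up rows: alternating cities, dates in 2016, 2017 and 2018
rows = [
    {
        "Id": i,
        "City": "Stockholm" if i % 2 == 0 else "Uppsala",
        "Date": 20160101 + (i % 3) * 10000,
    }
    for i in range(600)
]

training = build_subset(rows, 1111, lambda score: score < 0.7)
validation = build_subset(rows, 1111, lambda score: score >= 0.7)

train_ids = {row["Id"] for row in training}
val_ids = {row["Id"] for row in validation}

print(train_ids & val_ids)              # set(): no example is in both subsets
print(len(train_ids) + len(val_ids))    # 200: every matching row is used once
```

If the two subsets used different seeds, the random numbers would differ per row and the same example could land in both subsets or in neither.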
