Subset of a dataset

Create subsets for training, validating, and testing.

A subset allows you to reduce the size of your dataset, and to split the examples into disjoint sets for the purpose of training, validation, and testing.

It’s important that these sets are disjoint, meaning that the same example cannot belong to more than one set. Training a model with the same examples that are in the validation or test subsets is the AI equivalent of giving the same question during class and at the final exam: the model could get an amazing grade simply by memorizing one answer, rather than by learning how to calculate it.
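
If you prepare subsets outside the platform, you can check that they are disjoint before you train. The snippet below is a minimal sketch in plain Python, assuming every example has a unique identifier; the function name `overlapping_ids` and the subset names are hypothetical and not part of the platform.

```python
from itertools import combinations

def overlapping_ids(**subsets):
    """Return {(name_a, name_b): shared_ids} for every pair of subsets that overlaps."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(subsets.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Disjoint subsets: nothing is reported.
clean = overlapping_ids(training=[1, 2, 3], validation=[4, 5], test=[6])

# Example 3 appears in both training and validation: this pair is reported.
leaky = overlapping_ids(training=[1, 2, 3], validation=[3, 4], test=[5])
```

An empty result means that no example is shared between any two subsets.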

By default, we create a training set, a validation set, and a test set containing 80%, 10%, and 10% of the examples in the imported dataset, respectively.

How to create subsets

You can create more subsets by clicking the New subset button. A dialog will appear to help you create subsets in 2 steps:

  1. Filter data lets you define which examples to keep or remove from the original dataset.
    If that’s all you need, you can click Create subset. Otherwise, go to the next step.

  2. Split subset lets you split the filtered examples into two or three disjoint subsets for training, validation, and (if required) testing.

Filter data

Filtering data can be used to keep or exclude examples according to custom criteria.

If you don’t create any filter, the entire dataset is used when splitting into subsets.
The Data usage slider shows how much of the dataset remains after filtering.

You may use filtering when:

  • Your dataset is very large and you want to use a smaller subset to train your model faster.
    There is an example of it in the Predicting mood from raw audio data tutorial.

  • You want to work with only some part of your dataset.
    For instance, you can use filters to remove a specific category, or to keep only the examples whose target value is above a certain threshold.

Add random % filter

A random % filter selects examples at random, keeping only the percentage of data specified by the Size.

In practice, each example in the dataset gets a random number between 0 and 100. Only the examples with a value between Start and End are preserved by the filter, and the others are removed.

Random filter
  • Fixed from: Toggle whether you want to specify the Start or the End of the included range.

  • Start/End: The value that defines one side of the included range. The other side of the range is always 0% or 100%.

  • Size: The size of the included range, that is, the percentage of examples to keep.
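
The mechanism described above can be sketched in a few lines of Python. This is an illustration only, not the platform's implementation: the function name `random_percent_filter`, the `seed` parameter, and the choice of a half-open range are assumptions.

```python
import random

def random_percent_filter(examples, start=0.0, end=100.0, seed=0):
    """Keep the examples whose random draw falls in the range [start, end)."""
    rng = random.Random(seed)
    kept = []
    for example in examples:
        draw = rng.random() * 100  # random number in [0, 100)
        if start <= draw < end:
            kept.append(example)
    return kept

examples = list(range(10_000))

# Specifying End = 30 keeps the range 0 to 30, so Size is 30.
first_30 = random_percent_filter(examples, start=0, end=30)

# Specifying Start = 30 keeps the range 30 to 100, so Size is 70.
last_70 = random_percent_filter(examples, start=30, end=100)
```

Because both calls use the same seed, every example gets the same draw in both, so the two ranges select complementary parts of the data. The kept share is close to Size, but not exact, since each example is drawn independently in this sketch.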

Add conditional filter

With the conditional filter, examples are included or excluded based on the value of a feature.

  • Feature: The feature to consider.

  • Operator: The operator used to compare feature values, e.g., Is equal to, Text contains, or Less than.

  • Value: Value that each example is compared to using the operator.
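
As a rough illustration of how Feature, Operator, and Value work together, here is a minimal Python sketch. It assumes examples are stored as dictionaries; the `OPERATORS` table, the function `conditional_filter`, and its `keep` flag are hypothetical names, and only three of the operators are shown.

```python
import operator

# Hypothetical mapping from operator names to comparison functions.
OPERATORS = {
    "Is equal to": operator.eq,
    "Less than": operator.lt,
    "Text contains": lambda text, value: value in text,
}

def conditional_filter(examples, feature, op_name, value, keep=True):
    """Include (keep=True) or exclude (keep=False) the examples that match."""
    compare = OPERATORS[op_name]
    return [ex for ex in examples if compare(ex[feature], value) == keep]

examples = [
    {"fruit": "Banana", "weight": 120},
    {"fruit": "Apple", "weight": 150},
    {"fruit": "Pineapple", "weight": 900},
]

# Exclude one category.
no_bananas = conditional_filter(examples, "fruit", "Is equal to", "Banana", keep=False)

# Keep only the examples below a numeric value.
light = conditional_filter(examples, "weight", "Less than", 200)

# Keep only the examples whose text contains a substring.
apple_like = conditional_filter(examples, "fruit", "Text contains", "pple")
```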

The order of filters is important

The examples are filtered sequentially with the first filter, then the second filter, then the third filter, etc.
This means that the first filter will act on the entire dataset, but the subsequent filters will only apply to what remains. In some cases, a filter might remove too much data.

Example:
Your dataset consists of images of many fruits, but you want to only keep half of the Banana examples.
If you first filter out half of your examples and then keep only bananas, you could end up with a very different result than if you first keep only bananas and then filter out half of the examples.

The difference will be stronger the more unbalanced the dataset is, or the fewer examples it contains.

Figure 1. By chance, no Banana examples remain when half the dataset is randomly filtered out first. Changing the order of the filters gives the desired result.
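
The effect of the order can be reproduced with a small Python sketch. It assumes, for illustration only, that the random filter keeps exactly half of whatever examples reach it; the helper names `keep_bananas` and `keep_half` and the 4-in-100 dataset are made up for this example.

```python
import random

def keep_bananas(examples):
    """Conditional filter: keep only the Banana examples."""
    return [ex for ex in examples if ex == "Banana"]

def keep_half(examples, seed):
    """Random filter: keep exactly half of the examples it receives (assumed behavior)."""
    rng = random.Random(seed)
    return rng.sample(examples, len(examples) // 2)

# A heavily unbalanced dataset: 4 bananas among 100 fruits.
dataset = ["Banana"] * 4 + ["Apple"] * 96

# Bananas first: the random filter receives 4 bananas and keeps exactly 2.
bananas_then_half = keep_half(keep_bananas(dataset), seed=0)

# Random first: the number of bananas that survive depends on chance.
counts_random_first = [
    len(keep_bananas(keep_half(dataset, seed))) for seed in range(200)
]
```

With the conditional filter first, you always get 2 bananas. With the random filter first, the count changes from one seed to another and can be anywhere from 0 to 4.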

Move filters

You can use the arrows to rearrange your filters.

Figure 2. Use the arrows and the cross to reorder or remove filters.

Split subset

Splitting a subset creates 2 or 3 disjoint subsets of examples that can be used for training, validation, and testing.

You can specify a custom size for each subset, as a percentage of the data remaining after the filter step.
The sizes of all the subsets can’t add up to more than 100%, but they can add up to less if you want to work with less data, e.g., to experiment faster.
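
To see what the sizes mean in number of examples, here is a short Python sketch of the arithmetic. The function name `subset_counts` is hypothetical, and rounding down is an assumption; the platform may round differently.

```python
def subset_counts(n_remaining, sizes_percent):
    """Convert subset sizes in percent into numbers of examples (rounded down)."""
    total = sum(sizes_percent.values())
    if total > 100:
        raise ValueError(f"Sizes add up to {total}%, which is more than 100%")
    return {
        name: int(n_remaining * pct / 100)
        for name, pct in sizes_percent.items()
    }

# 5000 examples remain after filtering, and only 40% of them are used.
counts = subset_counts(5000, {"training": 30, "validation": 5, "test": 5})
unused = 5000 - sum(counts.values())
```

Here the subsets get 1500, 250, and 250 examples, and the remaining 3000 examples are left out of all three subsets.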

There are 3 Types of split: Stratified, Random, and Categorical.

Stratified split

Figure 3. Dataset split using the fruit name as the feature to stratify on. This ensures that the categories of fruit have the same distribution in the original dataset and in the split subsets.

The Stratified type of split randomly assigns examples to the training, validation, or (optionally) test subsets, in the proportions given by the Size of each subset.

However, special care is taken to ensure that the Feature to stratify on has the same distribution of categories in the original dataset and in each split subset.
Only categorical features can be used to stratify a split.

Unlike the Random type of split, Stratified guarantees that you always evaluate your model on examples whose distribution over the stratified feature is representative of your dataset.

  • Feature to stratify on: The feature whose category distribution will be preserved. Only categorical features are available.

  • Size (percentage): The size of the subset, given as a percentage of the data that remains after the Filter step.

  • Name: The name you want to give to the split subset.
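
One common way to obtain a stratified split is to split each category separately and then merge the pieces. The Python sketch below follows that idea. It is an illustration under assumptions, not the platform's algorithm: the function name `stratified_split`, the dictionary layout of the examples, and the rounding rule are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(examples, feature, sizes, seed=0):
    """Split each category of `feature` separately, in the given percentages."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex[feature]].append(ex)

    subsets = {name: [] for name in sizes}
    for group in by_category.values():
        rng.shuffle(group)
        start = 0
        for name, pct in sizes.items():
            # Rounding can leave a subset slightly short for small categories.
            count = round(len(group) * pct / 100)
            subsets[name].extend(group[start:start + count])
            start += count
    return subsets

# 80 apples and 20 bananas, each with a unique id.
fruits = ["Apple"] * 80 + ["Banana"] * 20
examples = [{"id": i, "fruit": fruit} for i, fruit in enumerate(fruits)]

split = stratified_split(
    examples, "fruit", {"training": 80, "validation": 10, "test": 10}
)
bananas_per_subset = {
    name: sum(ex["fruit"] == "Banana" for ex in subset)
    for name, subset in split.items()
}
```

Bananas make up 20% of the dataset, and also 20% of every subset: 16 of 80 in training, 2 of 10 in validation, and 2 of 10 in test.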

Random split

Figure 4. Dataset split completely at random. By chance, the validation subset contains no Banana examples. This will give misleading results since the model won’t be evaluated on this category.

The Random type of split randomly assigns examples to the training, validation, or (optionally) test subsets, in the proportions given by the Size of each subset.

In most cases, Random is the best way to split subsets.

However, when the dataset is heavily unbalanced, or when there are only a few examples, some of the split subsets might end up with a distribution of examples that is not representative of your dataset. This could lead to misleading results, and should be avoided.
In such cases, try to use the Stratified or Categorical splits instead.

  • Size (percentage): The size of the subset, given as a percentage of the data that remains after the Filter step.

  • Name: The name you want to give to the split subset.
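
A random split can be pictured as shuffling the examples and cutting the shuffled list into consecutive pieces. The Python sketch below shows this under assumptions: the function name `random_split`, the `seed` parameter, and rounding down are hypothetical choices, not the platform's documented behavior.

```python
import random

def random_split(examples, sizes, seed=0):
    """Shuffle the examples, then cut them into consecutive, disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    subsets, start = {}, 0
    for name, pct in sizes.items():
        count = int(len(shuffled) * pct / 100)
        subsets[name] = shuffled[start:start + count]
        start += count
    return subsets

# An unbalanced dataset: only 3 bananas among 100 fruits.
fruits = ["Banana"] * 3 + ["Apple"] * 97
split = random_split(fruits, {"training": 80, "validation": 10, "test": 10})

# Nothing forces each subset to contain a banana.
bananas_per_subset = {
    name: subset.count("Banana") for name, subset in split.items()
}
```

The subset sizes are always respected, but how the 3 bananas are spread over the subsets depends on the shuffle, so a subset can end up with none.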

Categorical split

Figure 5. Dataset split using an extra categorical feature. You can decide which category belongs to which subset. In this case, the category Train is assigned to the training subset, and Val is assigned to the validation subset.

The Categorical type of split uses the categories from a categorical feature to assign examples to a subset.

This allows you to decide exactly how to split examples by marking them with an additional feature in the dataset that you upload.

  • Feature referring to training and validation: The feature whose categories are used to assign examples to specific subsets. Only categorical features are available.

  • Category: The category that marks membership to the specific split subset.

  • Name: The name you want to give to the split subset.
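
The idea can be sketched in Python as a simple lookup from category to subset. The feature name `split`, the function name `categorical_split`, and the choice to leave out examples with an unassigned category are assumptions made for this illustration.

```python
def categorical_split(examples, feature, category_to_subset):
    """Assign each example to the subset that its category is mapped to."""
    subsets = {name: [] for name in category_to_subset.values()}
    for ex in examples:
        subset_name = category_to_subset.get(ex[feature])
        if subset_name is not None:  # unmapped categories are left out (assumption)
            subsets[subset_name].append(ex)
    return subsets

# The uploaded dataset carries an extra feature that marks each example.
examples = [
    {"fruit": "Banana", "split": "Train"},
    {"fruit": "Apple", "split": "Train"},
    {"fruit": "Banana", "split": "Val"},
    {"fruit": "Apple", "split": "Unused"},
]

split = categorical_split(
    examples, "split", {"Train": "training", "Val": "validation"}
)
```

Each example lands in at most one subset, so the subsets are disjoint by construction.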
