Subset of a dataset

Subset for training and validating

A subset enables you to define a set of examples in the dataset to be used in a specific phase of the modeling process, i.e. training or validation. One example is one row in the dataset matrix.

When you import a dataset to the platform, we create default training (80%) and validation (20%) subsets for you by shuffling the examples and partitioning them randomly.
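As a rough illustration of what a shuffled 80/20 partition looks like, here is a minimal Python sketch. The function name `default_split` and the seed value are assumptions made for this example. It is not the platform's actual implementation, which is not described here.

```python
import random

def default_split(num_examples, train_fraction=0.8, seed=0):
    """Shuffle example indices, then cut them into two disjoint subsets."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    cut = int(num_examples * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, val_idx = default_split(1000)
print(len(train_idx), len(val_idx))  # 800 200
```

Because both subsets are slices of the same shuffled list, every example lands in exactly one of them.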

Use subset to narrow down a large dataset

You can create subsets to narrow down a dataset. This is useful when your dataset is very large and you want to train your model on a smaller portion of it. We do this in the tutorial predicting mood from raw audio data.

Figure 1. Dataset subsets

Create a subset

Note
When you partition a dataset, each example should belong to one and only one subset.

RANDOM % RULE SUBSET

With the Random % rule subset, examples in the dataset are included according to these parameters.

  • Seed: A random seed of your choice, used to initialize the shuffling of the data. Using the same seed always generates a subset with the same random examples.

  • Operator: e.g., Is equal to, Greater than, and Less than.

  • Value: Fraction of the dataset to include in the subset, interpreted in relation to the operator.

Example: To create a 70% subset, you can use Seed: 1111, Operator: Less than, and Value: 0.7.
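One way to picture how these three parameters interact is the following Python sketch. It assumes that each example gets a pseudo-random number between 0 and 1 from a generator initialized with the seed, and that the operator compares this number with the value. This model and the function name `random_rule` are assumptions for illustration. The platform's internal mechanism may differ.

```python
import random

# Only two operators are modeled here; the names mirror the ones in the text.
OPERATORS = {
    "Less than": lambda r, v: r < v,
    "Greater than or equal to": lambda r, v: r >= v,
}

def random_rule(num_examples, seed, operator, value):
    """Return the indices selected by a random % rule (hypothetical model)."""
    rng = random.Random(seed)
    # One pseudo-random number in [0, 1) per example; same seed, same numbers.
    draws = [rng.random() for _ in range(num_examples)]
    keep = OPERATORS[operator]
    return [i for i, r in enumerate(draws) if keep(r, value)]

subset = random_rule(10_000, seed=1111, operator="Less than", value=0.7)
print(len(subset) / 10_000)  # roughly 0.7; the exact share varies with the seed
```

Under this model, running the rule again with the same seed returns the same examples, and the resulting share only approximates the requested fraction.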

CONDITION RULE SUBSET

With the condition rule, the examples are included in the subset according to these parameters.

  • Feature: Feature in the dataset.

  • Operator: e.g., Is equal to, Text contains, and Less than.

  • Value: The value that the feature is compared with, using the operator, to decide which examples are included in the subset.

Example: You have a dataset with data on Swedish cities. To see how well your trained model generalizes to unseen data, you can use these conditions:

  • Feature: City, Operator: Is not equal to, Value: Stockholm for the training data subset.

  • Feature: City, Operator: Is equal to, Value: Stockholm for the validation data subset.
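The two city conditions above can be sketched in Python as a filter over rows. The rows, the `OPERATORS` table, and the function name `condition_rule` are all hypothetical and exist only to show that complementary operators produce non-overlapping subsets.

```python
# Hypothetical rows, made up for illustration only.
rows = [
    {"City": "Stockholm"},
    {"City": "Gothenburg"},
    {"City": "Stockholm"},
    {"City": "Uppsala"},
]

# A few operators named in the text, modeled as plain comparisons.
OPERATORS = {
    "Is equal to": lambda a, b: a == b,
    "Is not equal to": lambda a, b: a != b,
    "Less than": lambda a, b: a < b,
    "Text contains": lambda a, b: b in a,
}

def condition_rule(rows, feature, operator, value):
    """Return the indices of rows whose feature satisfies the condition."""
    compare = OPERATORS[operator]
    return [i for i, row in enumerate(rows) if compare(row[feature], value)]

training = condition_rule(rows, "City", "Is not equal to", "Stockholm")
validation = condition_rule(rows, "City", "Is equal to", "Stockholm")
print(training, validation)  # [1, 3] [0, 2]
```

Since "Is equal to" and "Is not equal to" are exact opposites, each row ends up in exactly one of the two subsets.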

COMBINATION SUBSET OF RANDOM AND CONDITION RULES

You can create a subset of examples by combining a number of different conditions.

Note
Each example must belong to one and only one subset. Be careful with the rule definitions.

Example: You create non-overlapping training and validation subsets by combining the same conditions with complementary random rules.

Your training subset could be defined with these rules:

  • Random:

    Seed: 1111, Operator: Less than, Value: 0.7. (70% of the examples)

  • Condition:

    Feature: City, Operator: Is equal to, Value: Stockholm
    AND

  • Condition:

    Feature: Date, Operator: Greater than or equal to, Value: 20170101

For the validation subset you want to make sure you don’t include the same examples. Use this exclusive rule combination:

  • Random:

    Seed: 1111, Operator: Greater than or equal to, Value: 0.7. (30% of the examples)

  • Condition:

    Feature: City, Operator: Is equal to, Value: Stockholm
    AND

  • Condition:

    Feature: Date, Operator: Greater than or equal to, Value: 20170101
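The following Python sketch checks that the two rule combinations above do not overlap. It assumes that all rules within a subset are combined with AND, and that the random rule assigns each example a number between 0 and 1 derived from the seed. Both are assumptions for illustration, and the rows are made up.

```python
import random

# Hypothetical rows: 2 cities x 3 dates x 50 copies = 300 examples.
rows = [
    {"City": city, "Date": date}
    for city in ("Stockholm", "Gothenburg")
    for date in (20161231, 20170101, 20170615)
    for _ in range(50)
]

# Same seed for both subsets, so each example gets the same number in both rules.
rng = random.Random(1111)
draws = [rng.random() for _ in rows]

def matches(row):
    """The two condition rules shared by both subsets."""
    return row["City"] == "Stockholm" and row["Date"] >= 20170101

training = {i for i, row in enumerate(rows) if draws[i] < 0.7 and matches(row)}
validation = {i for i, row in enumerate(rows) if draws[i] >= 0.7 and matches(row)}

assert training.isdisjoint(validation)  # no example is in both subsets
print(len(training), len(validation))
```

In this model the subsets stay disjoint only because both random rules use the same seed and complementary operators. With different seeds, an example could satisfy both rules or neither.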
