Impact of standardization - create different versions of a dataset / Example workflow

'Should I standardize my data or not?' is one of the most frequently asked questions in machine learning. In this example workflow, you will learn how to evaluate the effect of standardizing selected features. To do this, you create several versions of a dataset, with and without standardization. The workflow uses the dataset Predict California house prices (here).
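As a reminder of what standardization does to a feature, here is a minimal Python sketch. It assumes the common z-score definition (subtract the mean, divide by the standard deviation) and uses the population standard deviation; whether the platform computes it in exactly this way is an assumption, and the sample values are made up for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        # A constant feature has no spread; map every value to zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical house values, for illustration only.
house_values = [100000.0, 200000.0, 300000.0]
scaled = standardize(house_values)
```

After standardization the feature has mean 0 and standard deviation 1, so features measured on very different scales become comparable.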

Step 1: Create a new project

Click on New project in the Projects view.

Name the project, add a description and click Create.

Step 2: Import data

Click Open project, then select Estimate house price from the tutorial dataset list.

Step 3: Rename a feature with a meaningful label

Rename medianHouseValue to Target_medianHouseValue. This will help you to identify the target feature in the Modeling view easily.

Step 4: Manage data encoding

Click each feature column in the dataset. If a feature's values are categorical, select Categorical from the Encoding dropdown list.

Example: the MNIST tutorial (here) shows categorical encoding in use. In this workflow, set the Encoding to Image for the feature image_path and to Numeric for all other features.

Step 5: Create five versions of the dataset

1st version - NoStdImage/TargetStd

  1. Select None as the Normalization option for the feature image_path.

  2. Select Standardization as the Normalization option for the feature Target_medianHouseValue.

  3. Name this version NoStdImage/TargetStd, which is short for No standardization on image / Target standardized.

2nd version - StdImage/TargetStd

  1. Duplicate the dataset NoStdImage/TargetStd and name the copy StdImage/TargetStd, which is short for Standardized image / Target standardized.

  2. Select Standardization as the Normalization option for the feature image_path.

3rd version - StdTabular/TargetStd

  1. Duplicate the dataset StdImage/TargetStd and name the copy StdTabular/TargetStd, which is short for Standardized tabular data / Target standardized.

  2. Create a new feature set that only contains tabular data and name it input_tablular.

  3. Select Standardization as the Normalization option for each feature in the newly created feature set input_tablular.

4th version - NoStdTabular/TargetStd

  1. Duplicate the dataset StdTabular/TargetStd and name the copy NoStdTabular/TargetStd, which is short for No standardization on tabular data / Target standardized.

  2. Select None as the Normalization option for each feature in the feature set input_tablular, which was carried over from the duplicated version.

5th version - NoStdTabular/NoTargetStd

  1. Duplicate the dataset NoStdTabular/TargetStd and name the copy NoStdTabular/NoTargetStd, which is short for No standardization on tabular data / No standardization on Target.

  2. Select None as the Normalization option for the target feature Target_medianHouseValue.
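For readers who want to reproduce the idea outside the platform, the three tabular versions can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names standardize and make_version, the column medianIncome, and all values are hypothetical, the z-score definition is assumed, and the image feature is left out because the source does not say how image standardization is computed.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score a list of numbers (population standard deviation assumed)."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def make_version(columns, to_standardize):
    """Return a copy of `columns` with only the named columns standardized."""
    return {
        name: standardize(vals) if name in to_standardize else list(vals)
        for name, vals in columns.items()
    }


# Toy stand-in for the tabular part of the dataset (made-up values;
# medianIncome is a hypothetical input feature).
data = {
    "medianIncome": [2.0, 4.0, 6.0],
    "Target_medianHouseValue": [100000.0, 200000.0, 300000.0],
}
tabular = ["medianIncome"]
target = ["Target_medianHouseValue"]

# One entry per tabular version described in Step 5.
versions = {
    "StdTabular/TargetStd": make_version(data, tabular + target),
    "NoStdTabular/TargetStd": make_version(data, target),
    "NoStdTabular/NoTargetStd": make_version(data, []),
}
```

Each version is an independent copy, which mirrors duplicating the dataset before changing its Normalization options: the original values stay untouched and the versions can be compared side by side.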

Conclusion

The platform lets you examine data, adapt data types, normalize data, and create new feature sets and subsets. Next, build models in the Modeling view.