Outlier handling

Outliers are values that differ drastically from the rest of the values in a dataset feature.

When your data contains outliers, you have two options: remove them or keep them.

Inspect first
As a first step, manually inspect the outliers to understand why they differ so much from the rest of your data.

  1. In the Datasets view, navigate to the Outliers tab in the Data cleaning tab.

  2. Inspect the different features of your dataset.
    Every column in the graph indicates how many samples have that value.

Figure 1. Outliers selected

Outlier handling is turned off by default but can be turned on and off individually for each feature.

When are the outliers bad

Training a model with data that includes outliers might hinder the model from learning from the relevant data in the dataset.

If there are a lot of outliers or if the values of the outliers differ significantly from the rest of the data, they can heavily influence how well the model will perform. This may result in your model making worse predictions.

If you cannot justify why the outliers are in your dataset, and they should not have been in the data in the first place, remove them.

An example of such an outlier is an invalid age value such as -1 or 1000.

Remember that you should try to keep as much valid data as you can, since the more data you have, the better the chance that your model will make good predictions.
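As a rough illustration of separating invalid values from valid ones, the sketch below filters a hypothetical list of ages. The range limits `MIN_AGE` and `MAX_AGE` are assumptions chosen for this example, not values used by the platform.

```python
# Hypothetical age column; -1 and 1000 are the invalid values from the text.
ages = [23, 35, -1, 41, 1000, 58, 19]

# Assumed plausible range for a human age; adjust to your own data.
MIN_AGE, MAX_AGE = 0, 120

# Keep values inside the range, and collect the rest for inspection.
valid = [a for a in ages if MIN_AGE <= a <= MAX_AGE]
invalid = [a for a in ages if not (MIN_AGE <= a <= MAX_AGE)]

print(f"kept {len(valid)} of {len(ages)} samples, dropped {invalid}")
```

Printing how many samples are kept makes it easy to check that you are not discarding more valid data than intended.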

When are the outliers good

You should keep in mind that outliers do not always indicate errors in data, or extreme values that need to be removed. They might also be part of edge cases that you shouldn’t remove.

If you remove an edge case outlier, your model’s accuracy might improve on your current dataset, but the deployed model might perform very poorly when it is used close to the edge case.

If you have edge case outliers and want to use your deployed model in the edge case range, you might need to collect more data, or look for other solutions instead of ignoring or removing them.

An example of an edge case outlier is a valid age value such as 100 or 110, of which you have too few samples in your data.

Why does a dataset include outliers

Outliers are not uncommon, and there are many reasons why a dataset might contain them, for example:

  • Data entry errors (human errors)

  • Measurement errors (instrument errors)

  • Sampling errors (extracting or mixing data from wrong or various sources)

  • Intentional outliers (dummy outliers created for testing)

  • Edge or rare cases (samples that should be part of the data, but you have too few of them)

How to work with outlier handling

Often it’s clear where the outliers are. For example, a feature has a spike at the end of its histogram or a very long tail of spread-out examples.
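If you want to look for such a tail outside the platform, a text histogram can be enough. The following is a minimal sketch using only the Python standard library. The function `text_histogram`, the bin width, and the sample values are all hypothetical.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical feature values with a few spread-out values far to the right.
values = [12, 15, 17, 18, 21, 22, 22, 25, 27, 29, 31, 33, 95, 140, 210]

hist = text_histogram(values, bin_width=10)
for start, count in hist.items():
    # One '#' per sample in the bin; empty bins are not listed.
    print(f"{start:>4}-{start + 9:<4} {'#' * count}")
```

In this made-up data most samples fall between 10 and 39, while the single samples at 90, 140, and 210 form the kind of long tail described above.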

You can also compare the results of your model with and without outliers so that you can see how the results change.
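The comparison can be sketched with a very small stand-in for a model. The example below fits a straight line by least squares, once with and once without a single outlier sample. It is not the platform's training procedure, and the data and the `fit_line` helper are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data following y = 2x, plus one outlier sample at x = 6.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 60]

slope_all, _ = fit_line(xs, ys)              # fit on all samples
slope_clean, _ = fit_line(xs[:-1], ys[:-1])  # fit without the outlier
print(f"slope with outlier: {slope_all:.2f}, without: {slope_clean:.2f}")
```

A single extreme sample is enough to pull the fitted slope far away from the trend in the remaining data, which is the kind of change to look for when you compare results.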

Remove outliers only

Don’t remove too many examples. The more data you have, the better the chance that your model will learn from your data and make good predictions.

Values are removed from the whole dataset

A sample can have an outlier value in one feature but have values in a realistic range for other features. If you remove an outlier, the complete sample is removed. That is, you will also remove the other feature values for this sample.
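The effect can be sketched as follows, assuming samples are stored as rows with one value per feature. The `remove_outliers` helper and the data are hypothetical and only mimic the behavior described above.

```python
# Hypothetical samples with two features each.
samples = [
    {"age": 34, "income": 52000},
    {"age": 1000, "income": 48000},  # invalid age, realistic income
    {"age": 29, "income": 61000},
]

def remove_outliers(rows, feature, min_value, max_value):
    """Keep only rows whose `feature` lies within [min_value, max_value]."""
    return [r for r in rows if min_value <= r[feature] <= max_value]

remaining = remove_outliers(samples, "age", 0, 120)

# The realistic income of 48000 is gone together with the invalid age.
print(len(remaining), [r["income"] for r in remaining])
```

Filtering on the age feature alone also drops the income value of the same sample, even though that value was in a realistic range.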

Procedure to remove outliers

  1. In the Datasets view, navigate to the Outliers tab in the Data cleaning tab.

  2. Inspect the different features of your dataset.
    Every column in the graph indicates how many samples have that value. If a graph includes outliers, click Select a range.

  3. Select the Min and Max cut-off values.
    Do this either by dragging the sliders in the graph or by entering values in the boxes.
    All values above Max or below Min are considered outliers.

  4. Click Apply changes.
    This removes every sample that includes an outlier value from the whole dataset version.

You’ll see how many samples remain in the dataset at the top of the Datasets view. It’s the same value as the total number of samples in the subset pane on the Datasets overview page.

CSV and numeric only

Outlier handling is currently limited to CSV files and numeric features. A zip file that contains a CSV file with numeric features also works.
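To check in advance which columns of a CSV file are numeric, a quick local test can help. The sketch below is an assumption about what "numeric" means (every non-empty value parses as a float) and does not describe how the platform detects feature types. The helper names and the CSV content are hypothetical.

```python
import csv
import io

def _is_number(text):
    """Return True if `text` can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def numeric_columns(csv_text):
    """Return names of columns in which every non-empty value parses as a float."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    result = []
    for name in columns:
        # Skip empty or missing cells; a column with no values is not numeric.
        values = [row[name] for row in rows if row[name]]
        if values and all(_is_number(v) for v in values):
            result.append(name)
    return result

# Hypothetical CSV content.
data = "age,city,income\n34,Stockholm,52000\n29,Oslo,61000.5\n"
print(numeric_columns(data))
```

Note that Python's `float` also accepts strings such as `nan` and `inf`, so a stricter check may be needed for real data.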

Where to find the outlier handling feature on the platform

Outlier handling is available in the Data cleaning tab in the Datasets view.
