Outliers are values that differ drastically from the rest of the values in a dataset feature.
Outliers usually indicate that something is wrong with the data, and they shouldn’t have been there from the start. Checking for outliers is a sanity check on your data.
You should remove outliers but keep as much data as you can, since the more data you have, the better the chance that your model will make good predictions.
Outlier handling is turned off by default but can be turned on and off individually for each feature.
Outliers will result in worse predictions
Training a model with data that includes outliers will hinder the model from learning from the relevant data in the dataset.
If there are many outliers, or if the outliers are far from the rest of the values, they can heavily influence how well the model trains. As a result, your model will make worse predictions.
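To illustrate the effect, here is a minimal sketch that fits a straight line by ordinary least squares, once on clean data and once with a single wrongly entered value. The data and the `fit_line` helper are made up for this example and are not part of the platform.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data that follows y = 2x exactly.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
clean_slope, clean_intercept = fit_line(xs, ys)

# Same data, but the last target was entered as 100 instead of 10.
ys_with_outlier = [2, 4, 6, 8, 100]
outlier_slope, outlier_intercept = fit_line(xs, ys_with_outlier)

print(f"slope without outlier: {clean_slope}")   # 2.0
print(f"slope with outlier:    {outlier_slope}")  # 20.0
```

One bad value out of five is enough to change the fitted slope from 2 to 20, so the fitted line no longer describes the other four samples.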
Why does a dataset include outliers?
Outliers are not uncommon, and there are many reasons why a dataset might contain them, for example:
Data entry errors (human errors)
Measurement errors (instrument errors)
Sampling errors (extracting or mixing data from wrong or various sources)
How to work with outlier handling
Often it’s clear where the outliers are. For example, a feature has a spike at the end of its histogram, or a very long tail of spread-out examples.
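If you want to look at a feature outside the platform, a rough text histogram can show the same kind of spike. The sketch below assumes a made-up `ages` feature where 999 was used as a placeholder value. The `histogram` helper and the bin width are illustrative choices, not platform functionality.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical "age" feature; 999 is a placeholder or data entry error.
ages = [23, 25, 31, 34, 35, 38, 41, 42, 47, 52, 58, 999, 999]

bins = histogram(ages, bin_width=10)
for start, count in bins.items():
    # One '#' per sample in the bin that starts at `start`.
    print(f"{start:>4}: {'#' * count}")
```

The bins for 20 to 50 sit close together, while the bin at 990 appears far away from the rest. That isolated spike is the kind of pattern to look for.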
Remove outliers only
Don’t remove too many examples. The more data you have, the better the chance that your model will learn from your data and make good predictions.
Values are removed from the whole dataset
A sample can have an outlier value in one feature but have values in a realistic range for other features. If you remove an outlier, the complete sample is removed. That is, you will also remove the other feature values for this sample.
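The following minimal sketch shows what removing the complete sample means in practice. The features, values and the `drop_outlier_samples` helper are hypothetical. The sketch only mirrors the behavior described above, not the platform's implementation.

```python
# Hypothetical samples; only the "area" of the last one is unrealistic.
samples = [
    {"price": 120.0, "area": 45.0, "rooms": 2},
    {"price": 150.0, "area": 60.0, "rooms": 3},
    {"price": 135.0, "area": 9000.0, "rooms": 2},
]


def drop_outlier_samples(rows, feature, min_value, max_value):
    """Keep only rows whose `feature` value lies within [min_value, max_value]."""
    return [row for row in rows if min_value <= row[feature] <= max_value]


kept = drop_outlier_samples(samples, "area", min_value=10.0, max_value=500.0)

print(len(kept))                       # 2
print([row["price"] for row in kept])  # [120.0, 150.0]
```

The third sample had a realistic price and number of rooms, but both values are lost because the sample is removed as a whole.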
Procedure to remove outliers
In the Datasets view, navigate to the Outliers tab in the Data cleaning tab.
Inspect the different features of your dataset.
Every column in the graph indicates how many samples have that value. If a graph includes outliers, click Select a range.
Select the Min and Max cut-off values, either by dragging the sliders in the graph or by entering values in the boxes.
All values above Max or below Min are considered outliers.
Click Apply changes.
This will remove every sample that includes an outlier value from the whole dataset version.
You’ll see how many samples remain in the dataset at the top of the Datasets view. It’s the same value as the total number of samples in the subset pane on the Datasets overview page.
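The procedure above can be sketched in code as follows. This is only an illustration under stated assumptions: the CSV content, the feature names, the cut-off values in `RANGES` and the `apply_ranges` helper are all made up, and values equal to Min or Max are kept because only values above Max or below Min count as outliers.

```python
import csv
import io

# Hypothetical CSV content with two numeric features.
CSV_TEXT = """temperature,humidity
21.5,40
22.1,42
-999,41
23.0,300
22.7,39
"""

# Hypothetical Min and Max cut-off values chosen for each feature.
RANGES = {"temperature": (-50.0, 60.0), "humidity": (0.0, 100.0)}


def apply_ranges(rows, ranges):
    """Keep rows whose value lies within [min, max] for every listed feature."""
    kept = []
    for row in rows:
        if all(low <= float(row[name]) <= high
               for name, (low, high) in ranges.items()):
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
remaining = apply_ranges(rows, RANGES)

print(f"{len(remaining)} of {len(rows)} samples remain")  # 3 of 5 samples remain
```

Comparing the number of remaining samples with the original count is a quick way to check that your cut-off values haven't removed more data than you intended.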
CSV and numeric only
Outlier handling is currently limited to CSV files and numeric features. A zip file that contains a CSV file with numeric features will also work.
Where to find the outlier handling feature on the platform
Outlier handling is available in the Data cleaning tab in the Datasets view.