Feature Distribution

The distribution of a feature over its range, with value on the horizontal axis and frequency on the vertical axis.

For string and categorical encoded integer features, the feature distribution is displayed as a bar chart, with the label with the highest count on the left.
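
As a rough illustration of how these bar-chart counts could be reproduced outside the Platform, the sketch below uses pandas; the column name "color" and the example values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical categorical feature; "color" is just an example column name.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# value_counts() orders the labels by count, highest first -- the same
# ordering as the bar chart, which puts the highest count on the left.
counts = df["color"].value_counts()
print(counts)  # red: 3, blue: 2, green: 1
```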

For float, float tensor, non-categorical encoded integer, and image features, the feature distribution is displayed as a histogram. This is done by dividing the range into a number of bins and calculating how many data points fall within the boundaries of each bin.
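
A minimal sketch of this binning step, assuming NumPy and synthetic data; the number of bins (20) is an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical float feature values.
values = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=1_000)

# Divide the range into bins and count how many values fall inside each bin --
# the same computation behind the histogram view.
counts, bin_edges = np.histogram(values, bins=20)

for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"[{left:.2f}, {right:.2f}): {count}")
```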

If you hover over a bin, you’ll see the count and the range for that bin.

Understanding the feature distribution

The feature distribution helps you understand what kind of feature you are dealing with and what values you can expect it to have. You’ll see whether the values are clustered or scattered.

Why is the distribution important? Because deep learning models learn from data. Give them incorrect data and they will learn incorrectly. A deep learning model is only as good as the data you feed it.

You can also see if the training and validation subsets differ. If they look pretty much the same, that’s a good sign.
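
One simple way to compare the two subsets outside the Platform is to build both histograms over the same bin edges, as in this sketch; NumPy and the synthetic data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical training and validation values for one feature.
train = rng.normal(0.0, 1.0, size=8_000)
val = rng.normal(0.0, 1.0, size=2_000)

# Use the same bin edges for both subsets so the histograms are comparable.
edges = np.histogram_bin_edges(np.concatenate([train, val]), bins=30)
train_hist, _ = np.histogram(train, bins=edges, density=True)
val_hist, _ = np.histogram(val, bins=edges, density=True)

# A large per-bin difference hints that the subsets are distributed differently.
print("max per-bin difference:", np.abs(train_hist - val_hist).max())
```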

Normal distribution

If a feature can be seen as a random variable, and enough data is used and the bins are narrow enough, the distribution may look bell-shaped. This is called a normal (or Gaussian) distribution. The normal distribution is a very important probability distribution that arises in many situations. Generally speaking, when you have a large number of independent samples from a naturally occurring phenomenon, the data will often follow an approximately normal distribution.
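
A small demonstration of this effect: each observation below is the mean of many independent uniform draws, and the resulting histogram is close to bell-shaped. NumPy, the seed, and the sizes are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each observation is the mean of 50 independent uniform draws; the
# distribution of these means is approximately normal (central limit theorem).
means = rng.uniform(0.0, 1.0, size=(10_000, 50)).mean(axis=1)

counts, edges = np.histogram(means, bins=25)
# Crude text histogram: the bars rise and then fall around the centre.
for count, left in zip(counts, edges[:-1]):
    print(f"{left:.3f} {'#' * (count // 50)}")
```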

Groups of data might form lumps a little off to one side. Such lumps can be of special significance and are worth a closer look.

Erroneous data

By looking at the distribution you can detect erroneous data such as outliers: single data points lying very far away from the rest of the data.

If you have outliers or other erroneous data in the dataset, you could try to normalize the feature or create a new subset that filters out the bad data. If this doesn’t work, remove the outliers outside the Platform and import the dataset again.
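
A minimal sketch of one common way to spot and filter outliers before re-importing, using a z-score rule; NumPy, the synthetic values, and the threshold of 3 standard deviations are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical feature values with a few injected outliers.
values = np.concatenate([rng.normal(50.0, 5.0, size=995),
                         [250.0, 300.0, -120.0, 400.0, 500.0]])

# Flag points more than 3 standard deviations from the mean as outliers;
# the threshold is an arbitrary choice for this sketch.
z = (values - values.mean()) / values.std()
keep = np.abs(z) <= 3

print("outliers found:", np.sort(values[~keep]))
print("kept", keep.sum(), "of", values.size, "values")
```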

Subset statistics

The statistics are approximations based on a sample of the dataset. The sample size is adaptive and increases until the variance of the sample mean is a fraction of the range of the values. A minimal sketch of how these statistics can be computed follows the list below.

  • Mean: Global mean.

  • Min: Minimum value.

  • Max: Maximum value.

  • Unique values: Number of unique elements in the subset, estimated with the HyperLogLog method.

  • Sample size: Number of examples in the subset.

  • StDev: Standard deviation of the subset.
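
A sketch of how these statistics could be computed on a sample, assuming NumPy and the third-party datasketch package as one possible HyperLogLog implementation; the synthetic data and the precision parameter are assumptions.

```python
import numpy as np
from datasketch import HyperLogLog  # one possible HyperLogLog implementation

rng = np.random.default_rng(4)
# Hypothetical sample of a numeric feature drawn from the subset.
sample = rng.integers(0, 500, size=10_000).astype(float)

# HyperLogLog estimates the number of unique values without storing
# every distinct element, which is why it scales to large subsets.
hll = HyperLogLog(p=12)
for v in sample:
    hll.update(str(v).encode("utf-8"))

print("Mean:         ", sample.mean())
print("Min:          ", sample.min())
print("Max:          ", sample.max())
print("Unique values:", int(hll.count()), "(approx.)")
print("Sample size:  ", sample.size)
print("StDev:        ", sample.std())
```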
