Data for deep learning

How much data do I need? It depends!
A better question to ask yourself is: how much data do I need to do what?

There is no exact formula for how much data you need (unfortunately), but the framework below can guide your thinking as you build your dataset.

Complexity

You need to figure out how complex your data is, whether it covers everything it needs to cover, and whether you have enough data for every feature.

Think about the following questions:

  • How complex are my classes or features?
    Example: Binary classification is simple. Multi-label classification needs more data.

  • How independent are the labels of each other?
    Is there overlap, or do the features depend on each other? If there is overlap, the model will struggle to tell the labels apart and will need more data.
    Example: Black or white is simple. 50 shades of grey need more data.
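One rough way to look at label overlap in a multi-label dataset is to count how often pairs of labels appear on the same sample. The sketch below assumes each sample is represented as a set of label names; the function name label_cooccurrence and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(samples):
    """Return, for each pair of labels, the fraction of samples that carry both."""
    pair_counts = Counter()
    for labels in samples:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(labels)), 2):
            pair_counts[pair] += 1
    total = len(samples)
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical multi-label samples, one set of labels per sample.
samples = [
    {"cat", "indoor"},
    {"cat", "indoor"},
    {"dog", "outdoor"},
    {"cat", "outdoor"},
]

overlap = label_cooccurrence(samples)
for pair, share in sorted(overlap.items()):
    print(pair, share)
```

Pairs that co-occur in a large share of the samples are candidates for labels that depend on each other, which is where extra data is most likely to help.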

Coverage

Think about the following questions:

  • What percentage of the total population is represented in this data?
    Beware of biased data.

  • Do you need to add or remove data? The platform gives you some tools to handle this.

  • If your data is old, ask yourself whether it is still relevant.

  • Does the data cover what the model will see in the real world?
    Your data might work really well in the lab, but does it reflect the real world?
    Example: You’ve collected training data in Germany and want to use the trained model in Brazil.

Figure 1. Use the model to make predictions where you have data (the blue dots); don’t use it where you have none.
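A simple sanity check for coverage is to compare the range of a feature in the training data with the values the model meets after deployment. The sketch below is a minimal illustration for a single numeric feature; the function name, the temperature feature, and all values are made up for the example.

```python
def fraction_outside_training_range(train_values, new_values):
    """Return the share of new values that fall outside the training min/max."""
    low, high = min(train_values), max(train_values)
    outside = [v for v in new_values if v < low or v > high]
    return len(outside) / len(new_values)


# Hypothetical feature values: training data from one region,
# new data from the region where the model will be used.
train_temperatures = [-5.0, 3.0, 12.0, 18.0, 24.0]
new_temperatures = [22.0, 27.0, 31.0, 35.0]

share = fraction_outside_training_range(train_temperatures, new_temperatures)
print(f"{share:.0%} of the new values lie outside the training range")
```

A high share suggests the model would be asked to predict in regions where it has seen no data. A min/max comparison is only a first check, since it says nothing about gaps inside the range or about combinations of features.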

Count

Only prepared datasets have exactly the same number of samples for each label. Real-world data is rarely that balanced.

Think about the following questions:

  • Are all classes well represented in this data?

  • Do you have enough samples for each class?

Example: If your dataset consists of 1000 rows but all of them have the same label, the model has nothing to distinguish between and will fail. The platform gives you some tools to handle imbalanced classes, e.g., you can use class weights on the Target block.

Figure 2. Imbalanced dataset where most of the examples are positive.
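To see how imbalanced a dataset is, you can count the samples per class and derive a weight for each class. The sketch below uses a common inverse-frequency heuristic (total samples divided by the number of classes times the class count). It is an illustration only and not necessarily how the platform computes class weights on the Target block; the labels and counts are invented.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical imbalanced dataset: 900 positive and 100 negative examples.
labels = ["positive"] * 900 + ["negative"] * 100

counts = Counter(labels)
weights = class_weights(labels)

for label in counts:
    print(label, counts[label], round(weights[label], 3))
```

With these weights, each class contributes the same total weight to training, so the rare class is not drowned out by the common one. Weights cannot help when a class has no samples at all; in that case you need to collect more data.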

Next step - Peltarion workflow
