Data preprocessing means transforming raw data into a form that deep learning models can use effectively.
There are different methods you can apply to preprocess your data. Some of them can be applied on-platform, while others should be applied off-platform before you import your data. Keep in mind that different methods are useful for different use cases and data types.
Regardless of whether you’re using on- or off-platform solutions, there are some general steps you can follow to preprocess your data:
Apply the changes to the whole dataset
Make sure that you apply the changes not only to your training data but also to your test and validation data.
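As a minimal sketch of this step (with made-up numbers), you fit a transformation, here standardization, on the training split only, and then reuse the same fitted parameters on the validation and test splits:

```python
# Sketch: fit a standardization step on the training split only,
# then apply the same fitted parameters to validation and test data.
# The values below are hypothetical.
from statistics import mean, pstdev

train = [10.0, 12.0, 14.0, 16.0, 18.0]
val = [11.0, 15.0]
test = [13.0, 17.0]

# Compute the scaling parameters from the training data only.
mu = mean(train)
sigma = pstdev(train)

def standardize(values, mu, sigma):
    """Apply the *training* mean and std to any split."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train, mu, sigma)
val_scaled = standardize(val, mu, sigma)
test_scaled = standardize(test, mu, sigma)
```

The key point is that `mu` and `sigma` come from the training data alone; recomputing them per split would make the splits incomparable.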
Have good domain knowledge
You need good domain knowledge of your use case and dataset. Most data improvements, such as selecting the right features, depend on that knowledge.
Explore your dataset
To preprocess your data, you first have to explore it and try to understand it. Look for anomalies in your dataset, justify your feature selection, and assess whether you have selected the right dataset for your use case.
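A first pass at exploration can be as simple as computing summary statistics per feature. The sketch below uses only the standard library and a hypothetical age column; a large gap between mean and median, or an extreme maximum, hints at an anomaly:

```python
# Sketch of a first-pass exploration; the column values are hypothetical.
from statistics import mean, median
from collections import Counter

ages = [23, 25, 25, 31, 40, 40, 40, 199]  # 199 looks like an anomaly

print("count:", len(ages))
print("mean:", mean(ages), "median:", median(ages))
print("most common:", Counter(ages).most_common(2))
print("min/max:", min(ages), max(ages))  # the max reveals the suspicious value
```

Here the mean (52.875) is pulled well above the median (35.5) by a single extreme value, which is exactly the kind of signal exploration should surface.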
Watch out for inaccuracies
Once your dataset is ready, look out for inaccuracies in it. Check that it contains correct information.
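One way to check for inaccuracies is a set of sanity rules per feature. The rules, column names, and valid ranges below are assumptions for illustration:

```python
# Minimal sketch of a sanity check for inaccurate values,
# using hypothetical column names and valid ranges.
rows = [
    {"age": 34, "country": "SE"},
    {"age": -2, "country": "SE"},   # a negative age is clearly inaccurate
    {"age": 51, "country": "XX"},   # an unknown country code
]
valid_countries = {"SE", "NO", "DK"}

def find_inaccurate(rows):
    """Return (row index, reason) pairs for rows that fail a rule."""
    bad = []
    for i, row in enumerate(rows):
        if not (0 <= row["age"] <= 120):
            bad.append((i, "age out of range"))
        if row["country"] not in valid_countries:
            bad.append((i, "unknown country"))
    return bad
```

Flagged rows can then be corrected or removed before training.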
Watch out for imbalanced data
You should watch out for imbalanced data. An imbalanced dataset is not necessarily a bad thing, but it is important to understand why it is imbalanced and how to deal with it.
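A quick way to detect imbalance is to count the class labels; one simple (though naive) way to deal with it is to oversample the minority class. The labels below are made up, and in practice you would also consider class weights or more careful resampling:

```python
# Sketch: detect class imbalance and naively oversample the minority class.
import random
from collections import Counter

labels = ["ok"] * 95 + ["defect"] * 5  # hypothetical 95/5 imbalance
counts = Counter(labels)
majority = counts.most_common()[0]   # ("ok", 95)
minority = counts.most_common()[-1]  # ("defect", 5)

random.seed(0)  # reproducible resampling
minority_samples = [l for l in labels if l == minority[0]]
# Duplicate minority examples until both classes have the same count.
oversampled = labels + random.choices(minority_samples, k=majority[1] - minority[1])
```

After oversampling, both classes appear 95 times, so a model can no longer reach 95% accuracy by always predicting the majority class.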
Handle missing values
Missing values are a very common issue that needs to be identified and handled. They often result from systemic issues, and the best approach depends very much on the data and the task.
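Two common handling strategies are dropping the affected rows or imputing a replacement value. A minimal sketch for a numeric feature, with missing entries represented as `None` and made-up values:

```python
# Sketch of two common strategies for a numeric feature with missing values.
from statistics import mean

values = [4.0, None, 6.0, 8.0, None]

# Strategy 1: drop the entries with missing values.
dropped = [v for v in values if v is not None]

# Strategy 2: impute missing entries with the mean of the observed values.
fill = mean(dropped)
imputed = [v if v is not None else fill for v in values]
```

Dropping is safest when few rows are affected; imputation keeps the rows but can distort the distribution if many values are missing.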
Watch out for outliers
Outliers are values that differ drastically from the rest of the values in a dataset feature. They are not necessarily bad, but in some cases you might want to remove them.
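One widely used heuristic for flagging outliers is the 1.5 × IQR rule. The data and the 1.5 multiplier below are illustrative assumptions, not a universal recipe:

```python
# Sketch of outlier detection with the common 1.5 * IQR rule.
def quartiles(sorted_vals):
    # Simple quartile estimate: medians of the lower and upper halves.
    n = len(sorted_vals)
    half = n // 2
    lower, upper = sorted_vals[:half], sorted_vals[-half:]
    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2
    return med(lower), med(upper)

data = sorted([7, 8, 8, 9, 9, 10, 10, 11, 42])
q1, q3 = quartiles(data)
iqr = q3 - q1
outliers = [v for v in data if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

Whether a flagged value like 42 should actually be removed is a judgment call that depends on your domain knowledge.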
Watch out for data leakage
Data leakage means that the model has access to data at a stage where it shouldn’t. For example, it occurs when the same rows appear in both the training set and the validation set: the model is then validated on examples it has already seen. The solution is simply to remove the duplicates.
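The duplicate case can be avoided by deduplicating before splitting into training and validation sets. The rows below are hypothetical, and the 67/33 split ratio is an arbitrary choice for illustration:

```python
# Sketch: remove duplicate rows *before* splitting into train/validation,
# so the same example cannot end up in both sets. Rows are hypothetical.
rows = [("cat.jpg", "cat"), ("dog.jpg", "dog"), ("cat.jpg", "cat"), ("owl.jpg", "owl")]

# Deduplicate while preserving order.
seen = set()
unique_rows = []
for row in rows:
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

split = int(len(unique_rows) * 0.67)
train_rows, val_rows = unique_rows[:split], unique_rows[split:]
```

Because the split happens after deduplication, no row can appear in both `train_rows` and `val_rows`.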
Reduce the noise
Noise is information contained in the data that is not relevant to modeling the relationship between the input data and the target you wish to learn. Noise makes it difficult for deep learning models to learn from raw data, so it is important to remove as much of it as possible.
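For 1-D signals, a simple moving average is one basic way to reduce noise. The window size is a tunable assumption, and the signal values are made up:

```python
# Sketch: reduce noise in a 1-D signal with a simple moving average.
def moving_average(signal, window=3):
    """Average each value with its neighbors inside a sliding window."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        chunk = signal[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

noisy = [1.0, 9.0, 1.0, 9.0, 1.0]
smooth = moving_average(noisy)
```

The smoothed signal swings far less than the noisy one, at the cost of also blurring any genuinely sharp changes, so the window size is a trade-off.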
Watch out for bias
Bias is when a model’s predictions are skewed and wrongly favor certain things or people. You should watch out for and avoid different types of bias.
Read more about data preprocessing
There are different data preprocessing methods you can use for different types of data, both on- and off-platform. Here are the in-depth articles on how to preprocess your dataset: