Tabular data preprocessing
Tabular data is displayed in tables with columns and rows. In tabular data, each column is a feature. The features are used as inputs or as the target of the deep learning model.
Which feature encodings to use?
The feature encoding determines the way data is turned into numbers. Different encodings will give different representations of the data. In the platform, there are five encodings: categorical, binary, numeric, image, and text.
When working with multiple inputs in tabular data, only features with categorical, binary, or numeric encoding can be used, so you have to select features with one of these encodings.
Tabular data exploration
To preprocess your tabular data, you first have to explore your data and try to understand it. Choosing the right features for your dataset is not always straightforward.
Have a good domain knowledge
To be able to select the right features, you need to have a good understanding of the overall content of your data and have good domain knowledge of your use case and dataset.
For instance, when you are predicting the revenue of a shop, what should be the input features that impact the revenue as output? The number of units sold? Price of the products, store names, location of the stores, or something similar?
Watch out for inaccuracies
Once you have your dataset ready, you should watch out for inaccuracies in your dataset, such as latitude values ending up in the longitude column. You should also handle inconsistencies in your dataset.
For example, you should have one value representing the same gender; don't use both "F" and "Female".
Watch out for data leakage
Data leakage means that the model has access to specific data at a stage that it shouldn’t. In the context of tabular data, this could be duplicate rows existing both in the training set and the validation set. The solution for this would simply be to remove duplicate rows.
Data leakage can also happen if one column functions as a cheat sheet for what you want to predict.
Example: You are trying to classify whether a store will have more than 50 customers on a daily basis. Your target feature in the dataset is a column with a binary feature indicating whether this is true or not (1/0).
If one of your input features is a column containing the total number of customers on a daily basis, that column acts like a cheat sheet for the model and you get data leakage. The model would simply focus on the customer-count column when making predictions and likely ignore other, more reasonable columns.
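As an off-platform sketch of both fixes, assuming a pandas DataFrame with hypothetical store, daily_customers, and busy columns: remove duplicate rows (including rows the validation set shares with the training set) and drop the cheat-sheet column before training.

```python
import pandas as pd

# Hypothetical training and validation splits; "busy" is the target,
# "daily_customers" is the cheat-sheet column.
train = pd.DataFrame({
    "store": ["A", "B", "B", "C"],
    "daily_customers": [120, 30, 30, 75],
    "busy": [1, 0, 0, 1],
})
val = pd.DataFrame({
    "store": ["B", "D"],
    "daily_customers": [30, 40],
    "busy": [0, 0],
})

# Remove duplicate rows within the training set.
train = train.drop_duplicates()

# Remove validation rows that also appear in the training set.
merged = val.merge(train, how="left", indicator=True)
val = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Drop the cheat-sheet column that directly reveals the target.
train = train.drop(columns="daily_customers")
val = val.drop(columns="daily_customers")
```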
Once you have manually inspected your dataset, it can be a good idea to preprocess your data.
Off-platform tabular data preprocessing
These are off-platform methods that you can use to preprocess your data before importing your dataset to the platform. How you preprocess your tabular data is up to you; for example, you can use Python libraries to apply the following methods.
To explore your dataset, you can start by visualizing your data and looking at the distribution of the features and labels. Watch out for imbalanced data, outliers, and anomalies in the dataset.
Study each feature separately and look at the relationships between different features. It is also a good idea to make scatter plots between your features and your labels to see if there are any clear trends. If there are no clear trends, your dataset could be harder for the model to solve and may need some feature engineering.
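A minimal sketch of such an exploration, assuming a pandas DataFrame with hypothetical units_sold, price, and revenue columns: the correlation matrix quantifies feature-label trends, and a scatter matrix visualizes them.

```python
import pandas as pd

# Hypothetical dataset: two features and a numeric label (revenue).
df = pd.DataFrame({
    "units_sold": [10, 20, 30, 40, 50],
    "price": [5.0, 4.5, 4.0, 3.5, 3.0],
    "revenue": [50, 90, 120, 140, 150],
})

# Pairwise Pearson correlations; values near +1 or -1 suggest a clear trend.
corr = df.corr()
print(corr["revenue"])

# For a visual check, pandas can draw a scatter matrix:
# pd.plotting.scatter_matrix(df)  # requires matplotlib
```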
Inspect statistical values
Look at statistical values like the variance, which indicates how much the values of a feature vary. If the variance is zero, all values within that feature are the same. In that case, there is no pattern for the deep learning model to learn, and you should exclude that feature from your dataset.
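The zero-variance check can be automated. A sketch, assuming a pandas DataFrame with a hypothetical constant currency_code column:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [3.0, 4.5, 2.0, 5.5],
    "currency_code": [1, 1, 1, 1],  # constant column: variance is zero
})

# Drop features whose variance is zero; they carry no learnable pattern.
variances = df.var(numeric_only=True)
df = df.drop(columns=variances[variances == 0.0].index)
```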
Handling missing values
Missing values are a very common issue that needs to be identified and handled. They can be handled in different ways:
Remove the entire row
This can be done both on and off the platform.
Replace missing values with mean, median or mode
This is one of the most common methods to replace missing values while avoiding significant changes to the data. Common practice is to replace missing numeric values with the column mean and missing categorical values with the column mode.
Replace the missing values with new populated values
For example, if the values in your dataset range between 0 and 100, you can generate numbers within that range and use them to replace the missing values.
Replace the missing values with zero
For example, in a dataset with lactose-free milk consumption by cafes, replacing the missing values with zero might be useful under the assumption that lactose-free milk hasn't been consumed in those cafes. However, you need to be careful in cases where 0 has a significant meaning for the model. You should also ensure that the replacement value doesn't collide with meaningful values in the data (e.g. -1, 9999).
Replace the missing values with new values
This is sometimes done for categorical values, where missing values are replaced with a new “unknown” category. Missing values are oftentimes the result of systemic issues and which approach to use depends very much on the data and the task. Domain expertise for the specific problem is very valuable when deciding how to deal with missing values.
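The replacement strategies above can be sketched with pandas, assuming a DataFrame with hypothetical units_sold and store_type columns:

```python
import pandas as pd

df = pd.DataFrame({
    "units_sold": [10.0, None, 30.0, 50.0],
    "store_type": ["mall", None, "street", "street"],
})

# Numeric column: replace missing values with the column mean.
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].mean())

# Categorical column: either use the mode...
# df["store_type"] = df["store_type"].fillna(df["store_type"].mode()[0])
# ...or introduce a new "unknown" category.
df["store_type"] = df["store_type"].fillna("unknown")
```

Which variant fits depends on the data and the task; as the text notes, domain expertise should guide the choice.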
During data exploration, you should also watch out for inconsistencies in data. For example, a dataset where salary is populated in the location column. There can also be duplicated data, incorrect spellings, and much more. Inconsistencies should be removed.
By default, the Peltarion Platform uses class weights in order to deal with imbalanced datasets. In some cases this is sufficient; however, in cases where the dataset needs to be balanced, some off-platform approaches you can follow are: augmenting the dataset, over-sampling (duplicating samples from the minority class), under-sampling (removing samples from the majority class), and acquiring more data.
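Over- and under-sampling can be sketched with pandas, assuming a hypothetical binary label column with an 8:2 class imbalance:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 8 + [1] * 2,  # 8:2 imbalance
})

minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Over-sampling: duplicate minority samples (with replacement)
# until they match the majority class size.
oversampled = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])

# Under-sampling: drop majority samples down to the minority class size.
undersampled = pd.concat([
    majority.sample(len(minority), random_state=0),
    minority,
])
```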
Binning your data
Binning means reducing the number of targets/values your model has to learn by grouping the dataset examples into bins of similar values.
Binning might help to improve the model performance by reducing the effects of small observation errors and reducing the noise in your data.
Binning also changes your problem from predicting an exact value to predicting which bin the value falls into. In some cases this might be something you want; in others, such as stock market prediction, it may be too imprecise.
Example: An income dataset can be binned into 3 bins: 0-20,000 SEK, 20,000-40,000 SEK, and 40,000-60,000 SEK. By binning, you reduce the number of possible income values from as many as 60,000 to just 3.
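This binning can be done off-platform with pandas, assuming a hypothetical series of incomes in SEK:

```python
import pandas as pd

incomes = pd.Series([12_000, 25_000, 41_000, 58_000, 19_999])

# Bin the incomes into 3 ranges; each value is replaced by its bin label.
bins = [0, 20_000, 40_000, 60_000]
labels = ["low", "medium", "high"]
binned = pd.cut(incomes, bins=bins, labels=labels)
```

Note that pd.cut treats bin edges as right-inclusive by default, so 20,000 falls into the "low" bin.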
On-platform tabular data preprocessing
Once you import your dataset, the platform will display the dataset view. Depending on how you want to handle your data, you can follow any number of steps presented in the dataset view.
Deselect irrelevant or redundant features in the Experiment wizard
The dataset’s features are the columns in the dataset matrix. The features are used in the Input or Target block in the Modeling view.
You can deselect irrelevant or redundant features in the Experiment wizard. By removing some of the features and only leaving the essential ones, your model becomes less complex and trains faster. This is also where you would exclude columns that could lead to data leakage.
Create feature set to narrow down the dataset
You can create feature sets by combining different features with the same encoding. This is useful when you want to narrow down a massive dataset with many features.
Select the right feature encoding
In the dataset view, you can look at the dataset features, and ensure that you select the right feature encodings.
For example, when you have a finite amount of store names in the store name column, you should choose categorical encoding instead of text encoding.
Distinction between integer and float numbers: The distinction is important because the platform interprets a column of integers as a categorical feature, which might be undesirable if the feature describes, for example, age.
Look at feature distribution
Above each feature, you can see the feature distribution, label, shape, and count. The feature distribution helps you to understand the kind of feature you are dealing with and what values you can expect this feature to have.
By looking at the distribution of a feature/column, you can detect how common certain values are in your data and if your data is imbalanced.
One of the tabs in the dataset view is the features tab. In the features tab, you can see the samples per class, number of unique values, most common values, mean value, and standard deviation.
Check for imbalanced data
You should watch out for imbalanced data. An imbalanced dataset occurs when the class distribution is not balanced. For example, in an image classification problem with defective and non-defective products, if 90% of the products are non-defective, then this dataset is imbalanced.
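Off-platform, a quick way to check the class distribution is to look at the relative class frequencies, sketched here with pandas for a hypothetical defective/non-defective label column:

```python
import pandas as pd

# Hypothetical labels: 90% non-defective ("ok"), 10% defective.
labels = pd.Series(["ok"] * 90 + ["defective"] * 10)

# Relative class frequencies reveal the imbalance at a glance.
distribution = labels.value_counts(normalize=True)
print(distribution)
```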
It is not necessarily a bad thing to have an imbalanced dataset, but it is important to be aware of the reasons you have an imbalanced dataset and how to deal with it. By default, the Peltarion Platform uses class weights in order to deal with imbalanced datasets.
Normalize your data
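Normalization rescales numeric features to a comparable range so that no feature dominates training purely because of its unit or magnitude. A minimal off-platform sketch of two common variants, min-max scaling and standardization, assuming a pandas DataFrame with a hypothetical price column:

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 4.0, 6.0, 8.0]})
col = df["price"]

# Min-max scaling: map values into the [0, 1] range.
df["price_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): zero mean, unit standard deviation.
df["price_z"] = (col - col.mean()) / col.std()
```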
Remove rows with missing values
If you upload a CSV file with empty values or malformed rows, the platform will help you clean the dataset. The cleaning tool in the Data cleaning tab can help you remove rows with empty values in your input data.
Remove outliers
Outliers are values that differ drastically from the rest of the values in a dataset feature. When you have outliers in your data and you want to remove them, you can remove them from the outliers tab.
Rephrasing your problem
One very common approach when dealing with, for example, tricky regression problems is to rephrase them as a simpler classification problem.
Example: Suppose you are trying to predict the revenue of a store, but what you are truly interested in is whether the revenue is above a certain threshold. In that case, it might be worth introducing a new binary column whose classes indicate whether the revenue is above the threshold or not. This column can then be used as the target for a new classification problem, potentially yielding better results.
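Creating such a binary target column can be sketched with pandas, assuming a hypothetical revenue column and threshold:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [12_000, 55_000, 47_000, 80_000]})

# Rephrase regression as classification:
# 1 if revenue exceeds the threshold, else 0.
THRESHOLD = 50_000
df["above_threshold"] = (df["revenue"] > THRESHOLD).astype(int)
```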