Text data preprocessing

Text data represents words, sentences, paragraphs, or any free-flowing text such as customer feedback, book paragraphs, chat messages, and tweets.

Text data often occurs in an unstructured format and includes a lot of noise, which makes it difficult for deep learning models to process raw text. It is therefore very important to clean the noise from the text data and transform the unstructured text into a more structured format.

There are different methods you can apply to preprocess your data. While some of these preprocessing methods are applicable on the platform, others should be applied before importing your data.

Different methods can be useful for different use cases. Make sure that you apply the changes not only to your training data but also to your test and validation data.

Off-platform text data preprocessing

There are many useful methods to apply; choose the ones that suit your use case.

How you preprocess your text data is up to you; you can, for example, use Python libraries to apply these methods to your text data.

Some common steps that you can follow to preprocess your text data are:

  • Fix the typos
    It is very common to have spelling mistakes in text data. Fixing those mistakes is a very simple and effective way to reduce the noise.

  • Remove the duplicates
    Sometimes there are duplicates in the data, and they can be removed. Duplicates can occur within a word, such as 'seeee', and also in sentences, such as 'go go go go'.

  • Remove the bullet point symbols
    Bullet point symbols can be removed and the bullet points split into separate sentences instead.

  • Remove the multiple spaces
    Multiple spaces are often unnecessary and don’t help the model to learn so they can be removed.

  • Remove the punctuation
    Depending on the use case, punctuation can be removed. In tasks such as text classification and similarity models, punctuation can be removed since it has no use.
    Example: There is no meaningful difference for a model between 'I love pink.' and 'I love pink!'. Therefore, the punctuation can be removed.

  • Remove special characters
    Often, social media datasets have hashtags or other special characters such as @, <>, -, _, +, =, `, ~.
    These characters do not add any value to the algorithm's understanding, and they increase the noise in the text data. Therefore, they can be removed. However, if you want to classify tweets, make sure not to remove the words following the hashtag symbols.

  • Remove newline characters
    Newline characters and tabs ("\n", "\t") can be removed.

  • Remove special UTF characters
    Special UTF characters are often encoded in forms starting with &# and displayed to the user as ı, ç, ö and similar. Special UTF characters can also be removed.

  • Remove emojis and emoticons
    An emoticon (from emotion and icon) is a combination of characters used to express a feeling or tone, such as a smile :-) .
    Removing emojis and emoticons can often help reduce noise. However, it is not always the right decision, for example in sentiment analysis. In those cases, emojis can be replaced with words.

  • Remove HTML and URLs
    HTML tags can end up in the text when collecting data from raw internet pages. If the text within the tags does not provide value, it should be removed. If the text does provide value, you should extract the text from within the tags.

  • Stemming
    Stemming reduces a word to its root, such as 'going' to 'go'. Although stemming can be useful for machine learning models, it is not as important for deep learning models. There is also a risk of removing more than intended.

  • Lemmatization
    Lemmatization means transforming a word into its base form, such as 'better' to 'good'.
    The difference between lemmatization and stemming is that stemming turns a word into its root by cutting off suffixes ('going' to 'go'), while lemmatization transforms the word into its dictionary base form ('better' to 'good').

  • Lower casing
    Algorithms may treat words with lower and upper cases (Book and book) differently. When the use case is not case sensitive, lower casing might help to improve the model's performance.
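Several of the steps above can be sketched with Python's built-in re module. This is a minimal, illustrative example, not a complete pipeline; the regular expressions, their order, and the example input are assumptions that you should adapt to your own data:

```python
import re

def clean_text(text: str) -> str:
    """Apply a few of the preprocessing steps above (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[\n\t]", " ", text)         # remove newline and tab characters
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation and special characters
    text = text.lower()                         # lower casing
    text = re.sub(r"\s+", " ", text).strip()    # collapse multiple spaces
    return text

print(clean_text("I love <b>pink</b>!\nSee https://example.com  #flowers"))
# → i love pink see flowers
```

Note that the punctuation step removes the hashtag symbol but keeps the word that follows it, in line with the advice on classifying tweets above.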

Find patterns in your dataset

In addition to these steps, you should always look at the data, find patterns, see what those patterns look like, and decide whether you should take any action on them, such as removing them.

Machine learning models learn from patterns, and if the patterns are not what your model should learn, they can create bias in the data. Therefore, it is very important to look at the data, identify such patterns, and remove them.

Example: In an image classification model to classify different flowers, the model was very good at predicting snowdrop flowers. However, the model was good not only because of the flower itself, but also the snow in the background in the images. All the images with snowdrop flowers included snow in the background, and the model learned a pattern that it shouldn’t learn.

What method to use

Most of these methods are useful to apply, but some more than others. For instance, if you have a model that is trained to detect specific words or sentences, then stemming may remove those words. The same goes for lower casing: some words, such as names of people, public holidays, and company names, may be case sensitive.

The reason is that language models need to represent words somehow, so instead of storing all the words you could have, they break words down into tokens. Therefore, when a model sees the word "wordlessness", it might represent it with the tokens "word", "less", "ne", "ss".

The fewer tokens the model needs to represent each word, the stronger connection it can build to how words are related. What most of these techniques do is to remove uncommon, misspelled, and unknown characters/tokens for the model, so that it can learn to better represent all words as tokens.

For rare words the model was not trained on, such as Peltarion, it might represent it as tokens: "Pel", "ta", "r", "ion".

Also, if you allow the model to include capital letters, then you will have new tokens for when it sees words with upper- and lower-case letters: Peltarion could be tokenized differently from peltarion.
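To make the idea concrete, here is a toy greedy longest-match subword tokenizer. The vocabulary below is made up for illustration; real tokenizers (for example BPE or WordPiece) learn their vocabularies from data and handle many details this sketch ignores:

```python
def tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match subword tokenization (illustrative, not real BPE/WordPiece)."""
    tokens = []
    i = 0
    while i < len(word):
        # Find the longest vocabulary entry that matches at position i,
        # falling back to a single character if nothing matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or len(piece) == 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Toy, hand-picked vocabulary; a real tokenizer's vocabulary is learned from data
vocab = {"word", "less", "ne", "ss", "Pel", "ta", "ion"}

print(tokenize("wordlessness", vocab))  # → ['word', 'less', 'ne', 'ss']
print(tokenize("Peltarion", vocab))     # → ['Pel', 'ta', 'r', 'ion']
print(tokenize("peltarion", vocab))     # → ['p', 'e', 'l', 'ta', 'r', 'ion']
```

The last two lines show the casing point: because the toy vocabulary only contains "Pel", the lower-cased word falls apart into single characters and is tokenized very differently.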

For more details on the subject, you can check this blogpost where our colleague Markus explains why we need tokenizers.

On-platform text data preprocessing

In the platform, you can change the sequence length settings. Sequence length is the number of tokens (roughly the number of words) kept during text tokenization.

If there are more tokens in an example than the sequence length, the text will be truncated, i.e., cut from the end to fit the sequence length.
If there are fewer tokens, the text will be padded. Smaller sequences compute faster, but you might lose information.
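The truncation and padding behaviour can be sketched in a few lines of Python. The function name and the "[PAD]" placeholder are assumptions for illustration; the actual padding token depends on the tokenizer used:

```python
def fit_to_sequence_length(tokens, seq_len, pad_token="[PAD]"):
    """Return exactly seq_len tokens: truncate from the end, or pad with pad_token."""
    if len(tokens) > seq_len:
        return tokens[:seq_len]  # truncate: cut from the end
    return tokens + [pad_token] * (seq_len - len(tokens))  # pad to seq_len

print(fit_to_sequence_length(["i", "love", "pink"], 5))
# → ['i', 'love', 'pink', '[PAD]', '[PAD]']
print(fit_to_sequence_length(["a", "b", "c", "d", "e", "f"], 5))
# → ['a', 'b', 'c', 'd', 'e']
```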
