Big news! We’re very excited to launch a new beta feature: text classification, which covers use cases such as sentiment analysis. This is the first of many natural language processing (NLP) capabilities we’ll be adding to the Peltarion Platform.
Natural language processing capabilities are here!
Text is arguably the most widespread data type among high-potential deep learning use cases, and being able to build deep learning models for text data opens up many more opportunities for our users. Transfer learning from language models makes it possible to train a good, generalizable model with far fewer task-specific examples than would be required to train a model from scratch.
Transfer learning has been possible with text data for many years, but the “new wave” of language models such as ULMFiT and BERT vastly outperforms earlier methods across a broad set of use cases, including classification, semantic text similarity, question answering and named entity recognition. We will keep adding capabilities as we go, so stay tuned!
Eager to get started? Try our step-by-step tutorial to learn how to solve an NLP text classification problem on the platform.
What does this mean for the future of building deep learning models on the platform?
You can now upload sentences and paragraphs of text to the platform in your CSV file. To help the parser, make sure the text in the text feature column is enclosed in double quotes before you upload the CSV file. For example: "This is a sentence for the 'text feature' example used in the CSV!"
The rest of the CSV rules apply as before. For example, all columns should be well formatted and shouldn’t contain empty or NULL values.
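As a rough illustration of the CSV rules above, here is a minimal sketch that writes a text classification dataset with every field double-quoted, using Python's standard `csv` module. The column names `text` and `label`, the sample rows, and the helper `write_quoted_csv` are made up for this example; they are not platform requirements.

```python
import csv
import io

# Hypothetical sample rows: (text, label). Names and values are illustrative only.
rows = [
    ("This is a sentence for the 'text feature' example used in the CSV!", "positive"),
    ('He said "not great", then left.', "negative"),
]


def write_quoted_csv(rows, handle):
    """Write rows so that every field is wrapped in double quotes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["text", "label"])
    for text, label in rows:
        # Refuse empty values, since the dataset should not contain any.
        if not text.strip() or not label.strip():
            raise ValueError("empty value found in row")
        writer.writerow([text, label])


buffer = io.StringIO()
write_quoted_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With `csv.QUOTE_ALL`, commas inside the text stay within one field, and any double quote inside the text is escaped by doubling it, which is the usual CSV convention.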
Before saving the dataset version for text classification, you should set the language model you want the text feature to be encoded with. This is needed to produce the tokens for each text sample. Currently, we provide English, Swedish and Finnish language models, and even punctuation is tokenized. The Sequence length parameter lets you define the maximum number of words used per sample of the text feature. If a sample has more words than that, we truncate it by leaving out the words at the end. If a sample has fewer words than the maximum length, we pad it with zeros at the end when encoding the words into integer vectors.
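To make the truncation and padding behavior concrete, here is a small sketch of the idea. It is not the platform's tokenizer: the regular-expression tokenizer, the vocabulary builder, and the reserved ids `PAD_ID` and `UNK_ID` are all assumptions chosen for illustration.

```python
import re

PAD_ID = 0  # assumed id used for padding
UNK_ID = 1  # assumed id for tokens missing from the vocabulary


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(samples):
    """Assign each distinct token an integer id, starting after the reserved ids."""
    vocab = {}
    for sample in samples:
        for token in tokenize(sample):
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(text, vocab, sequence_length):
    """Encode text as exactly sequence_length integers."""
    ids = [vocab.get(token, UNK_ID) for token in tokenize(text)]
    ids = ids[:sequence_length]                      # drop tokens past the limit
    ids += [PAD_ID] * (sequence_length - len(ids))   # pad short samples at the end
    return ids


samples = ["Great movie!", "Not good, not bad, just very long."]
vocab = build_vocab(samples)
encoded = [encode(sample, vocab, sequence_length=5) for sample in samples]
print(encoded)
```

The first sample has only three tokens, so it ends with two zeros. The second has ten tokens, so only its first five are kept.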
Build a text classifier
Unlike images, text is a one-dimensional data type. To make processing text easier, we have now also added 1D convolution and pooling capabilities to the model builder.
When using text as the input feature, make sure to add the new Text embedding (beta) block before the convolution blocks. On the Text embedding (beta) block, you define how your input text is tokenized before feeding it to the neural network. You can use a randomly initialized embedding or one of the CC-licensed pre-trained FastText embeddings.
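For readers who want to see what an embedding followed by a 1D convolution and pooling does to a sequence of token ids, here is a plain-Python sketch. It is not how the platform implements these blocks: the function names, the single-filter convolution with ReLU, the global max pooling, and the zeroed padding row are all assumptions made to keep the example short.

```python
import random


def make_embedding(vocab_size, dim, seed=0):
    """Randomly initialized embedding table; row 0 (padding) is set to zeros."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
    table[0] = [0.0] * dim  # assumption: padding contributes nothing
    return table


def embed(token_ids, table):
    """Look up one vector per token id."""
    return [table[token_id] for token_id in token_ids]


def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter along the token axis (no padding, stride 1), then apply ReLU.

    sequence: list of T vectors, each of length D
    kernel:   list of K vectors, each of length D
    Returns a list of T - K + 1 activations.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            for x, w in zip(sequence[start + offset], kernel[offset]):
                total += x * w
        outputs.append(max(total, 0.0))
    return outputs


def global_max_pool(features):
    """Keep the strongest activation across all positions."""
    return max(features)


token_ids = [2, 3, 4, 0, 0]             # e.g. an encoded, padded sample
table = make_embedding(vocab_size=13, dim=4)
kernel = [[0.5] * 4 for _ in range(3)]  # one hypothetical filter spanning 3 tokens
feature_map = conv1d(embed(token_ids, table), kernel)
pooled = global_max_pool(feature_map)
print(len(feature_map), pooled)
```

Because the filter only moves along one axis (the token positions), five tokens and a filter width of three give three output positions. Pooling then reduces those to a single value per filter.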