
Natural language processing capabilities are here!

September 4, 2019

Big news! We’re very excited to launch a new beta feature for text classification, including sentiment analysis support. This is the first of many natural language processing (NLP) capabilities we’ll be adding to the Peltarion Platform.

Text is arguably the most widespread data type among high-potential deep learning use cases, and being able to build deep learning models for text data opens up many more opportunities for our users. Transfer learning from language models enables training a good, generalizable model with far fewer task-specific examples than would be required to train a model from scratch.

Transfer learning has been possible with text data for many years, but the “new wave” of language models such as ULMFiT and BERT vastly outperform earlier methods across a broad set of use cases, including classification, semantic text similarity, question answering and named entity recognition. We will keep adding capabilities as we go, so stay tuned!

Eager to get started? Try our step-by-step tutorial to learn how to solve an NLP text classification problem on the platform.

Datasets view

What does this mean for the future of building deep learning models on the platform?

Text encoding

Now, you can upload sentences and paragraphs of text to the platform in your CSV file. To ensure the text parses correctly, make sure the text in the text feature column is double-quoted before you upload the CSV file to the platform. For example: “This is a sentence for the ‘text feature’ example used in the CSV!”
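If you generate the CSV programmatically, Python’s built-in csv module can apply the double-quoting for you. A minimal sketch; the review and label column names here are just examples, not platform requirements:

```python
import csv

# Example rows: the "review" column holds free text that must be double-quoted
# in the CSV so commas (and other delimiters) inside it parse correctly.
rows = [
    {"review": "Great product, works as advertised!", "label": "positive"},
    {"review": "Stopped working after a week.", "label": "negative"},
]

with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["review", "label"], quoting=csv.QUOTE_NONNUMERIC
    )
    writer.writeheader()
    writer.writerows(rows)
```

`csv.QUOTE_NONNUMERIC` wraps every non-numeric field in double quotes, which covers the text feature column automatically.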

The rest of the CSV rules apply as before. For example, all columns should be well formatted and shouldn’t contain empty or NULL values.

Before saving the dataset version for text classification, you should set the language model you want the text feature to be encoded with. This is needed to produce the tokens for each text sample. Currently, we provide English, Swedish and Finnish language models, and even punctuation is tokenized. The Sequence length parameter lets you define the maximum number of words used per sample of the text feature. If a sample contains more words, we truncate it, dropping the trailing words. If a sample contains fewer words than the maximum length, we pad it with zeros at the end when encoding the words into integer vectors.
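As an illustration, the capping and zero-padding could be sketched like this; the toy vocabulary and tokenizer below are invented for the example and are not the platform’s actual language models:

```python
SEQUENCE_LENGTH = 6
# Toy vocabulary for illustration only; 0 doubles as the padding id.
vocab = {"<pad>": 0, "this": 1, "is": 2, "a": 3, "sentence": 4, "!": 5}

def encode(text: str, seq_len: int = SEQUENCE_LENGTH) -> list[int]:
    # Naive tokenizer: lowercase, split punctuation off as its own token.
    tokens = text.lower().replace("!", " !").split()
    ids = [vocab.get(tok, 0) for tok in tokens]
    ids = ids[:seq_len]                # cap: drop trailing words beyond the limit
    ids += [0] * (seq_len - len(ids))  # pad with zeros at the end
    return ids

print(encode("This is a sentence!"))  # → [1, 2, 3, 4, 5, 0]
```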

Build a text classifier

Unlike images, text is a one-dimensional data type, and to make processing text easier, we have now also added 1D-convolution and pooling capabilities to the model builder.

When using text as the input feature, make sure to add the new Text embedding (beta) block before starting with the convolution filters. On the Text embedding (beta) block, you define how your input text is tokenized before feeding it to the neural network. You can use a randomly initialized embedding or one of the FastText CC-licensed pre-trained embeddings.
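To build intuition for what the embedding and 1D-convolution blocks compute, here is a plain-Python sketch (not the platform’s implementation): each token id is looked up in an embedding table, one convolution filter slides along the sequence, and global max pooling keeps the strongest activation. All sizes and weights below are made up for the example:

```python
import random

random.seed(0)
EMBED_DIM, KERNEL_SIZE = 4, 2

# Randomly initialized embedding table: token id -> vector of EMBED_DIM floats.
embedding = {i: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for i in range(10)}
# One convolution filter spanning KERNEL_SIZE tokens across all embedding dims.
kernel = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(KERNEL_SIZE)]

def conv1d_max_pool(token_ids):
    """Embed tokens, slide the filter along the sequence, global-max-pool."""
    seq = [embedding[t] for t in token_ids]
    activations = []
    for start in range(len(seq) - KERNEL_SIZE + 1):
        window = seq[start:start + KERNEL_SIZE]
        activations.append(sum(w * x
                               for krow, xrow in zip(kernel, window)
                               for w, x in zip(krow, xrow)))
    return max(activations)  # global max pooling over all window positions

feature = conv1d_max_pool([1, 2, 3, 4, 5, 0])
```

A real model would learn many such filters and feed the pooled features into dense layers; this sketch only shows the shape of the computation.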

Model building mode!

Testing the text classifier

To make it easier to quickly test your model on new sentences, we have provided a Text Classifier web app. Define your deployment URL, token and input parameter, enter a new text and press ▶️ to predict!

Testing over cURL works as for any other deployment. Remember, help is always available in our Knowledge center!
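For example, a prediction request can be assembled like this in Python. The URL, token, input parameter name and JSON body shape below are placeholders for illustration; the exact request format for your deployment is shown in the Knowledge center:

```python
import json

# Placeholder values: your real deployment URL, token and input parameter
# name come from the platform's Deployment view.
DEPLOYMENT_URL = "https://example.com/deployment/endpoint"
TOKEN = "your-deployment-token"
INPUT_PARAM = "review"

def build_request(text):
    """Assemble the headers and JSON body that a cURL call would send."""
    headers = {
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    }
    # Assumed body shape: one row per sample, keyed by the input parameter.
    body = json.dumps({"rows": [{INPUT_PARAM: text}]})
    return headers, body

headers, body = build_request("This is a sentence!")
```

The same headers and body map directly onto cURL’s `-H` and `-d` flags when you test from the command line.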

Not yet a Community member? Sign up and get started here

  • Ele-Kaja Gildemann

    Product Owner

    Ele-Kaja Gildemann is a Product Owner at Peltarion. She has a degree in computer science from Tallinn University of Technology and more than 15 years of experience in sectors as diverse as digital services, telecom and retail. She is passionate about data-driven product development, user experience and machine learning.
