Text encoding

Text encoding lets you use text written in natural language, i.e., one or more sentences that would normally be spoken or written.

Input block features that are encoded as Text can be connected to different blocks, depending on whether you want to use BERT, XLM-R, or the Universal sentence encoder as your Natural Language Processing (NLP) model.

How long should text features be?

If a feature contains textual keywords or tags, e.g., high, medium, beach, mountain, then the Categorical encoding is likely a better encoding choice.

Text encoding is most powerful when processing complete sentences, including grammatical constructions.
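As a rough illustration of this distinction (not a platform feature, just a heuristic sketch with a hypothetical `suggest_encoding` helper), tag-like features tend to have a small vocabulary of short values, while free text has longer, more varied values:

```python
def suggest_encoding(values, max_unique=20, max_words=3):
    """Heuristic sketch: short, repetitive values look categorical,
    while longer free-form values look like natural-language text."""
    unique_values = set(values)
    avg_words = sum(len(v.split()) for v in values) / len(values)
    if len(unique_values) <= max_unique and avg_words <= max_words:
        return "Categorical"
    return "Text"

tags = ["high", "medium", "beach", "mountain", "high", "beach"]
reviews = [
    "The hotel was close to the beach and very quiet.",
    "Breakfast was disappointing, but the staff were friendly.",
]
print(suggest_encoding(tags))     # Categorical
print(suggest_encoding(reviews))  # Text
```

The thresholds (`max_unique`, `max_words`) are arbitrary illustration values; any real decision should be based on inspecting the feature's actual contents.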

Different types of text processing models have different upper limits on the text length:

  • Models using one of the BERT encoders support up to 512 tokens, roughly 300 to 500 words.
    You can set the exact Sequence length you want in the BERT Tokenizer block when you design your model.

  • Models using the Universal sentence encoder don’t have a limit on text feature length.
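To get a feel for the 512-token limit, here is a minimal sketch that approximates token counts by whitespace splitting and truncates longer inputs. Note that real BERT tokenizers split text into subword units, so actual token counts are usually higher than the word count:

```python
MAX_TOKENS = 512  # upper limit for the BERT encoders

def truncate_to_limit(text, max_tokens=MAX_TOKENS):
    """Approximate tokens by whitespace-separated words.
    A real BERT tokenizer uses subword units, so the true
    token count is usually higher than this estimate."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens])

long_text = "word " * 600            # 600 words, over the limit
truncated = truncate_to_limit(long_text)
print(len(truncated.split()))        # 512
```

In practice, the Sequence length setting in the BERT Tokenizer block controls this cutoff; the sketch above only shows why text beyond the limit does not reach the model.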

What languages are supported?

The languages supported depend on the type of natural language processing model that you use.

Working with many languages

Multilingual models like the Multilingual BERT cased and USE Embedding snippets work with any of their respective supported languages.

This means that you can:

  • Mix examples from different languages in the same training dataset.

  • Fine-tune your models with data in languages that are readily available, then use them for predictions in any other supported language.
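For example, a multilingual sentiment model can be trained on a dataset that mixes languages freely and then be asked to predict on a language it never saw during fine-tuning. A minimal sketch (the sentences and labels below are made up for illustration):

```python
# Hypothetical training set mixing English and French sentiment examples.
training_data = [
    ("This movie was fantastic!", "positive"),
    ("Ce film était fantastique !", "positive"),
    ("The plot made no sense.", "negative"),
    ("L'intrigue n'avait aucun sens.", "negative"),
]

# A multilingual model fine-tuned on this mix could then be asked to
# predict on, say, a German sentence absent from the training data:
new_example = "Der Film war großartig."

labels = {label for _, label in training_data}
print(sorted(labels))  # ['negative', 'positive']
```

The prediction step itself depends on the model you deploy; the point here is only that the training examples do not all need to share one language.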

Multilingual training dataset and predictions
Figure 1. Example of sentiment classification. The training data combines examples from English and French, which are readily available. The model predicts the sentiment of a sentence in any supported language.