Text encoding

Text encoding lets you work with text written in natural language, i.e., one or more sentences as they would normally be spoken or written.

Input block features encoded as Text can be connected to different language blocks, depending on whether you want to use BERT, Sentence XLM-R, or the Universal sentence encoder as your natural language processing (NLP) model.

How long should text features be?

If a feature contains textual keywords or tags, e.g., high, medium, beach, mountain, then the Categorical encoding is likely a better encoding choice.
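To make the distinction concrete, here is a small, hypothetical heuristic (not part of the platform) that illustrates the rule of thumb: a feature whose values are a handful of short keywords is a better fit for Categorical encoding, while free-form sentences call for Text encoding. The thresholds are illustrative assumptions.

```python
def suggest_encoding(values, max_unique=20, max_words=3):
    """Illustrative heuristic: 'Categorical' for short keyword-like
    values, 'Text' for free-form sentences. Thresholds are assumptions."""
    unique = set(values)
    # Few distinct values, each only a word or two -> likely tags/keywords.
    if len(unique) <= max_unique and all(len(v.split()) <= max_words for v in unique):
        return "Categorical"
    return "Text"

print(suggest_encoding(["high", "medium", "low", "medium"]))
print(suggest_encoding([
    "The beach was lovely.",
    "We hiked all day in the mountains and it rained.",
]))
```

In practice you would simply inspect the feature yourself; the sketch only mirrors the guidance above.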

Text encoding is most powerful when processing complete sentences, including grammatical constructions.

Different language processing models have different upper limits on the text length that they can process:

  • Models using one of the BERT blocks support up to 512 tokens, roughly 300 to 500 words.

  • Models using the Universal sentence encoder don’t have a limit on text feature length.

You can set the exact Sequence length you want to use in the block parameters when you design your model.
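The effect of a fixed Sequence length can be sketched as follows. This is a minimal illustration using a toy whitespace tokenizer and a made-up `[PAD]` token; real models such as BERT use subword tokenizers, so token counts will differ from word counts.

```python
PAD = "[PAD]"  # placeholder padding token for this sketch

def to_fixed_length(text, seq_len):
    """Truncate or pad a token sequence to exactly seq_len tokens."""
    tokens = text.split()[:seq_len]            # truncate if too long
    tokens += [PAD] * (seq_len - len(tokens))  # pad if too short
    return tokens

print(to_fixed_length("text encoding handles whole sentences", 8))
```

Text longer than the chosen sequence length is cut off, so pick a length that covers your typical examples.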

What languages are supported?

The languages supported depend on the type of natural language processing model that you use.

Working with many languages

Multilingual models like the Multilingual BERT and Universal sentence encoder work with any of the languages they support.

This means that you can:

  • Mix examples from different languages in the same training dataset.

  • Fine-tune your models with data in languages that are easily available, then use them for predictions in any other supported language.

Figure 1. Example of sentiment classification. The training data combines English and French examples, which are easily available; the model predicts the sentiment of a sentence in any supported language.