Universal sentence encoder

The multilingual Universal sentence encoder is a model that processes text in 16 languages and produces embeddings suited to semantic text similarity tasks.

The embeddings work best with long text features; keywords or short sentences are better encoded as Categorical features.

  • The Universal sentence encoder is fine-tuned for text similarity, allowing you to deploy the model without requiring any training.

  • The Universal sentence encoder block runs faster than the Sentence XLM-R block, especially for longer text.

  • The Universal sentence encoder supports only 16 languages, fewer than the 100 languages supported by the Sentence XLM-R block.

Note
Disclaimer
Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

Using the Universal sentence encoder

Use cases

The Universal sentence encoder is a good model for processing naturally written text in many languages, like the Multilingual BERT block.
However, the Universal sentence encoder was pretrained more specifically for sentence embedding, making it a better choice for text similarity tasks out of the box. Text similarity tasks can also be performed by the slower but more powerful Sentence XLM-R block.
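
This out-of-the-box behaviour can be sketched outside the Platform with the publicly released TensorFlow Hub model. The model URL and library calls below are based on Google's public release and are not part of the Platform; treat this as a minimal sketch rather than a reference implementation.

```python
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the SentencePiece ops the model needs

# Load the multilingual Universal Sentence Encoder released by Google.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

# Sentences in different languages can be embedded in the same batch.
sentences = [
    "How old are you?",       # English
    "Quel âge as-tu ?",       # French
    "¿Cuántos años tienes?",  # Spanish
]
embeddings = embed(sentences)  # shape: (3, 512)
```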

Input

The input of the Universal sentence encoder must come directly from an Input block that provides a feature encoded as text.

The Universal sentence encoder supports 16 languages that can be mixed freely.

Output

The Universal sentence encoder returns a 512-component embedding that encodes the input text. You can use this output to compare different text examples, e.g., for text similarity and text clustering tasks.
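
A common way to compare two of these embeddings is cosine similarity. Continuing the sketch above (NumPy is assumed available), semantically equivalent sentences score close to 1 even across languages:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vectors = np.asarray(embeddings)  # (3, 512) array from the sketch above

print(cosine_similarity(vectors[0], vectors[1]))  # English vs. French
print(cosine_similarity(vectors[0], vectors[2]))  # English vs. Spanish
```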

Parameters

Trainable: Whether the training algorithm may change the weights of the block during training. In some cases, you will want to keep parts of the network static.

Since the Universal sentence encoder block is pretrained to output values that are directly usable for text similarity, Trainable is deactivated by default.

Training the Universal sentence encoder

The Universal sentence encoder block is initialized with weights pretrained on the Stanford Natural Language Inference (SNLI) corpus.

This means that you don’t have to fine-tune the Universal sentence encoder for similarity tasks. Simply make sure that all the blocks in your model graph have the Trainable setting unchecked to skip training and save time.
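
Outside the Platform, leaving Trainable unchecked corresponds to freezing the pretrained weights. A minimal Keras sketch, using the same assumed TensorFlow Hub model as above:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops required by the model

# Wrap the pretrained encoder as a frozen Keras layer (Trainable unchecked).
encoder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3",
    trainable=False,  # weights stay fixed; no fine-tuning takes place
)

inputs = tf.keras.Input(shape=(), dtype=tf.string)  # one text feature per example
embeddings = encoder(inputs)  # (batch, 512) sentence embeddings
model = tf.keras.Model(inputs, embeddings)
```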

Languages

Why use a multilingual model?

A multilingual model allows you to deploy a single model able to work with any of the 16 supported languages.

Figure 1. Multilingual training dataset and predictions: an example of sentiment classification. The training data combines English and French examples, which are readily available; the model then predicts the sentiment of a sentence in any supported language.

More than a simple convenience, multilingual models often perform better than monolingual ones.
One reason is that the amount of training data available in any single language is generally limited. In addition, many languages share common patterns that the model can pick up more easily when it is trained on a variety of languages.

Languages supported

Unlike the Sentence XLM-R block, which supports 100 languages, the Universal sentence encoder block supports only 16 languages.

The Universal sentence encoder supports the following languages: Arabic, Chinese-simplified, Chinese-traditional, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai, and Turkish.

Available weights

The Universal sentence encoder block uses the universal-sentence-encoder-multilingual model with weights released by Google.

The weights are pretrained on SNLI.

Terms

When using pretrained blocks, additional terms apply: see the Universal sentence encoder weights licence.
