Universal sentence encoder

The multilingual Universal sentence encoder is a model that processes text in 16 languages and produces embeddings suitable for semantic text similarity tasks.
The block works best with long text features; keywords or short sentences are better encoded as Categorical features.

  • The Universal sentence encoder block trains and runs faster than the Multilingual BERT encoder block, especially for longer text.

  • The Universal sentence encoder is fine-tuned for text similarity, allowing you to deploy the model without requiring any training.

  • The Universal sentence encoder supports only 16 languages, fewer than the Multilingual BERT encoder block.

Note
Disclaimer
Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

Using the Universal sentence encoder

Input

The input of the Universal sentence encoder must come directly from an Input block.

The input feature must be text, written in natural language, using the Text encoding.

The Universal sentence encoder supports 16 languages that can be mixed freely.

Output

The Universal sentence encoder returns a 512-component embedding that encodes the input text. You can use this output to compare different text examples, for example for text similarity and text clustering tasks.
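Below is a minimal sketch (not Platform code) of how the 512-component embeddings can be compared once you have them, for example from a deployed model. The variable names and placeholder vectors are illustrative only; cosine similarity is a common choice for text similarity with this encoder.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 512-component embeddings standing in for real model output.
embedding_a = np.random.rand(512)
embedding_b = np.random.rand(512)

# Values close to 1.0 indicate semantically similar texts.
print(cosine_similarity(embedding_a, embedding_b))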

Training

The Universal sentence encoder block is initialized with weights pretrained on the Stanford Natural Language Inference (SNLI) corpus.

This means that you don’t have to fine-tune the Universal sentence encoder for similarity tasks. Simply make sure that all the blocks in your model graph have the Trainable setting unchecked to skip training and save time.

Languages supported

Unlike the Multilingual BERT encoder block, which supports over 100 languages, the Universal sentence encoder block supports only 16 languages.

The Universal sentence encoder supports these languages:

Arabic, French, Korean, Spanish, Chinese-simplified, German, Dutch, Thai, Chinese-traditional, Italian, Polish, Turkish, English, Japanese, Portuguese, and Russian.

Parameters

Trainable: Whether or not the block weights are updated during training. Since the Universal sentence encoder block is pretrained to output values that are directly usable for text similarity, Trainable is deactivated by default.

Available weights

The Universal sentence encoder block uses the universal-sentence-encoder-multilingual model with weights released by Google.

The weights are pretrained on SNLI.
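For reference, here is a minimal sketch of loading the same pretrained weights directly from TensorFlow Hub, outside the Platform. It assumes the tensorflow, tensorflow_hub, and tensorflow_text packages are installed; the module URL is Google's published universal-sentence-encoder-multilingual model.

import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the SentencePiece ops the model needs

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

# Languages can be mixed freely in a single batch.
sentences = [
    "How old are you?",        # English
    "¿Cuántos años tienes?",   # Spanish
    "Quel âge as-tu ?",        # French
]
embeddings = encoder(sentences)  # shape: (3, 512)
print(embeddings.shape)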

Terms

When using pretrained blocks, additional terms apply: Universal sentence encoder weights licence.

