Universal sentence encoder
The multilingual universal sentence encoder is a model that can process text in 16 languages and produce embeddings that are suitable for semantic text similarity tasks.
The embeddings work best with long text features; keywords or short sentences are often better encoded as Categorical features.
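Outside the Platform, the same idea can be sketched with the publicly released multilingual Universal Sentence Encoder on TensorFlow Hub. The module URL, version, and example texts below are illustrative assumptions, not part of the Platform:

```python
# Minimal sketch, assuming the public TF Hub release of the multilingual
# Universal Sentence Encoder (URL and version are assumptions).
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the SentencePiece ops the model needs

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

texts = [
    "The delivery arrived two weeks late and the box was damaged.",
    "My package showed up broken and long after the promised date.",
]
vectors = embed(texts).numpy()  # one 512-value embedding per text

# Cosine similarity between the two embeddings: close to 1 for similar meanings.
a, b = vectors
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
```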
- The Universal sentence encoder is fine-tuned for text similarity, allowing you to deploy the model without requiring any training.
- The Universal sentence encoder block runs faster than the Sentence XLM-R block, especially for longer text.
- The Universal sentence encoder supports only 16 languages, fewer than the 100 languages supported by the Sentence XLM-R block.
Note
Disclaimer: Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.
Using the Universal sentence encoder
Use cases
Like the Multilingual BERT block, the Universal sentence encoder is a good model for processing naturally written text in many languages.
However, the Universal sentence encoder was pretrained more specifically for sentence embedding, making it a better choice for text similarity tasks out of the box.
Text similarity tasks can also be performed by the slower but more powerful Sentence XLM-R block.
Input
The input of the Universal sentence encoder must come directly from an Input block that provides a text-encoded feature.
The Universal sentence encoder supports 16 languages that can be mixed freely.
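Because all supported languages share the same embedding space, texts in different languages can be compared directly. A minimal sketch, using the same assumed TensorFlow Hub module as above:

```python
# Sketch: sentences in two different supported languages land in the same
# embedding space and can be compared directly (assumed TF Hub module).
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

english = embed(["How do I reset my password?"]).numpy()[0]
french = embed(["Comment puis-je réinitialiser mon mot de passe ?"]).numpy()[0]

# High cosine similarity even though the two sentences are in different languages.
print(float(np.dot(english, french) / (np.linalg.norm(english) * np.linalg.norm(french))))
```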
Output
The block outputs a single embedding vector of 512 values that represents the meaning of the whole input text.
Parameters
Trainable: Whether the training algorithm should update the block weights during training. In some cases, you may want to keep parts of the network static.
Since the Universal sentence encoder block is pretrained to output values that are directly usable for text similarity, Trainable is deactivated by default.
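Conceptually, the Trainable setting corresponds to freezing the encoder weights. A minimal sketch of what that looks like outside the Platform, assuming the public TensorFlow Hub module (URL, version, and the downstream head are illustrative assumptions):

```python
# Sketch: trainable=False keeps the pretrained encoder weights frozen, so only
# the layers added on top are trained (assumed TF Hub module, not the Platform).
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401

encoder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3",
    input_shape=[],        # one raw string per example
    dtype=tf.string,
    trainable=False,       # Trainable deactivated (the default for this block)
)

# Hypothetical downstream model: a small classification head on top of the frozen encoder.
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```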
Training the Universal sentence encoder
The Universal sentence encoder block is initialized with weights pretrained on the Stanford Natural Language Inference (SNLI) corpus.
This means that you don’t have to fine-tune the Universal sentence encoder for similarity tasks. Simply make sure that all the blocks in your model graph have the Trainable setting unchecked to skip training and save time.
Languages
Why use a multilingual model?
A multilingual model allows you to deploy a single model able to work with any of the 16 supported languages.
More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that the training data available is generally more limited in any single language.
In addition, many languages share common patterns that the model can pick up more easily when it is trained with a variety of languages.
Languages supported
Unlike the Sentence XLM-R block, which supports 100 languages, the Universal sentence encoder block supports only 16 languages.
The Universal sentence encoder supports these languages:
Arabic, Chinese-simplified, Chinese-traditional, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai, and Turkish.
Available weights
The Universal sentence encoder block uses the universal-sentence-encoder-multilingual model with weights released by Google.
The weights are pretrained on SNLI.
Terms
When using pretrained blocks, additional terms apply: universal sentence encoder with weights licence.
References
- Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Ray Kurzweil. Multilingual Universal Sentence Encoder for Semantic Retrieval. July 2019.
- Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model. Repl4NLP@ACL, July 2019.