Text similarity / cheat sheet
- Target audience: Data scientists and developers
When to use this cheat sheet
Use if your project aims to find similar pieces of text.
What is text similarity?
Text similarity is a way to quantify how similar two pieces of text are.
You do this by converting each piece of text in your dataset to a vector with a deep learning model. These vectors are saved to an index.
When you want to find text similar to a new piece of text, you run the new text through the same model. Then you compare the new text's vector to the vectors of all dataset texts in the index to find the most similar ones.
Text similarity search is fast because each text is reduced to a small number of numerical values.
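The workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: `toy_embed` is a hypothetical stand-in for the deep learning model (a hashed bag-of-words that only captures word overlap, not meaning), and `build_index`, `most_similar`, and `DIM` are likewise names made up for this example.

```python
import hashlib
import math

DIM = 1024  # assumed toy vector size, chosen only for this sketch


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: hashed bag-of-words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # A stable hash, so the same word always lands in the same bucket.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    # L2-normalize so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(texts):
    """Embed every dataset text once and keep the vectors."""
    return [(text, toy_embed(text)) for text in texts]


def most_similar(query, index, top_k=3):
    """Embed the query with the same model and rank texts by cosine similarity."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


dataset = [
    "how do i reset my password",
    "the weather is sunny today",
    "best pizza places in town",
]
index = build_index(dataset)
results = most_similar("reset my password please", index, top_k=1)
print(results)
```

With this toy data, the password question should rank first because it shares the most words with the query. A real sentence encoder would also match texts that use different words with a similar meaning, which this stand-in cannot do.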
Use the Experiment wizard to build your experiment. Snippets are pre-built neural network architectures available on the platform.
In the Snippet tab, select Text similarity in the Problem type drop-down menu. Then select the USE Embedding snippet if your data is in one of the 16 languages supported by USE (Universal Sentence Encoder); otherwise, use the XLM-R Embedding snippet.
The Universal Sentence Encoder snippet will now show up on the Modeling canvas.
Everything is preset here. Click Run to start the training.
Since experimenting is so easy on the platform, you could now create and run a second experiment that uses the XLM-R Embedding snippet instead. Then you can see which snippet gives the best result on your dataset.
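One possible way to compare two embedding models on your own data is to collect a handful of queries for which you already know the expected match, and count how often each model ranks that match first. The sketch below is an assumption about how such a check could look, not a platform feature: `embed_words` and `embed_trigrams` are hypothetical stand-ins for the two models, and `hit_rate`, the sample texts, and the labeled queries are invented for illustration.

```python
import hashlib
import math

DIM = 1024  # assumed toy vector size, chosen only for this sketch


def _bucket(token):
    # A stable hash, so the same token always lands in the same bucket.
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM


def _normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_words(text):
    """Hypothetical model A: hashed whole words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[_bucket(word)] += 1.0
    return _normalize(vec)


def embed_trigrams(text):
    """Hypothetical model B: hashed character trigrams."""
    vec = [0.0] * DIM
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        vec[_bucket(lowered[i:i + 3])] += 1.0
    return _normalize(vec)


def hit_rate(embed, dataset, labeled_queries):
    """Fraction of queries whose top-ranked dataset text is the expected one."""
    index = [(text, embed(text)) for text in dataset]
    hits = 0
    for query, expected in labeled_queries:
        q = embed(query)
        best_text, _ = max(
            index, key=lambda item: sum(a * b for a, b in zip(q, item[1]))
        )
        if best_text == expected:
            hits += 1
    return hits / len(labeled_queries)


dataset = [
    "how do i reset my password",
    "the weather is sunny today",
    "best pizza places in town",
]
labeled_queries = [
    ("reset password", "how do i reset my password"),
    ("pizza in town", "best pizza places in town"),
]
rates = {
    "words": hit_rate(embed_words, dataset, labeled_queries),
    "trigrams": hit_rate(embed_trigrams, dataset, labeled_queries),
}
print(rates)
```

The higher the hit rate, the more often that model puts the expected text on top. A few dozen labeled queries taken from your real use case will say more than a large generic benchmark.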
XLM-R is a multilingual model that can be used with data in any of the 100 languages it supports.
Skip the Evaluation view. It is not needed for a quick implementation of text similarity.
Create new deployment
In the Deployment view, click New deployment and select Similarity search. Select:
- Epoch 0 as Checkpoint
- Sentence embedding as Output feature
Finally, click Create. Now all the text pieces in the dataset pass through the model once, and the platform builds the index.
Click Enable to make the deployment available for REST API calls.