Text similarity / cheat sheet
- Target audience: Data scientists and developers
When to use this cheat sheet
Use if your project aims to find similar pieces of text.
What is text similarity?
Text similarity is a way to quantify how similar two pieces of text are.
- You do this by converting each piece of text in your dataset to a vector with a deep learning model. These vectors are saved to an index.
- When you want to find text similar to a new piece of text, you run the new text through the same model. Then you compare the new text's vector to the vectors of all dataset texts in the index to find the most similar ones.
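
The two steps above can be sketched in plain Python. This is only an illustration, not the platform's implementation: `embed` is a hypothetical toy encoder (hashed character trigrams) standing in for the deep learning model, and the vector size `DIM`, the function names, and the sample texts are all assumptions made for the example.

```python
import math
import zlib

DIM = 1024  # assumed toy vector size, chosen only to keep hash collisions rare


def embed(text: str) -> list[float]:
    """Hypothetical stand-in for the deep learning model: hashed character trigrams."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 for vectors pointing the same way, 0.0 for no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(texts: list[str]) -> list[tuple[str, list[float]]]:
    """Step 1: convert every dataset text to a vector and store it."""
    return [(text, embed(text)) for text in texts]


def most_similar(query: str, index: list[tuple[str, list[float]]], top_k: int = 3):
    """Step 2: embed the new text with the same model, then rank all indexed texts."""
    q = embed(query)
    scored = [(cosine(q, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:top_k]


dataset = [
    "How do I reset my password?",
    "Where can I download my invoice?",
    "The app crashes when I upload a photo.",
]
index = build_index(dataset)
results = most_similar("I forgot my password, how can I reset it?", index, top_k=1)
exact = most_similar(dataset[1], index, top_k=1)  # querying with an indexed text
print(results)
```

A real sentence encoder captures meaning rather than shared character sequences, so it can match paraphrases that have no words in common. The indexing and ranking steps stay the same.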

Text similarity search is fast because each text is reduced to a small number of numerical values.
Data preparation
Import your data in the Datasets view: directly from your local computer, via the Data API, from a URL, or from your data warehouse.
Nothing special needs to be done for this problem type in the Datasets view. Just click Save dataset and then Use in new experiment.

Modeling
Use the Experiment wizard to build your experiment. Snippets are pre-built neural network architectures available on the platform.
In the Snippet tab, select Text similarity in the Problem type drop-down menu. Then select the USE Embedding snippet if your data is in one of the 16 languages supported by USE (Universal Sentence Encoder). Otherwise, use the XLM-R Embedding snippet.
The Universal Sentence Encoder snippet will now show up on the Modeling canvas.
Everything is preset here. Click Run to start the training.

Concurrent experiment
Since experimenting is so easy on the platform, you could now create and run a second experiment that uses the XLM-R Embedding snippet instead. Then you can see which one gives the better result on your dataset.
XLM-R is a multilingual model that can be used with data in any of the 100 languages it supports.
Evaluation
Skip the Evaluation view. It is not needed for a quick implementation of text similarity.
Deployment
Create new deployment
In the Deployment view, click New deployment and select Similarity search. Select:
- Your experiment
- Epoch 0 for Checkpoint
- Sentence embedding as Output feature
Finally, click Create. Now all the text pieces in the dataset pass through the model once, and the platform builds the index.
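
As a rough sketch of what "pass through the model once" means, the example below builds an index at creation time and reuses the stored vectors for every query. It is not the platform's code: `SimilarityIndex`, `letter_counts` (a toy encoder that counts letters), and `recording_encoder` are hypothetical names introduced only for illustration.

```python
import math
from typing import Callable

Vector = list[float]


def letter_counts(text: str) -> Vector:
    """Hypothetical toy encoder: counts of the letters a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def unit(vec: Vector) -> Vector:
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class SimilarityIndex:
    def __init__(self, texts: list[str], encoder: Callable[[str], Vector]):
        self.encoder = encoder
        self.texts = list(texts)
        # Build step: every dataset text passes through the encoder exactly once.
        self.vectors = [unit(encoder(t)) for t in self.texts]

    def search(self, query: str, top_k: int = 1) -> list[tuple[float, str]]:
        # Query step: only the new text is encoded; stored vectors are reused.
        q = unit(self.encoder(query))
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        return sorted(zip(scores, self.texts), reverse=True)[:top_k]


encoded: list[str] = []  # records every text sent to the encoder


def recording_encoder(text: str) -> Vector:
    encoded.append(text)
    return letter_counts(text)


index = SimilarityIndex(["red apple", "green pear", "blue sky"], recording_encoder)
calls_after_build = len(encoded)   # one encoder call per dataset text
hits = index.search("apple red")   # same letters as "red apple"
calls_after_query = len(encoded)   # one additional call, for the query only
```

Storing unit-length vectors at build time is one common design choice: each comparison at query time then needs only a dot product.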
