Text similarity / cheat sheet

Target audience: Data scientists and developers

When to use this cheat sheet

Use this cheat sheet if your project needs to find similar pieces of text.

What is text similarity?

Text similarity is a way to quantify how similar two pieces of text are.

  1. You do this by converting each piece of text in your dataset to a vector with a deep learning model. These vectors are saved to an index.

  2. When you want to find text similar to a new piece of text, you run the new text through the same model. Then you compare the new text's vector to the vectors of all dataset texts in the index to find the most similar texts.
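
The two steps above can be sketched in a few lines of Python. This is only an illustration, not how the platform works internally: the `embed` function is a toy word-count stand-in for the deep learning model, cosine similarity is assumed as the comparison measure, and every function name here is hypothetical.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy stand-in for a deep learning model: a word-count vector."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(texts, vocabulary):
    """Step 1: convert every dataset text to a vector and store it."""
    return [(text, embed(text, vocabulary)) for text in texts]


def most_similar(query, index, vocabulary, top_k=2):
    """Step 2: embed the new text and rank the indexed texts against it."""
    query_vector = embed(query, vocabulary)
    scored = [(cosine_similarity(query_vector, vector), text)
              for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


dataset = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat sat on a sofa",
]
vocabulary = sorted({word for text in dataset for word in text.split()})
index = build_index(dataset, vocabulary)
results = most_similar("the cat sat", index, vocabulary, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With a real embedding model, only `embed` changes. The index and the ranking step stay the same, which is why the same deployment can serve any new query text.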

Text similarity search is fast because each text is reduced to a small number of numerical values.


Data preparation

Import your data in the Datasets view: directly from your local computer, via the Data API, from a URL, or from your data warehouse.

This problem type needs no special preparation in the Datasets view. Click Save dataset and then Use in new experiment.

Modeling

Use the Experiment wizard to build your experiment. Snippets are pre-built neural network architectures available on the platform.

In the Snippet tab, select Text similarity in the Problem type drop-down menu. Then select the USE Embedding snippet if your data is in one of the 16 languages supported by USE (Universal Sentence Encoder); otherwise, use the XLM-R Embedding snippet.

The Universal Sentence Encoder snippet will now show up on the Modeling canvas.

Everything is preset here. Click Run to start the training.

Concurrent experiment

Because experiments are quick to set up on the platform, you can create and run a second experiment now that uses the XLM-R Embedding snippet instead. Then compare which of the two gives the better result on your dataset.

XLM-R is a multilingual model that can be used with data in any of its 100 supported languages.


Evaluation

Skip the Evaluation view. It is not needed for a quick implementation of text similarity.


Deployment

Create new deployment

In the Deployment view, click New deployment and select Similarity search. Select:

  • Your experiment

  • Epoch 0 for Checkpoint

  • Sentence embedding as Output feature.

Finally, click Create. All the text pieces in the dataset then pass through the model once, and the platform builds the index.

Figure 1. Click Create in the New deployment pop-up to build the index.

Enable deployment

Click Enable to make the deployment available for REST API calls.
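
As a rough sketch of what a client call could look like, the snippet below builds (but does not send) an HTTP POST request with Python's standard library. Everything specific in it is an assumption for illustration: the URL, the token, the bearer-token header, the feature name `text`, and the `rows` payload shape are placeholders, not the platform's documented schema. Use the URL, token, and request format shown for your own deployment instead.

```python
import json
import urllib.request


def build_similarity_request(url, token, text, feature_name="text"):
    """Build a POST request for a deployed similarity model without sending it.

    Hypothetical helper: the payload layout and the bearer-token header
    are assumed placeholders, not a documented API contract.
    """
    payload = {"rows": [{feature_name: text}]}  # assumed payload shape
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_similarity_request(
    "https://example.com/deployment/placeholder",  # placeholder URL
    "YOUR_TOKEN",                                  # placeholder token
    "the cat sat",
)
print(request.get_method(), request.full_url)
# To actually send it once the real values are filled in:
# with urllib.request.urlopen(request) as response:
#     similar_texts = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.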
