USE - Text similarity / cheat sheet

Target audience: Data scientists and developers

The universal sentence encoder, or USE for short, is a text processing model that can encode sentences from 16 different languages. What’s more, the version offered on the Peltarion Platform is pre-trained for sentence similarity tasks.

What does it mean?

It means that the USE can read a sentence and produce numerical values (an encoding) that are close to the values produced for sentences with a similar meaning, allowing easy comparison. And we’re talking similar as in they really have the same meaning, not just matching keywords.
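To make that concrete, here’s a toy sketch with made-up 3-dimensional vectors (real USE encodings are much longer); the sentences and values are invented for illustration only:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 for identical directions, near 0 for unrelated ones
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up encodings for three sentences
goal_a = np.array([0.8, 0.1, 0.2])  # 'The leader must be focused on the goal.'
goal_b = np.array([0.7, 0.2, 0.2])  # 'A chief should concentrate on objectives.'
other  = np.array([0.1, 0.9, 0.1])  # 'I love strawberry ice cream.'

print(cosine_similarity(goal_a, goal_b))  # high score: similar meaning
print(cosine_similarity(goal_a, other))   # low score: unrelated meaning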

This is super practical if you have large amounts of text data that you need to search through, like emails, reviews, abstracts, famous quotes, etc.

It works across 16 languages and you don’t need to train it.

How to use it?

To search through piles of text with the speed and accuracy of a mountain goat, just follow these steps:

  1. Build and deploy a USE model on the Peltarion Platform

  2. Use the model to index all the text data you want to be able to search

  3. Run a search request

We’ll see how to do steps 2 and 3 in Python until the platform does it for you.

Build a USE model

Log in to the Peltarion Platform, and create a new project.

We’ll need some data to help define what the model should do, but we won’t train the model so the actual data doesn’t really matter.

Open the Data library and import a text dataset, for example the Multilingual book corpus.

Create a new experiment, and build a model using the universal sentence encoder.

Model based on the universal sentence encoder
Figure 1. Model using the universal sentence encoder. The Target block would only be used for training; when queried, the model returns the value of the Output block.

Make sure to:

  • Use a text encoded feature as input

  • Add an output block after the USE block

  • The target feature doesn’t matter, but the model graph should be valid

  • Deselect Use for predictions in the Target block settings

  • Deselect Trainable for all blocks (the platform will warn that no block is trainable)

Index your text data

What we’ll do next is index the data that you want to search through. It might take a bit of time depending on how much data you have, but the upside is that you only do it once.
For the developers out there, it’s like building a hash table, but with deep learning (see the sketch after the list below).

The index will associate two parts:

  • The original sentence from your data (or a link to find it)

  • The USE encoding of that sentence
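Here is a minimal sketch of that association, pairing each sentence with a random stand-in vector in place of its real USE encoding (512 values per sentence):

import numpy as np

# Toy index: maps each sentence to its encoding.
# Random vectors stand in for the real USE encodings here.
quotes = ['The leader must be focused on the goal.',
          'I love strawberry ice cream.']
index = {quote: np.random.rand(512) for quote in quotes}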

Gather your text data

To build an index, we first need to list all the sentences that we want to index. pandas is a nice way to work with big tables in Python, so let’s build a pandas dataframe from a list of sentences, and save it for later use:

import pandas as pd

# list_of_quotes and list_of_authors hold your own text data,
# e.g., read from files or a database.
sentences = pd.DataFrame()
sentences['quote'] = list_of_quotes
sentences['author'] = list_of_authors
sentences.head()

sentences.to_csv('my_index_of_quotes.csv')

>
	quote	                                            author
0	I'm selfish, impatient and a little insecure. ...	Marilyn Monroe
1	You've gotta dance like there's nobody watchin...	William W. Purkey
2	You know you're in love when you can't fall as...	Dr. Seuss
3	A friend is someone who knows all about you an...	Elbert Hubbard
4	Darkness cannot drive out darkness: only light...	Martin Luther King Jr., A Testament of Hope: T...

Encode the text data

Now let’s use sidekick to get the encodings from the deployed model. You can install sidekick with

pip install git+https://github.com/Peltarion/sidekick#egg=sidekick

We’ll need to copy the URL and token from the deployment’s API information into our Python script. We’ll also save the encodings for later comparison.

import sidekick
import numpy as np

client = sidekick.Deployment(url=<URL>, token=<token>)
payload = [{'sentence': s} for s in sentences['quote']]
# predict_many returns one result per sentence; collect every embedding
predictions = client.predict_many(payload)
encodings = np.array([p['Sentence embedding'] for p in predictions])
np.save('encodings.npy', encodings)

Search a sentence

Finally, we can search our index for sentences similar to any new request.
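If you come back in a later session, you can reload the index saved in the previous steps instead of re-encoding anything:

import numpy as np
import pandas as pd

# Reload the index built during the indexing step
sentences = pd.read_csv('my_index_of_quotes.csv')
encodings = np.load('encodings.npy')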

Encode the request

We’ll use sidekick again to get the encoding of a new sentence:

new_sentence = "le chef doit se concentrer sur l'objectif"
new_embedding = client.predict_many([{'sentence': new_sentence}])[0]['Sentence embedding']

Compare encodings

Then we can quickly compare the new encoding with all of the encodings in the index using cosine similarity.

def similarity_score(encodings, new_encoding):
    # Normalize both sides so the dot product equals the cosine similarity
    encodings = encodings / np.linalg.norm(encodings, axis=1, keepdims=True)
    new_encoding = new_encoding / np.linalg.norm(new_encoding)
    scores = np.matmul(encodings, new_encoding.transpose())
    return scores
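With the names defined above, scoring the whole index against the new sentence is a single call:

scores = similarity_score(encodings, np.array(new_embedding))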

View the most similar results

We can view the most similar results by picking sentences with the highest similarity in the index.

print('Most similar quotes:')
top_idx = scores.argsort()[-3:][::-1]  # indices of the 3 highest scores
for idx in top_idx:
    print(f"{scores[idx]:.4f} \t {sentences['quote'][idx]}")

>
Most similar quotes:
0.5849	The leader must be focused on the goal.
0.5429	Principles keeps a man focused and head straight to achieve a set goal
0.5300	Focus is not a state of single-mindedly pursuing one goal or objective without getting distracted. Rather, it’s a commitment to exert all concentration and effort through constant adjustment like the lens of the human eye on a vision or purpose