
Search text by semantic similarity

March 22/3 min read
  • Romain Futrzynski, Senior Application Engineer

Images and videos may take up a lot of space on the Internet, but with 300 billion emails, half a billion tweets, and over 3 billion Google searches made each day, text is still a big player in digital life.
Let's see how deep learning can help you navigate through endless lines of prose.

02/ Crash course in NLP

Natural Language Processing (NLP) is the field of data science that deals with handling text written in a natural way, as opposed to text written in a programming language or in a SQL query.

NLP starts with the tokenization and embedding of text into high-dimensional numerical vectors. Modern tokenization creates units of information much more efficiently than a word-by-word or character-by-character split, and it can often be applied across many languages and alphabets. Text embedding then converts these tokens into vectors with a remarkable property: mathematical operations on the vectors reflect the natural meaning of the text they represent.
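To make the idea of "math that reflects meaning" concrete, here is a minimal sketch with hand-made vectors. The four dimensions and every number in `toy_embeddings` are invented for illustration; real embeddings are learned by a model and have hundreds of dimensions that are not individually interpretable.

```python
import numpy as np

# Hand-made toy vectors, invented for illustration only (not learned embeddings).
# Dimensions: (royalty, male, female, fruit)
toy_embeddings = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}


def nearest_word(vector, embeddings, exclude=()):
    """Return the word whose vector points in the direction closest to `vector`."""
    best_word, best_score = None, -np.inf
    for word, candidate in embeddings.items():
        if word in exclude:
            continue
        # Cosine of the angle between the two vectors.
        score = np.dot(vector, candidate) / (
            np.linalg.norm(vector) * np.linalg.norm(candidate)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# "king" minus "man" plus "woman" lands on the toy vector for "queen".
result = toy_embeddings["king"] - toy_embeddings["man"] + toy_embeddings["woman"]
answer = nearest_word(result, toy_embeddings, exclude=("king", "man", "woman"))
print(answer)
```

With learned embeddings the result of such arithmetic is only approximately equal to the expected word vector, which is why the lookup searches for the nearest candidate rather than an exact match.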

An NLP model can use these embeddings to solve a task, like identifying the mood of the text, its genre, or its similarity with another piece of text. The state-of-the-art models are transformers that use attention, like BERT and XLM-R, but other models like the Universal Sentence Encoder (USE) can get satisfying results with shorter computation times.

03/ What is text similarity?

When text is converted into embedding vectors, an NLP model can perform numerical operations on them to find out various kinds of information. In the case of text similarity, the model is trained not to output a category or a score, but a new embedding vector that encodes the entire input text.

Full models like the USE Embedding snippet and the XLM-R Embedding snippet (SBERT-nli-stsb) are pretrained specifically so that when we compare two of their sentence embeddings by cosine similarity, the result indicates how similar the two input sentences are.

A model encodes natural text as a high-dimensional vector of values. Calculating the cosine similarity between these vectors gives the semantic similarity between different texts.
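The cosine similarity of two vectors u and v is their dot product divided by the product of their lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A minimal NumPy sketch follows; the function name and the example vectors are illustrative assumptions, not output from any actual model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up 3-dimensional "sentence embeddings" for illustration.
sentence_a = [0.9, 0.1, 0.3]
sentence_b = [0.8, 0.2, 0.4]   # points in nearly the same direction as sentence_a
sentence_c = [-0.1, 0.9, -0.2]  # points in a very different direction

print(cosine_similarity(sentence_a, sentence_b))
print(cosine_similarity(sentence_a, sentence_c))
```

Because the score depends only on the angle between the vectors, it is unaffected by their lengths: scaling either embedding by a positive constant leaves the similarity unchanged.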

There are several advantages to using deep learning for searching through text.

It's lightning-fast

We’ll concede that some models can be on the slow side when training. However, once deployed, it takes practically no time to encode a new sentence and to compare it to thousands of indexed entries.

Unlike comparing strings character by character, the cosine similarity is computed with pure vectorized math operations. Not only is that very fast, but the comparison itself also doesn’t get slower when you work with longer texts, because every text is encoded as a vector of the same size regardless of its length.
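As a rough sketch of what "pure vectorized math" means here, the comparison of one query against a whole index can be written as a single matrix-vector product. The random vectors below stand in for real embeddings, and the sizes (10,000 entries, 512 dimensions) are arbitrary choices for illustration.

```python
import numpy as np


def cosine_scores(query, index):
    """Cosine similarity between one query vector (d,) and every row of index (n, d)."""
    index = np.asarray(index, dtype=float)
    query = np.asarray(query, dtype=float)
    # Scale every vector to unit length so that a dot product equals the cosine.
    index_unit = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    # One matrix-vector product scores all n entries at once.
    return index_unit @ query_unit


rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 512))            # stand-in for indexed embeddings
query = index[42] + 0.01 * rng.normal(size=512)   # a slightly perturbed copy of entry 42

scores = cosine_scores(query, index)
best = int(np.argmax(scores))
print(best)
```

The cost of this step depends on the number of indexed entries and the embedding dimension, not on how many characters the original texts contained. In practice the unit-length index vectors would be computed once and stored rather than recomputed for each query.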

It’s forgiving

Since the text comparison doesn’t happen at the character level, the comparison can be a lot more tolerant to some differences between your search query and the indexed data.

  • Tolerance to spelling
  • Tolerance to synonyms
  • Tolerance to language
  • Tolerance to style

At the basic level, variations in spelling (like upper case, accents, plurals, or typos) are greatly mitigated when the text is tokenized and converted into vectors. 
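One small part of this can be sketched directly: some tokenizers lowercase the text and strip accents before splitting it. The `normalize` function below is a hypothetical helper that shows only that step, using Python's standard `unicodedata` module. It is not the tokenizer of any particular model, and tolerance to plurals and typos comes from subword tokenization and the learned embeddings rather than from this kind of cleanup.

```python
import unicodedata


def normalize(text):
    """Lowercase the text and drop accents (combining marks)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize("Café"))
print(normalize("CAFE"))
```

After this step, "Café" and "CAFE" are the same string, so they receive the same tokens and therefore the same embedding.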

Then the full power of working in a vector space comes into action.
Synonyms that may look very different when typed, and may even be from different languages, will end up as different yet nearby points in the embedding space.

Even when two sentences use different words, different styles, and different languages, their encoded forms are nearby in the vector space that NLP models work with.

The same benefits apply at the sentence level, so that two sentences written with different words and different styles end up as nearby points in the final sentence embedding, letting you search text by meaning rather than by keyword.

04/ How does the Peltarion Platform find similar text?

You start by uploading a dataset containing a text feature that you want to be able to search through. This feature can contain pieces of text in many languages and range from a few words to entire paragraphs. 

When you create an experiment, you can define exactly which model you want to use to process your text features. As soon as you deploy this model, the Peltarion Platform will use it to index all the text examples contained in your dataset.

The Peltarion Platform can create an index of text examples contained in your dataset, and associate them with their encoded vector form.

Later, when you submit a new piece of text, it will be encoded using the same model. Its embedding is then compared with those from all the text examples contained in your index, and the closest matches can be returned.

This lets you search text with the flexibility of deep learning without having to write a single line of code.

When you make a search query, the query is encoded and can be compared very quickly to everything in your index, and the most similar matches can be returned.
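For readers curious about what such an index does conceptually, here is a minimal sketch of the index-then-search flow described above. It is not the Peltarion Platform's implementation: `TextIndex` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in that hashes character trigrams, so it only captures surface overlap and not meaning. A real deployment would plug a trained sentence encoder into the `encode` argument.

```python
import zlib

import numpy as np

DIM = 256  # arbitrary vector size chosen for this sketch


def toy_encode(text):
    """Stand-in encoder (assumption): hashed character trigrams, unit length.

    It is NOT semantic; it only lets the sketch run without a trained model.
    """
    vec = np.zeros(DIM)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        vec[zlib.crc32(trigram.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


class TextIndex:
    """Stores texts alongside their encoded vectors and ranks them for a query."""

    def __init__(self, texts, encode=toy_encode):
        self.texts = list(texts)
        self.encode = encode
        # Indexing: encode every text once, up front.
        self.vectors = np.stack([encode(t) for t in self.texts])

    def search(self, query, k=3):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scores = self.vectors @ self.encode(query)
        top = np.argsort(scores)[::-1][:k]
        return [(self.texts[i], float(scores[i])) for i in top]


index = TextIndex([
    "How do I reset my password?",
    "Opening hours of the store",
    "Shipping costs to Sweden",
])
results = index.search("reset password", k=2)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

The important design point is that the query must be encoded with the same encoder as the indexed texts; otherwise the two sets of vectors do not live in the same space and their similarities are meaningless.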

05/ Going further




    Romain Futrzynski is an application engineer at Peltarion. He has several years of experience working in Computational Fluid Dynamics, notably ensuring the customer success of simulation engineers at Siemens Digital Industries Software. Romain is passionate about computer science and deep learning, and about sharing knowledge between all branches of AI and Engineering. He has a PhD in Mechanical Engineering from KTH.
