Images and videos may take up a lot of space on the Internet, but with 300 billion emails, half a billion tweets, and over 3 billion Google searches made each day, text is still a big player in digital life.
Let's see how deep learning can help you navigate through endless lines of prose.
Search text by semantic similarity
02/ Crash course in NLP
Natural Language Processing (NLP) is the field of data science that deals with text written in natural human language, as opposed to text written in a programming language or a SQL query.
NLP starts with the tokenization and embedding of text into high-dimensional numerical vectors. Modern tokenization creates units of information much more efficiently than a word-by-word or character-by-character split, and it can often be applied across many languages and alphabets. Text embedding converts these tokens into vectors with the remarkable property that mathematical operations on them reflect their natural meaning.
An NLP model can use these embeddings to solve a task, like identifying the mood of a text, its genre, or its similarity to another piece of text. The state-of-the-art models are transformers that use attention, like BERT and XLM-R, but other models like the Universal Sentence Encoder (USE) can achieve satisfying results with shorter computation times.
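To make tokenization concrete, here is a minimal sketch using the Hugging Face transformers library and a pretrained multilingual tokenizer. Both are illustrative assumptions on our part, not components the platform prescribes.

```python
# Minimal tokenization sketch, assuming the Hugging Face `transformers`
# library; the model name is an illustrative assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Subword tokenization splits rare words into smaller, known units,
# so one compact vocabulary covers many languages and alphabets.
print(tokenizer.tokenize("Tokenization handles unusual words gracefully."))

# Each token maps to an integer id that an embedding layer later
# turns into a high-dimensional vector.
print(tokenizer.encode("Tokenization handles unusual words gracefully."))
```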
03/ What is text similarity?
When text is converted into embedding vectors, an NLP model can perform numerical operations on them to find out various kinds of information. In the case of text similarity, the model is trained not to output a category or a score, but a new embedding vector that encodes the entire input text.
Full models like the USE Embedding snippet and the XLM-R Embedding snippet (SBERT-nli-stsb) are pretrained specifically so that when we compare two of their sentence embeddings by cosine similarity, the result indicates how similar the two input sentences are.

A model encodes natural text as a high-dimensional vector of values. Calculating the cosine similarity between these vectors gives the semantic similarity between different texts.
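As a sketch of how this looks in code, here is one way to compare two sentences with the sentence-transformers library. The model name is an assumption for illustration and is not necessarily what the platform's snippets use under the hood.

```python
# Sketch using the `sentence-transformers` library; the model name
# is an illustrative assumption, not the platform's own snippet.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Encode two sentences into fixed-size embedding vectors.
a, b = model.encode(["How do I reset my password?",
                     "I forgot my login credentials."])

# Cosine similarity: the dot product of the two vectors, normalized
# by their lengths. A value close to 1.0 means "very similar".
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")
```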
There are several advantages to using deep learning for searching through text.
It's lightning-fast
We’ll concede that some models can be on the slow side to train. However, once deployed, it takes practically no time to encode a new sentence and compare it to thousands of indexed entries.
Unlike comparing strings character by character, the cosine similarity is computed with pure vectorized math operations. Not only is that super fast, but it also doesn’t get slower when you work with longer texts.
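For instance, scoring one query against thousands of indexed entries reduces to a single matrix-vector product. The shapes below are illustrative assumptions, and random vectors stand in for real embeddings.

```python
import numpy as np

# Illustrative shapes: 10,000 indexed texts, 512-dimensional embeddings.
# Random vectors stand in for real embeddings here.
rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 512)).astype(np.float32)
query = rng.standard_normal(512).astype(np.float32)

# Normalize once; cosine similarity then becomes a plain dot product.
index /= np.linalg.norm(index, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# One matrix-vector product scores the query against every entry,
# regardless of how long the original texts were.
scores = index @ query          # shape: (10_000,)
print(scores.shape, scores.max())
```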
It’s forgiving
Since the comparison doesn’t happen at the character level, it can be a lot more tolerant of differences between your search query and the indexed data:
- Tolerance to spelling
- Tolerance to synonyms
- Tolerance to language
- Tolerance to style
At the basic level, variations in spelling (like upper case, accents, plurals, or typos) are greatly mitigated when the text is tokenized and converted into vectors.
Then the full power of working in a vector space comes into play.
Synonyms that look very different when typed, and may even come from different languages, will end up as distinct yet nearby points in the embedding space.

Even when two sentences use different words, different styles, and different languages, their encoded forms are nearby in the vector space that NLP models work with.
The same benefits apply at the sentence level: two sentences written with different words and in different styles end up as nearby points in the sentence embedding space, letting you search text by meaning rather than by keyword, as the sketch below illustrates.
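Assuming a multilingual embedding model like the one sketched earlier, this tolerance is easy to observe: a translation pair scores far higher than an unrelated sentence.

```python
# Cross-lingual matching sketch with `sentence-transformers`;
# the model name is again an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

embeddings = model.encode([
    "Where can I find the train station?",    # English query
    "Où se trouve la gare ?",                 # French translation
    "I love eating pancakes for breakfast.",  # unrelated sentence
])

# util.cos_sim returns a matrix of pairwise cosine similarities.
scores = util.cos_sim(embeddings, embeddings)
print(scores[0, 1], scores[0, 2])  # the translation pair scores far higher
```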
04/ How does the Peltarion Platform find similar text?
You start by uploading a dataset containing a text feature that you want to be able to search through. This feature can contain pieces of text in many languages and range from a few words to entire paragraphs.
When you create an experiment, you can define exactly which model you want to use to process your text features. As soon as you deploy this model, the Peltarion Platform will use it to index all the text examples contained in your dataset.

The Peltarion Platform can create an index of text examples contained in your dataset, and associate them with their encoded vector form.
Later, when you submit a new piece of text, it will be encoded using the same model. Its embedding is then compared with those from all the text examples contained in your index, and the closest matches can be returned.
This lets you search text with the flexibility of deep learning without having to write a single line of code.

When you make a search query, it is encoded and compared very quickly against everything in your index, and the most similar matches are returned.
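Behind the scenes, the whole loop can be sketched in a few lines: encode the corpus once, encode each query as it arrives, and return the nearest neighbors. Everything below is illustrative; the platform performs these steps for you behind a deployment.

```python
# End-to-end index-and-search sketch; all names here are illustrative,
# the Peltarion Platform handles these steps behind a deployment.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

corpus = [
    "How do I change my billing address?",
    "What payment methods do you accept?",
    "How can I delete my account?",
]

# Index step: encode every text once and normalize for cosine similarity.
index = model.encode(corpus)
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Query step: encode the incoming text with the same model.
query = model.encode("update the address on my invoice")
query /= np.linalg.norm(query)

# Rank by cosine similarity and return the closest matches.
top_k = np.argsort(index @ query)[::-1][:2]
for i in top_k:
    print(corpus[i])
```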
05/ Going further
Watch
- Webinar: A closer look at text similarity
Build
- Create your first text similarity model: Find similar Google questions - tutorial
- Cheat sheet: Text similarity
Explore
- Get practical: A business view on semantic similarity - blog
- Similarity applied to image search: Image similarity with deep learning explained - blog