Find similar Google questions
Use text similarity to find out what similar questions others have asked.
Autocomplete search queries
Suggest answered similar questions
Identify most common topics
- Target audience: Beginners
- Estimated time: 15 minutes
You will learn to
- Build and deploy a model for text similarity
- Use the output from a specific block in the model
Why text similarity search?
Text similarity is a way to quantify the similarity between two pieces of text, for instance, two questions written in natural language.
Similarity search is faster than comparing text strings directly, and it is more flexible: questions that are worded differently but mean roughly the same thing can still be matched.
Create a project
First, click New project to create a project, and give it a name that tells you what the project is about.
Add the Google Natural Questions dataset to the platform
After creating the project, you will be taken to the Datasets view.
Click the Import free datasets button.
Look for the Google Natural Questions - tutorial data dataset in the list.
Click on it to get more information.
The Google Natural Questions dataset
This dataset consists of over 300,000 questions submitted by real people. You’ll learn how to search it for questions that are similar to any new text input.
If you agree with the license, click Accept and import. This will import the dataset into your project, and you will be taken to the dataset’s details where you can edit features and subsets.
Save the dataset
The dataset is automatically set up, so you just have to click Save version.
Then click Use in new experiment to open up the Experiment wizard.
Build model with the Experiment wizard
Make sure the training and validation subsets are selected.
Inputs / target tab
Select question as Input and index as Target. (The target won’t be used to train a model but we need it to define a complete model graph.)
Select Problem type Text similarity.
Select the USE Embedding snippet.
Finally, click Create and you will find the model in the Modeling view.
The USE Embedding model is already pretrained for text similarity, so we don’t need to train it. We only need to run the experiment once to create the model, and can then move on to the next step.
Click Run and wait until the experiment has finished the default 1 epoch.
Navigate to the Deployment view (you can skip evaluation this time) and click New deployment.
Name the deployment and select Similarity search.
Make sure all settings are what you want them to be:
Experiment: Experiment 1, the experiment you just ran.
Checkpoint: Epoch 0, since we didn’t train the model.
Output features: question and answer. You will get these features back when you search with a new question.
Output block: Sentence embedding.
The platform will begin to index all the questions in the Google Natural Questions dataset. This might take a while.
Click Enable once indexing is finished.
Your experiment is now ready to be called via the deployment API.
How does text similarity search work?
With text similarity search, you compare a new text with all the texts in your dataset to find the most similar ones. Text similarity is a way to quantify how similar two pieces of text are.
First, all the text questions from the dataset are processed by the model during indexing.
The resulting embedding vectors are saved in an index.
To find similar questions, you send a new text to the deployment, where the model processes it in the same way.
The embedding vector for the new question is quickly compared to the index, and the platform returns the most similar questions from the dataset. The lower the distance, the more similar the question, so a distance close to 0 indicates a close match.
Text similarity search is fast since the text is reduced to a small number of numerical values.
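The indexing and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the platform’s implementation: `toy_embed` is a hypothetical stand-in for the pretrained model (it only hashes words into buckets, so it captures word overlap rather than meaning), the three questions are made up, and cosine distance is an assumed metric, since the distance the platform uses isn’t stated here.

```python
import math
import zlib

DIM = 64  # toy vector size, chosen arbitrarily for this sketch


def toy_embed(text):
    """Hypothetical stand-in for the model: hash each word into one of DIM buckets."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip("?.,!")
        if word:
            vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def build_index(questions):
    """Indexing: embed every question once and store the vectors."""
    return [(question, toy_embed(question)) for question in questions]


def search(index, query, top_k=3):
    """Lookup: embed the query and return the top_k (distance, question) pairs."""
    query_vec = toy_embed(query)
    scored = [(cosine_distance(query_vec, vec), question) for question, vec in index]
    scored.sort(key=lambda pair: pair[0])  # lowest distance first
    return scored[:top_k]


# Made-up example questions, not taken from the dataset.
questions = [
    "how tall is mount everest",
    "who wrote the declaration of independence",
    "when was the declaration of independence signed",
]
index = build_index(questions)
results = search(index, "how tall is mount everest", top_k=2)
for distance, question in results:
    print(f"{distance:.3f}  {question}")
```

A real index holding hundreds of thousands of vectors would typically use a dedicated nearest-neighbor structure instead of the linear scan in `search`, but the idea of embedding once and then comparing vectors is the same.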
We’ve made it super easy for you to test the deployment. Click on Test deployment, and you will be directed to our Text similarity - API tester.
All info from your deployment is prepopulated (Token, URL, Text parameter).
You just need to type in some text and then click the Play button to try it.
The most similar question with the lowest distance will be at the top. And if there is an answer to that question, it will be displayed as well.
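If you would rather call the deployment from code than from the API tester, the request could look roughly like the sketch below. Everything specific in it is an assumption to be checked against your deployment page: the URL and token are placeholders, and the bearer-token header and the `rows` payload with a `question` field are a guess at the request shape, not a documented contract. Only standard-library calls are used.

```python
import json
import urllib.request

# Hypothetical placeholders: copy the real URL and token from your deployment page.
DEPLOYMENT_URL = "https://example.com/deployment/your-deployment-id/forward"
DEPLOYMENT_TOKEN = "your-deployment-token"


def build_request(text, url=DEPLOYMENT_URL, token=DEPLOYMENT_TOKEN):
    """Build (but do not send) a POST request carrying one question."""
    # Assumed payload shape: a single row whose input feature is named "question".
    payload = {"rows": [{"question": text}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def find_similar(text):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request is safe offline; find_similar() needs a live deployment.
request = build_request("who invented the telephone")
print(request.get_method(), request.full_url)
```

Keeping `build_request` separate from `find_similar` makes it easy to inspect the payload before anything is sent.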
You’ve added a dataset to your project and built a model based on the pretrained snippet Universal Sentence Encoder (USE Embedding).
You’ve indexed all questions in the Natural Questions dataset with the model.
You’ve deployed your model and tested it to find similar questions that real people have asked.
Classify text in any language
In the Classify text in any language tutorial, we will show you how to use the Peltarion Platform and its Multilingual BERT snippet to create a model that can work with multiple languages simultaneously!
You will learn how to automatically classify text extracts depending on their topic. Mix the available languages for training the model, and test it in any language.
You can get a quick understanding of text similarity search in the blog post Search text by semantic similarity.