Image similarity with deep learning explained
01/ What is image similarity?
02/ What can you do with image similarity?
Finally, if you like to have things well organized, image similarity can be applied to data clustering. This allows you to leverage a combination of explicit information, like the type of clothes, and visual features that are learned implicitly by a deep learning model.
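To give a concrete (if simplified) idea of what that could look like, the embeddings produced by a similarity model can be handed to an off-the-shelf clustering algorithm. The sketch below uses random vectors in place of real embeddings, and k-means is just one possible choice:

```python
# Illustrative sketch: clustering images by their embeddings.
# `embeddings` stands in for the vectors a similarity model would produce.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(100, 64))  # pretend: 100 images, 64-dim embeddings

# Normalizing to unit length makes Euclidean k-means behave like clustering
# by angle, which is consistent with comparing images by cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print(clusters[:10])  # cluster assignments of the first 10 images
```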
03/ How does image similarity work?
The fact that autoencoders are (more or less) continuous functions of their input, and that they can be trained for noise robustness, is very good news for similarity applications. It means that inputs that look the same should have nearly identical compressed representations.
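To make that concrete, here is a minimal (untrained) autoencoder sketch in PyTorch; the 28x28 input size, the 2-dimensional code, and the layer sizes are all illustrative choices, not the article's:

```python
# A minimal autoencoder sketch: an encoder that compresses the image into a
# small code, and a decoder that tries to reconstruct the image from it.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, code_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, 28 * 28),
        )

    def forward(self, x):
        code = self.encoder(x)
        reconstruction = self.decoder(code).view(-1, 1, 28, 28)
        return code, reconstruction

model = AutoEncoder()
x = torch.rand(1, 1, 28, 28)
noisy = x + 0.05 * torch.randn_like(x)  # a small perturbation of the input
code_a, _ = model(x)
code_b, _ = model(noisy)
# For a smooth, noise-robust encoder, these two codes should stay close.
print(torch.dist(code_a, code_b))
```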
However, it could be argued that if the decoder is allowed to be arbitrarily complex, nothing stops the encoder from behaving more like a hash function, scattering visually similar inputs far apart. This would make it difficult to compare the compressed embeddings directly.
That’s why, in practice, the base architecture of similarity models resembles classifier networks more than autoencoders. Arguably this preserves much less information, but it allows embeddings to be compared with cosine similarity. Not to mention that classifier networks are much more widespread and easier to train than autoencoders.
04/ Why use cosine similarity?
When it comes to similarity, “cosine similarity” is on everyone’s lips, as it’s the operation that is almost always used to measure how close two items are. So when you read that two images are 90% similar, it most likely means that the cosine similarity of their compressed representations is equal to 0.90, or equivalently that they are about 26 degrees apart.
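In code, the computation is a dot product plus a normalization. The two vectors below are made up purely to reproduce the 0.90 example:

```python
# Cosine similarity between two embeddings, and the equivalent angle.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.9, 0.44])  # compressed representation of image A
b = np.array([1.0, 0.0])   # compressed representation of image B

sim = cosine_similarity(a, b)
angle = np.degrees(np.arccos(sim))
print(f"similarity: {sim:.2f}, angle: {angle:.0f} degrees")
# similarity: 0.90, angle: 26 degrees
```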
Bringing in trigonometry might seem unnecessary when dealing with deep neural networks, but it comes rather naturally from the way networks calculate. To understand why, let’s have a look at a simple classifier network.
Here’s a classifier that I trained on 3 classes from Animals with Attributes 2. The encoder does the heavy lifting of the image processing and compresses the input into coordinates in an N-dimensional space (here N=2).
The “decoder”, however, is reduced to a single layer and merely acts as a classifier, rather than reconstructing the entire input image. To decide whether a point coming from the encoder belongs to a given category, it can only do a very simple thing: draw a line (or, in higher dimensions, a hyperplane) in the N-dimensional space. If a point is on the right side of this line, it belongs to the category; if it’s on the wrong side, it doesn’t.
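A minimal sketch of this setup might look as follows; the specific layers are illustrative stand-ins, not the network behind the figures:

```python
# Sketch of the architecture described above: an encoder producing
# N-dimensional coordinates (N=2), topped by a single linear layer.
import torch
import torch.nn as nn

class SimilarityClassifier(nn.Module):
    def __init__(self, n_dims=2, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, n_dims),
        )
        # One linear layer: each class corresponds to one line (hyperplane)
        # drawn in the N-dimensional space of encoder coordinates.
        self.classifier = nn.Linear(n_dims, n_classes)

    def forward(self, x):
        coords = self.encoder(x)          # coordinates used for similarity
        logits = self.classifier(coords)  # which side of each class boundary
        return coords, logits

model = SimilarityClassifier()
coords, logits = model(torch.rand(1, 3, 224, 224))
```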
Both the encoder that calculates the coordinates and the classifier that draws the lines are trained together so that every image from a known category can be correctly classified. As we increase the number of known categories (especially if the categories are mutually exclusive), we can see that the points created by the encoder organize themselves in clusters spread around a circle.
As a result, the angle between two points becomes a natural way to compare them. The cosine similarity, which maps an angle between 0° and ±180° to a nice value between 1 and -1, offers an intuitive measure of similarity.
05/ How is image similarity different from classification?
It’s true that image similarity is closely related to image classification, if only because it uses the same classifier networks as its processing workhorse. However, there are a couple of points that make image similarity its own thing.
Image similarity considers many dimensions
In an image classification problem, we are only concerned with figuring out whether or not an image belongs to one (or possibly several) discrete categories, which means that a lot of information is discarded on the way to the final output.
By using the representation of the data just before the final classification layer, many more aspects of the input can be preserved and compared.
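One common way to get that representation with torchvision is to replace the final classification layer of a trained model with the identity; resnet18 here is just an example backbone, not the network from this post:

```python
# Turn a trained classifier into an embedding model by dropping its head.
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)  # load your own trained weights here
backbone.fc = torch.nn.Identity()         # drop the final classification layer
backbone.eval()

with torch.no_grad():
    embedding = backbone(torch.rand(1, 3, 224, 224))
print(embedding.shape)  # torch.Size([1, 512]): the pre-classifier features
```

Images can then be compared by the cosine similarity of the vectors this truncated network produces.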
In the example below, I put a twist on the good old MNIST dataset. I trained a network to recognize handwritten digits, and sure enough the points get clustered according to their category, that is, the digit from 0 to 9. However, I also defined 2 more dimensions that the network had to learn:
- The parity: whether a digit is even or odd
- The magnitude of the digit
The cosine similarity between two images combines all these dimensions and returns a single value that is highest for the same digit, slightly lower for consecutive digits of the same parity, and lowest for digits of different parity.
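As a toy illustration with entirely made-up numbers, imagine embeddings whose components stand for the digit cluster, the parity, and the magnitude; cosine similarity blends them into one score with exactly this ordering:

```python
# Toy embeddings: [digit-cluster x, digit-cluster y, parity, magnitude].
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

three_a = np.array([0.8, 0.1, -1.0, 0.3])  # a "3": odd, smallish magnitude
three_b = np.array([0.7, 0.2, -1.0, 0.3])  # another "3"
five    = np.array([0.1, 0.9, -1.0, 0.5])  # a "5": odd, larger magnitude
four    = np.array([-0.6, 0.4, 1.0, 0.4])  # a "4": even

print(cos_sim(three_a, three_b))  # highest: same digit
print(cos_sim(three_a, five))     # lower: different digit, same parity
print(cos_sim(three_a, four))     # lowest: different parity
```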
Dimensions can be learned implicitly
If you trained a classifier network to recognize dolphins, tigers, and horses, these animals are the “training domain” and you shouldn’t count on classifying anything else. As the image below shows, the network I trained previously would classify most images of zebras as tigers, which is simply wrong.
The cosine similarity combines many dimensions that are learned implicitly (presumably, here, the striped pattern and the presence of legs or grass) into a fluid numeric value. As a result, the same network tells you that zebras look like something between horses and tigers, even though it was never explicitly trained to recognize the class of zebras.
This particular network might find zebras a bit more similar to tigers than you would, but thankfully the network does so using deterministic operations that can be tuned.
Image similarity can be trained differently
If you’re willing to let the network learn implicit dimensions entirely on its own, you might not even need any labeled categories to train on.
Using labels that only indicate whether two images are similar, networks can be trained with contrastive losses. This works whether you know the similarity relationships between every pair of images in a dataset or only between some of the pairs.
Such loss functions are rather different from the loss functions traditionally used to train classification networks. Indeed, their purpose is not to push an input image towards a fixed category but towards another image known to be similar.
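For concreteness, here is a sketch of one classic formulation of such a loss, the margin-based pairwise contrastive loss; the margin value and the embedding size are arbitrary choices:

```python
# Margin-based pairwise contrastive loss: similar pairs are pulled
# together, dissimilar pairs are pushed apart up to a margin.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    # label is 1.0 for pairs known to be similar, 0.0 for dissimilar pairs.
    distance = F.pairwise_distance(emb_a, emb_b)
    pull_together = label * distance.pow(2)
    push_apart = (1 - label) * F.relu(margin - distance).pow(2)
    return (pull_together + push_apart).mean()

emb_a = torch.randn(4, 64)                  # embeddings of 4 images
emb_b = torch.randn(4, 64)                  # embeddings of their paired images
label = torch.tensor([1.0, 0.0, 1.0, 0.0])  # which pairs are similar
print(contrastive_loss(emb_a, emb_b, label))
```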