What’s the matter with context?
If you see the word bank, you might think about a financial institution, the office where your advisor works, the portable battery that charges your phone on the move, or even the edge of a lake or river.
If you’re given more context, as in It's a pleasant walk by the river bank, you realize that bank goes well with river, so it must mean the land next to the water. You also realize that you can walk along this bank, so it probably carries a footpath by the river. The whole sentence adds up to a mental picture of bank.
Self-attention seeks to do the very same thing.
Word embedding
A word like bank, called a token when treated as a fundamental unit of text, is commonly encoded as a vector of real, continuous values: the embedding vector.
Determining the values inside a token's embedding vector is a large part of the heavy lifting in text processing. Thankfully, with hundreds of dimensions available to organize the vocabulary of known tokens, embeddings can be pretrained so that they relate numerically in ways that mirror how their tokens relate in natural language.
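To make this concrete, here is a minimal NumPy sketch of the idea. The tiny vocabulary, the dimension of 8, and the random (untrained) embedding matrix are all assumptions made for illustration; real models use vocabularies of tens of thousands of tokens, hundreds of dimensions, and pretrained weights.

```python
import numpy as np

# Hypothetical toy vocabulary: each token maps to a row of the embedding matrix.
vocab = {"river": 0, "bank": 1, "walk": 2, "money": 3}
d_model = 8  # embedding dimension (hundreds in real models)

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))  # one row per token

def embed(token: str) -> np.ndarray:
    """Look up the embedding vector for a token."""
    return embedding_matrix[vocab[token]]

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Measure how closely two embeddings point in the same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(embed("bank").shape)  # (8,)
# With pretrained embeddings, related tokens such as "river" and "bank"
# would score noticeably higher than unrelated pairs; here the weights
# are random, so the score is meaningless.
print(cosine_similarity(embed("river"), embed("bank")))
```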
How do we contextualize the embeddings?
The key to state-of-the-art performance in Natural Language Processing (NLP) is to transform these embeddings so that they form the right numerical picture of the tokens in any given sentence.
This is exactly what the scaled dot-product self-attention mechanism does, elegantly and with (mostly) a few linear-algebra operations.
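As a preview, here is a minimal single-head NumPy sketch of scaled dot-product self-attention. The projection matrices W_q, W_k, W_v and the toy dimensions are illustrative assumptions, not the configuration of any particular model.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_self_attention(X: np.ndarray,
                                      W_q: np.ndarray,
                                      W_k: np.ndarray,
                                      W_v: np.ndarray) -> np.ndarray:
    """Contextualize token embeddings X of shape (n_tokens, d_model).

    Each output row is a weighted mix of value vectors, where the weights
    come from how well a token's query matches every token's key.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # project the embeddings
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot products
    weights = softmax(scores, axis=-1)    # attention weights per token
    return weights @ V                    # context-aware embeddings

# Toy example: 4 tokens, d_model = 8, random stand-in embeddings and weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
contextualized = scaled_dot_product_self_attention(X, W_q, W_k, W_v)
print(contextualized.shape)  # (4, 8): one contextualized vector per token
```

In a trained model, the projection matrices are learned, so the attention weights for bank in It's a pleasant walk by the river bank would lean toward river and walk, pulling the output embedding toward the riverside meaning.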