Thankfully, words can be approximated as numerical vectors, which gives a computer a chance of understanding language. A vector in a 3-dimensional coordinate system could look like <2.2, 4.2, 1.3>, and you could imagine your living room as the space where all such vectors live. The particular vector <2.2, 4.2, 1.3>, representing the word “cat”, is located behind the TV, and a neighbouring vector <2.2, 4.0, 1.3>, representing “kitten”, sits just below the TV. The vector <0.0, -3.1, 1.8> representing “smartphone”, on the other hand, is located under the sofa. In this picture, perhaps the second element of a vector measures how much “cat-ness” a particular word has.
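To make the living-room picture concrete, here is a minimal sketch using the toy numbers above (which are, of course, made up) that measures how far apart the three word vectors are:

```python
import numpy as np

# Toy 3-dimensional word vectors from the living-room example above.
words = {
    "cat":        np.array([2.2,  4.2, 1.3]),
    "kitten":     np.array([2.2,  4.0, 1.3]),
    "smartphone": np.array([0.0, -3.1, 1.8]),
}

def distance(a, b):
    """Euclidean distance between two word vectors."""
    return np.linalg.norm(words[a] - words[b])

print(distance("cat", "kitten"))      # ~0.2 -> behind and below the TV, close together
print(distance("cat", "smartphone"))  # ~7.6 -> the sofa is far away from the TV
```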
There exist various algorithms to represent words as vectors, such as Word2Vec, GloVe, and fastText (available on our platform). Instead of the 3 dimensions of our living room, a word is typically represented by a vector with 300 dimensions, where each axis captures some piece of textual information.
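If you want to poke at real 300-dimensional vectors yourself, one way to do so outside our platform is the pre-trained fastText vectors that ship with gensim’s downloader. This is just a sketch, assuming you have gensim installed; the download is large and may take a while the first time:

```python
import gensim.downloader as api

# Pre-trained fastText vectors with 300 dimensions (Wikipedia + news data).
model = api.load("fasttext-wiki-news-subwords-300")

cat = model["cat"]
print(cat.shape)  # (300,) -- one axis per "living-room direction"
print(cat[:5])    # the first few coordinates of the word "cat"
```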
Can you imagine your living room in 300 dimensions?
As you may have noticed, vectors with similar semantics (meaning) are located close to each other, while vectors that lie further apart share less semantic meaning.
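The same neighbourhood effect shows up in real embeddings. Continuing with the pre-trained vectors from the previous snippet (again assuming gensim and that particular model):

```python
import gensim.downloader as api

model = api.load("fasttext-wiki-news-subwords-300")  # same vectors as above

# Cosine similarity: close to 1.0 means very similar, near 0 means unrelated.
print(model.similarity("cat", "kitten"))      # high -- neighbours in vector space
print(model.similarity("cat", "smartphone"))  # low  -- far apart

# The nearest neighbours of "cat" should be other feline-ish words.
print(model.most_similar("cat", topn=5))
```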
A slight problem with these algorithms is that they don’t consider the context or the ordering of the words. The sentences “the owner feeds his cat” and “the cat feeds his owner” would therefore be represented by exactly the same vector when taking the average of the word vectors. In some cases this is totally fine, but thanks to recent developments in natural language processing, new representations exist that actually take the context and ordering of words into account. To name a few: ELMo, ULMFiT, GPT-2, and BERT, which is available on our platform as well.
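The order-blindness of averaged word vectors is easy to see in code. The sketch below uses small random stand-in vectors instead of real embeddings (the actual values don’t matter for the point) and shows that both sentences end up with exactly the same average:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in word vectors; with real embeddings the effect is identical.
vocab = {w: rng.normal(size=300) for w in ["the", "owner", "feeds", "his", "cat"]}

def sentence_vector(sentence):
    """Represent a sentence as the average of its word vectors."""
    return np.mean([vocab[w] for w in sentence.split()], axis=0)

a = sentence_vector("the owner feeds his cat")
b = sentence_vector("the cat feeds his owner")

print(np.allclose(a, b))  # True -- averaging throws away the word order
```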
This sounds great, but what about the surfers from Indonesia and California?