
A language-agnostic model that can handle surf lingo

March 25 2020

Translation of everyday phrases has come a long way over the last few years. But as soon as you step into a niche use of language, specific to a certain industry or expertise, it gets much trickier. You can't simply enter financial or automotive-specific phrases into Google Translate and expect the outcome to be spot on.

One of my big passions is surfing, so I will talk about what deep learning can do for language, from a surfer's point of view.

When people hear about Sweden, they tend to think of a rather cold country up in the north associated with skiing, Abba, and perhaps Zlatan. Probably not a destination people have in mind when looking for places to go and surf.

I spend a lot of time surfing in Sweden, especially at Torö, which is located at the southern tip of our archipelago. Over the years of surfing in Sweden, I have stumbled upon some dedicated cold-water surfers, and four years ago we created a WhatsApp chat group that now has seven members. We tend to chat almost every day, mostly about wind forecasts, but plenty of topics unrelated to surfing come up as well; we are starting to get close as friends.

Not all the people in the chat are native Swedish speakers, but we mainly text in Swedish. We have one guy from Bali and another from California. They tend to be a bit quiet in the chat due to language barriers, but are very talkative in person. One could think that this is a perfect case for using Google Translate or similar, or that we, like the typically polite Swede, should simply chat in English.

Probably the best way to be part of a group is to use the same language, and especially the same jargon: to use the right words in the right context, so that you sound like you have been part of the group for a long time. Since every group has its own way of expressing itself, a simple translation would be too generic; you would not sound cool enough.

Photographer: @kookshred

Thankfully, words can be approximated as numerical vectors, meaning a computer has a chance of understanding language. A vector in a 3-dimensional coordinate system could look like <2.2, 4.2, 1.3>, and you could imagine your living room as the space where all the vectors can exist. The particular vector <2.2, 4.2, 1.3> representing the word “cat” is located behind the TV, while a neighbouring vector below the TV, <2.2, 4.0, 1.3>, represents “kitten”. But the vector <0.0, -3.1, 1.8> representing “smartphone” is located under the sofa. In this case, perhaps the second element of the vector captures how cat-like a particular word is.
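As a toy sketch, here is that living-room geometry in code. The three vectors are the made-up ones from the paragraph above, not output from any real model:

```python
import math

# Made-up 3-dimensional "living room" vectors from the example above
vectors = {
    "cat":        (2.2, 4.2, 1.3),
    "kitten":     (2.2, 4.0, 1.3),
    "smartphone": (0.0, -3.1, 1.8),
}

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(vectors["cat"], vectors["kitten"]))      # 0.2 — neighbours
print(euclidean(vectors["cat"], vectors["smartphone"]))  # ~7.6 — far apart
```

"Cat" and "kitten" sit a short hop apart, while "smartphone" is on the other side of the room.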

Various algorithms exist to represent words as vectors, such as Word2Vec, GloVe, and fastText (available on our platform). A word can be represented by a vector with 300 dimensions instead of the 3 in our living room, where each axis captures some textual information.

Can you imagine your living room in 300 dimensions?

As you may have noticed, vectors with similar semantics (meaning) are located nearby, while vectors further apart share less semantic meaning.

A slight problem with these algorithms is that they don't consider the context or the ordering of the words; hence the sentences “the owner feeds his cat” and “the cat feeds his owner” would be represented by the exact same vector when taking the average of the word vectors. In some cases this is totally fine, but thanks to recent developments in natural language processing, new representations that actually consider the context and ordering of words exist. To name a few: ELMo, ULMFiT, GPT-2, and BERT, which is available on our platform as well.
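A quick sketch of why averaging loses the ordering. The 4-dimensional word vectors here are randomly generated stand-ins, not real embeddings:

```python
import random

# Toy 4-dimensional word vectors: deterministic random stand-ins
# for real ~300-dimensional embeddings such as fastText's
def toy_vector(word):
    rng = random.Random(word)  # the same word always gets the same vector
    return [rng.uniform(-1, 1) for _ in range(4)]

def average_vector(sentence):
    """Bag-of-words sentence vector: the mean of its word vectors."""
    vecs = [toy_vector(w) for w in sentence.lower().split()]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

a = average_vector("the owner feeds his cat")
b = average_vector("the cat feeds his owner")

# The two averages are numerically identical: word order is thrown away
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True
```

Both sentences contain the same bag of words, so their averaged vectors coincide even though the meanings are very different.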

This sounds great, but what about the surfers from Indonesia and California?

Photographer: @kookshred

So apparently there exist techniques in natural language processing (NLP) to represent text as vectors that even consider the context and ordering of words. Does this mean each language needs its own model?

As it turns out, not really: we can now build deep learning models that learn the semantics of several languages simultaneously. These models can represent sentences with similar meaning in different languages as vectors close to each other in your living room. (Accepting submissions for witty IKEA jokes here.)

“The cat sat on the mat” and its Swedish counterpart “Katten satt på mattan” would be located very close to each other in a thousand-dimensional coordinate system, and this happens to work for languages with very different structures as well. The model used in this example was trained on almost one hundred different languages. You see where this is going?

But wait a second, how do I know which vectors in my high-dimensional multilingual living room are closest to the cat vector typed in English? A carpenter's rule in this context could be Euclidean distance or cosine similarity.
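A minimal sketch of the cosine-similarity ruler, reusing the invented living-room vectors from earlier:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat        = (2.2, 4.2, 1.3)
kitten     = (2.2, 4.0, 1.3)
smartphone = (0.0, -3.1, 1.8)

print(cosine_similarity(cat, kitten))      # very close to 1.0
print(cosine_similarity(cat, smartphone))  # negative: pointing away
```

Unlike Euclidean distance, cosine similarity only cares about the direction of the vectors, not their length, which is often what you want when comparing embeddings.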

Photographer: @kookshred

As a hands-on example, the WhatsApp group's 4 years of daily data, roughly 50 000 texts, was mapped into the multilingual semantic space using one of the most recent models in NLP (soon to be available on the Peltarion platform). The model was then made fully accessible to our dear non-Swedish surfers through a user interface and API.
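Under the hood, the retrieval step can be sketched like this: embed every chat message once, embed the incoming query, and rank messages by cosine similarity. The 3-dimensional vectors below are invented stand-ins for what a multilingual sentence encoder would actually produce:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Chat messages with pretend embeddings (a real multilingual encoder
# would produce ~1000-dimensional vectors, one per message)
chat_history = {
    "Tja gubbs! Tror ni på surf imorgon?":                     (0.9, 0.1, 0.2),
    "Chans för episk surf eller är det vanligt stökigt torö?": (0.8, 0.2, 0.3),
    "Jag köpte en ny telefon idag":                            (0.1, 0.9, 0.1),
}

# Pretend embedding of the English query "Surf tomorrow?"
query_vector = (0.85, 0.15, 0.25)

# Rank all messages by similarity to the query, best match first
ranked = sorted(chat_history,
                key=lambda msg: cosine(chat_history[msg], query_vector),
                reverse=True)
for msg in ranked:
    print(msg)
```

The two surf-related Swedish messages surface at the top, while the message about buying a phone ends up last, even though the query was typed in English.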

Queries to the model in English and Indonesian

Californian surfer: “Surf tomorrow?” 

Indonesian surfer: “Berselancar besok?”

Ranked top 3 suggestions from 50 000

“Tja gubbs! Tror ni på surf imorgon?” 

(Hey lads, do you believe in surfing tomorrow?)

“Chans för episk surf eller är det vanligt stökigt torö?” 

(Chance for epic surf or is it just the usual messy Torö?)


“Är det någon som tänkte dra en sesh imorn btw?”

(Is anyone thinking of (pulling?) a session tomorrow by the way?)

All three suggestions have similar meaning with varying formulations, and the group's jargon-specific writing style comes through. The non-Swedish surfers could simply pick one of the phrases and sound like they were part of the group.

We will continue to work on the latest NLP models and make sure the platform stays up to date with the latest capabilities. I'll keep surfing, and keep trying to bridge the language barriers in surfing with the help of deep learning. What niche language problem is relevant for you? Maybe our platform can help bridge it.

Start with one of our available datasets, or upload one of your own.

  • Tim Isbister


Tim Isbister is a member of the applied AI team at Peltarion, where he works as a data scientist. He has a strong interest in NLP and in being involved in the whole loop of creating value for clients. He previously worked with the Swedish Defence on author-profiling-related subjects, and holds a degree in Computer Science from Uppsala University.
