Natural language processing (NLP) is one of the most well-known applications of AI. It dates back all the way to 1954 and the Georgetown experiment, when a group of scientists was able to program a computer to translate 60 sentences from Russian into English. They were very happy with the results and believed that in five years’ time we would have a machine that could translate any sentence to any language. As we all know, that did not happen; it turned out that machine translation and NLP were significantly more difficult than imagined.
A year in review: NLP in 2019
Hand-writing rules for NLP is too difficult, and purely statistical approaches such as TF-IDF are not sufficient either. Machine learning techniques are needed so that a model can learn automatically how to predict, for example, what a translation should be. For images, deep learning has made significant progress, starting in 2012 with the AlexNet model. Since then, a rapid succession of models has improved our ability to understand images using AI.
For NLP, Google released the Transformer model in 2017, setting new state-of-the-art results for machine translation. It changed how deep learning is applied to text by introducing self-attention as an alternative to the otherwise popular convolutional and recurrent neural networks. In late 2018, the BERT model was released, building upon the Transformer architecture. That model broke state-of-the-art records for many different NLP tasks at once, such as text similarity, question answering and language inference. For text, this was a revolution similar to what the AlexNet model was for images back in 2012.
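To make the idea of self-attention a little more concrete, here is a minimal sketch in plain Python. It is a simplification and not the Transformer implementation itself: it assumes a single attention head and uses the token vectors directly as queries, keys and values, whereas the real model learns separate projections for each. The function names and the toy input vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Simplified scaled dot-product self-attention.

    tokens: list of equal-length vectors, one per token.
    Returns (outputs, weights). Assumption: queries, keys and values are
    the token vectors themselves (no learned projections, one head).
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, tokens)) for i in range(dim)]
        )
    return outputs, weights


# Toy 2-dimensional "token" vectors, invented for this example.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_outputs, toy_weights = self_attention(toy_tokens)
```

Every token attends to every other token in one step, which is what distinguishes this approach from a recurrent network that reads the sequence one position at a time.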
Transformer-based NLP models
The progress of BERT and other Transformer-based models for NLP was remarkable in 2019. New models based on BERT appeared almost every month, such as XLNet, RoBERTa, ERNIE, XLM-R, and ALBERT. Perhaps one of the most talked-about models was OpenAI’s GPT-2, which was strikingly good at generating text. I recently found an article that used the model to generate predictions about AI in 2020, and it demonstrates how GPT-2 can give surprisingly well-formulated answers. Here are some examples from the article:
- Human: “What is the future of AI?”
GPT-2: “It would be good if we used the technology more responsibly”.
- Human: “Are you worried that AI technology can be misused?”
GPT-2: “Yes, of course. But this is a global problem and we want to tackle it with global solutions”
- Human: “Do you think Donald Trump will win a second term?”
GPT-2: “I think he will not win a second term. I think he will be defeated in the general election.”
NLP is one of the hottest trends in AI right now and all the scientific advances from 2019 will result in an increasing number of industrial NLP applications in 2020. With the latest NLP models, it is possible to understand the semantics of text to a degree never seen before. See figure 1 below for a screenshot from a Peltarion prototype, making use of a BERT model trained on the QQP dataset.
The prototype returns the most similar sentences from the 400k questions in the QQP (Quora Question Pairs) dataset. In the screenshot, the query “Will artificial intelligence end the world” is given to the prototype and the most similar questions from QQP are returned. The first answer is very similar to the query both syntactically, with many words in common, and semantically. Even more interesting, however, is the second most similar question, “Is technology destroying humanity”. It has no words in common with the query, yet the semantics of the question are still very similar. This type of semantic understanding of text has not been possible before.
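The difference between word overlap and semantic similarity can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the three-dimensional vectors below are invented stand-ins for the sentence embeddings a BERT model would produce, the unrelated third sentence is made up, and the function names are hypothetical.

```python
import math


def word_overlap(a, b):
    """Jaccard similarity between the sets of lowercased words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Sort (text, vector) candidates by cosine similarity, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates]
    return sorted(scored, reverse=True)


query = "Will artificial intelligence end the world"
second_hit = "Is technology destroying humanity"
unrelated = "How do I bake sourdough bread"  # made-up distractor

# Hypothetical embeddings, invented for illustration only.
toy_embeddings = {
    query: [0.9, 0.8, 0.1],
    second_hit: [0.8, 0.9, 0.2],
    unrelated: [0.1, 0.0, 0.9],
}

overlap = word_overlap(query, second_hit)  # the two sentences share no words
ranking = rank_by_similarity(
    toy_embeddings[query],
    [(second_hit, toy_embeddings[second_hit]), (unrelated, toy_embeddings[unrelated])],
)
```

With these toy vectors the word overlap between the two questions is zero, yet the embedding-based ranking still places “Is technology destroying humanity” ahead of the unrelated sentence.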
Swedish NLP research projects
A number of AI initiatives were started in Sweden during 2019, all aiming to accelerate AI adoption in the country. I have personally been involved in several of them, including developing an AI Agenda for the Swedish government and starting to build up both national and regional nodes through AI Innovation of Sweden. Vinnova is funding a large number of these initiatives, and it’s great to see that they have decided to fund a number of NLP research projects. Peltarion is part of many of them, including building a Swedish language data lab, a medical language data lab and a large three-year research project to build a Swedish language model for Swedish authorities.
We are just getting started with these projects and expect to make significant progress during 2020. We have already launched an English BERT model on the Peltarion Platform and are going to publish a BERT-based model for Swedish during 2020. Note that it will probably not be the original BERT but one of its extensions, such as ALBERT or XLM-R. We have already started to experiment with XLM-R, which is a BERT-style model trained on many different languages at once. The exciting news with XLM-R is that a model trained on many languages can outperform models trained for a single specific language. Initial experiment results indicate that XLM-R can outperform a vanilla BERT model trained specifically for the Swedish language.
An important research question for this project is how to create task-specific datasets for tasks such as natural language inference, text classification, similarity and sentiment analysis. XLM-R was developed by Facebook and evaluated on a dataset called XNLI. We are considering extending that dataset with Swedish and have already received confirmation from Facebook that they could update their dataset if we were to extend it. Great news indeed - expect a lot of other contributions to a Swedish language model coming in 2020!
Apart from the large technology giants, i.e., GAFA companies such as Google, Amazon and Facebook, most companies still struggle to operationalize the latest AI and NLP techniques. Why? The tooling is still immature. It is made by GAFA companies for GAFA companies. This is one of the core problems that we at Peltarion are working hard to solve. We launched an initial BERT model on the Peltarion Platform late in 2019 and we expect to see many companies start to operationalize NLP and other AI techniques in 2020.
There has been a lot of discussion about data safety, privacy and ethics in AI in 2019, and the topic is especially important for NLP. Various groups have made significant progress in these areas, such as the European Commission's High-Level Expert Group on AI, which presented its Ethics Guidelines for Trustworthy AI. However, it is also important to turn the question around and ask “How ethical is it not to use AI?” What happens if we do not take advantage of the possibilities of AI to keep Swedish companies competitive and to address the societal challenges we face, such as improving healthcare through reduced time to treatment and fighting climate change?