
Extending BERT-like Language Models for Multiple Languages and Longer Documents

August 3, 2021 / 5 min read

The progress of AI research has exploded in the last couple of years, with one milestone being the BERT model released by Google. Near human-level performance has been achieved on several tasks, ranging from text classification to question answering and beyond. BERT has set the trend for text-based AI models, and for what these models can do, ever since.

However, most of these AI models are trained only on short English sentences, which limits their use. We have released a new AI model for text that makes these models applicable in more settings and languages. This blog post explains what it is and how it could benefit you.

02/ TL;DR

We created a language model, called XLM-Long, that can handle text passages 8x longer than the standard BERT or RoBERTa model, in over 100 different languages. It is based on the multilingual XLM-R model, converted into a long-context Transformer called the Longformer. The model is available on the Hugging Face model hub and, in the future, here on the Peltarion platform.

03/ AI Models for Texts

A large part of our lives is spent creating, parsing, and reading text. An intelligent system that can understand text and automate processes based on it already has, and will continue to have, huge potential for us as a species. Models existed before BERT, but they were often handcrafted to solve one particular problem, which meant that new models had to be trained from scratch even for related tasks. Two of the aspects that made BERT and subsequent Transformer-based models so performant and promising for text-based problems are transfer learning, which lets a model carry knowledge learned on one task over to another, and a concept called attention.

04/ Pay Attention: How Models Learn Languages

Training an AI model is learning by doing. Before training, an AI model has no knowledge of what a language is or how it is structured. It learns this through continual exposure to languages and tasks, with constant feedback on whether its guesses are correct. Eventually, it learns how a language is constructed and, using similar techniques, can learn to solve specific tasks. But languages are complex: the same word can have different meanings depending on the context. Mapping the importance of each word to its context is called attention, and it is how language models efficiently learn languages and tasks.

This is how attention works: the model looks at a window of N words at a time and, for each word, first tries to understand how important it is relative to the other N words in the window, and then how important it is for the task we want to solve. For those interested in learning more about how attention works, I recommend these two blog posts.
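To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the building block the post refers to. The sentence length and vector sizes below are illustrative toy values, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K, V each have shape (num_words, dim). The intermediate
    (num_words, num_words) score matrix maps the importance of every
    word to every other word in the window.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # pairwise word-to-word importance
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_words, dim = 6, 8                     # a tiny toy "sentence"
Q = rng.normal(size=(n_words, dim))
K = rng.normal(size=(n_words, dim))
V = rng.normal(size=(n_words, dim))

out, weights = attention(Q, K, V)
print(out.shape)      # (6, 8): one context-aware vector per word
print(weights.shape)  # (6, 6): importance of every word to every other
```

The `(num_words, num_words)` weight matrix is exactly the "importance of each word to its context" described above, and it is also where the cost of attention comes from, as the next section explains.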

05/ The Limitation of Attention

Though these models learn very powerful representations, their architecture has an inherent limiting factor. Because computing attention is so computationally expensive, most models limit the input to around 512 tokens (sub-parts of words). And because the models train with this short attention span, they never learn long-range dependencies across long passages. More recent research has focused on improving this in existing Transformers, for example the Reformer, Transformer-XL, and Longformer models. These models, as with most deep learning research, have focused exclusively on English, limiting their real-world use. To use these models on, e.g., Swedish data, we would need vast amounts of long-context Swedish data, which might not be available. This process would then need to be repeated for every new language we want to use the model for.
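A back-of-the-envelope calculation shows why the 512-token limit exists. Full self-attention stores one score per pair of tokens, so its cost grows with the square of the sequence length (the 512 and 4096 lengths are from this post; counting raw score entries is my simplification, ignoring heads and layers):

```python
# Full self-attention computes one score per pair of tokens,
# so cost grows quadratically with sequence length.
def attention_scores(seq_len: int) -> int:
    return seq_len * seq_len

short = attention_scores(512)   # standard BERT/RoBERTa context
long = attention_scores(4096)   # 8x longer context, as in XLM-Long

print(short)          # 262144 score entries
print(long)           # 16777216 score entries
print(long // short)  # 64: an 8x longer input costs 64x more per attention layer
```

That 64x blow-up for an 8x longer input is why models like the Longformer replace full attention with sparser patterns rather than simply enlarging the window.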

It is not feasible to re-purpose even a fraction of the research models for the over 7,000 languages spoken in the world. However, some recent models (XLM-R and mBERT) instead train a single model on over 100 of the most common languages. These models have been shown to be on par with existing monolingual models, and even to outperform them in some instances. Additionally, they have proven to adapt surprisingly well to new languages never seen by the model.

Training a language model from scratch takes a long time, a long-context model even more so, and a multilingual model vastly more. We reasoned that the ideal way to create a model with longer context in multiple languages would be to start from a multilingual model and extend its context with additional training. Luckily, such methods do exist: one of them is the Longformer.

06/ The Longformer

The Longformer, made by researchers at Allen AI, is a Transformer-agnostic method that can re-use part of the attention mapping a language model such as BERT has already learned, and re-purpose it to efficiently handle longer passages. After a model's attention span has been extended, it is trained on a dataset of long passages so that it learns the new mappings quickly. Using this method, we extended the attention span of a multilingual XLM-R model to 8 times its original size and pre-trained it on the English long-context dataset WikiText-103. To verify that the model had learned the mapping, we fine-tuned it on several question-answering tasks in multiple languages.
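One concrete part of re-using the learned attention mapping is growing the model's learned position embeddings. A common trick, used in the original Longformer work, is to initialise the longer embedding table by copying the original one several times instead of starting from random values. Here is a minimal NumPy sketch of that idea; the 512/768 sizes match XLM-R base, but the random matrix below is a stand-in for real learned weights:

```python
import numpy as np

def extend_position_embeddings(pos_emb: np.ndarray, factor: int) -> np.ndarray:
    """Initialise a longer position-embedding table by tiling the original.

    pos_emb has shape (old_max_len, hidden_dim); the result has shape
    (old_max_len * factor, hidden_dim). Copying the learned embeddings,
    rather than re-initialising randomly, lets the extended model start
    from positional patterns it has already learned, so the follow-up
    long-context training converges quickly.
    """
    return np.tile(pos_emb, (factor, 1))

# Stand-in for learned embeddings of a 512-token model with hidden size 768
old = np.random.default_rng(0).normal(size=(512, 768))
new = extend_position_embeddings(old, factor=8)  # 8x longer context

print(new.shape)  # (4096, 768)
```

Every 512-row block of the new table is an exact copy of the original, which is why only a short additional pre-training run (here, on WikiText-103) is needed rather than training from scratch.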

The reasoning behind this is that question answering generally requires a deep understanding of language, and it is also the task with the most data available in multiple languages.

07/ How I Encountered the Problem

I first encountered the problem while creating a website that would let users upload their own documents and search them by asking specific questions about them. The goal was to create an application that would make it easier for people, like my grandpa, to quickly find out, in a terms-of-service agreement or similar document, what they are and aren't allowed to do on a website - taking document search into the 21st century.

Along with some friends, I created a simple app that showed promise but was not as performant as we had expected. From this initial experimentation we noted, among other things, the limiting attention span of current pre-trained Transformer models and the (at the time) limited number of languages a model could handle. Since then, there has been a huge initiative to democratize these models to more languages, and researchers are actively searching for the optimal architecture for performance and long context. In Sweden, research initiatives such as VINNOVA and Wallenberg AI (WASP) have bolstered the effort and progress of making AI models available for everyone in Swedish. The outcome of this thesis is a direct result of VINNOVA's financing of Swedish language models.

08/ We at Peltarion

At Peltarion, we strive to advance humankind by making AI models applicable to real-world problems and available to everyone. To that end, we have released the pre-trained model on the Hugging Face hub, along with the source code, for those interested in using the model or replicating the results.

    • Markus Sagen


      AI Research Engineer

Markus is an AI research engineer at Peltarion with an M.Sc. in Information Technology and Machine Learning from Uppsala University. At Peltarion, he focuses mainly on natural language processing and deep learning integration in research and applications. Apart from NLP, he has a special interest in information retrieval systems, time series analysis, audio processing, software development, and reinforcement learning.