Towards a token-free future in NLP
The current state-of-the-art approach to text-based problems starts by splitting sentences into sequences of tokens. Relying on tokens is, for the most part, a necessary evil. Recent approaches have shown that it is viable to drop learned tokens altogether and instead operate directly on the raw text. This blog post highlights what token-free models are and why they are a big deal. The future may very well be token-free!
03/ The Problem
04/ Token-Free models in theory
The token IDs are then embedded in a vector space to encode more information and enable comparison between words; this is the so-called word embedding. In Transformer-based models, this embedding is part of the model itself and is therefore a contextual word embedding. Over multiple layers, the model then learns how the different tokens relate to each other.
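To make the embedding step concrete, here is a minimal sketch of an embedding lookup: each token ID indexes a row in a learned matrix. The vocabulary size, dimensionality, and token IDs below are made up for illustration, and the contextual part (the Transformer layers on top) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 30_000, 8                           # hypothetical vocabulary and embedding sizes
embedding_table = rng.normal(size=(vocab_size, dim))  # in a real model, learned during training

token_ids = [101, 7592, 2088, 102]                    # made-up token IDs for one sentence
vectors = embedding_table[token_ids]                  # one vector per token
print(vectors.shape)                                  # (4, 8)
```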
The general process for how text is represented and passed to a model is similar with and without a tokenizer, and can be described as:
- Represent the text as numbers (via tokenization, or by converting the text to Unicode code points or UTF-8 bytes); see the sketch after this list.
- Create a word embedding out of the numerical representation to capture complexity and enable comparison between words.
- Reduce the number of words or sub-words the model needs to represent. This can happen before the text is encoded, such as splitting the text into sub-words with a tokenizer, or by reducing the number of characters to represent in the word embedding via mean pooling (Charformer), strided convolution (CANINE), or a reduced self-attention cost (Perceiver).
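As a concrete illustration of the first step, text can be turned into numbers without any learned vocabulary at all, using only Python built-ins; the example string is arbitrary.

```python
text = "Tokenizers? Где? 🙂"

code_points = [ord(c) for c in text]      # one integer per character (CANINE-style)
utf8_bytes = list(text.encode("utf-8"))   # one integer in 0-255 per byte (ByT5-style)

print(code_points)  # non-ASCII characters stay single integers as code points
print(utf8_bytes)   # the Cyrillic and emoji characters expand to several bytes each
```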
05/ Token-Free Models
CANINE is the first token- and vocabulary-free model, based on a hashing and downsampling strategy that works directly on the characters as Unicode code points. CANINE was evaluated on the TyDi QA benchmark and outperformed other multilingual models, such as mBERT, while having no predefined tokenization and 28% fewer parameters.
The downsampled representation is then passed through a regular stack of Transformer encoders (as in mBERT or XLM-R). Depending on the task, either the output at the first position (the [CLS] token) is used for classification, or the output sequence is upsampled back to the original length of 2048 for sequence-level predictions.
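Below is a highly simplified sketch of the two CANINE ideas mentioned above: hashing code points into embeddings and downsampling the sequence. The sizes, toy hash functions, and mean pooling are illustrative stand-ins; the real model uses more hash buckets, a local transformer, and a learned strided convolution for downsampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; CANINE uses more hash functions, buckets, and dimensions.
NUM_HASHES, NUM_BUCKETS, DIM = 4, 1000, 64
SLICE = DIM // NUM_HASHES
tables = [rng.normal(size=(NUM_BUCKETS, SLICE)) for _ in range(NUM_HASHES)]
PRIMES = [31, 43, 59, 61]  # arbitrary multipliers for the toy hash functions

def embed_codepoints(text: str) -> np.ndarray:
    """Map each character to a vector via hashing - no vocabulary lookup needed."""
    rows = []
    for cp in (ord(c) for c in text):
        # Each hash function picks a bucket; the per-hash slices are concatenated.
        slices = [tables[k][(cp * PRIMES[k]) % NUM_BUCKETS] for k in range(NUM_HASHES)]
        rows.append(np.concatenate(slices))
    return np.stack(rows)                               # (num_chars, DIM)

def downsample(x: np.ndarray, rate: int = 4) -> np.ndarray:
    """Stand-in for CANINE's strided convolution: mean over blocks of `rate` chars."""
    n = (len(x) // rate) * rate
    return x[:n].reshape(-1, rate, x.shape[-1]).mean(axis=1)

chars = embed_codepoints("Token-free models")
print(chars.shape, downsample(chars).shape)             # (17, 64) (4, 64)
```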
ByT5 is a variant of the multilingual T5 model, mT5, but operates directly on the UTF-8 byte encoding of the raw text. The architecture is otherwise similar to mT5, but the split between encoder and decoder layers is no longer 50/50; instead, there are about three times as many encoder layers as decoder layers. The hypothesis is that token-free models need deeper encoder stacks to make up for the decreased embedding capacity of the vocabulary.
ByT5 outperforms mT5 in most multilingual tasks, especially for smaller models or when dealing with misspelled or noisy data, and is 50-100% faster. When the model is trained on a single language, such as English, mT5 and the regular T5 perform better.
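Feeding text to a byte-level model like ByT5 essentially reduces to UTF-8 encoding plus an offset that reserves a few IDs for special tokens. The offset of 3 below assumes IDs reserved for padding, end-of-sequence, and unknown; treat the exact values as an assumption of this sketch rather than the model's definitive convention.

```python
SPECIAL_TOKENS = 3  # assumed reservation: 0 = pad, 1 = end-of-sequence, 2 = unknown

def text_to_byte_ids(text: str) -> list[int]:
    """UTF-8 encode the text and shift every byte past the reserved special IDs."""
    return [b + SPECIAL_TOKENS for b in text.encode("utf-8")]

print(text_to_byte_ids("Byte-level"))                     # ten IDs, one per byte
print(text_to_byte_ids("Mispelled wrds"))                 # noisy text needs no vocabulary
```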
Perceiver and Perceiver IO
The Perceiver operates directly on the raw byte representation of the input. This enables the model to handle (more or less) any type of data, be it text, images, point clouds, audio, etc., and even combinations of modalities in one model. The model takes inspiration from the ByT5 paper in operating directly on the raw byte representation (UTF-8 for text) but extends the idea to multiple modalities.
The Perceiver also continues the trend of removing hardcoded assumptions about how to solve and represent the problem from the model architecture, instead letting the model learn those aspects itself. Perceiver IO is a continuation of the original Perceiver architecture, extending it to multiple tasks rather than just classification.
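The way the Perceiver keeps the self-attention cost manageable over long raw byte sequences is a small, learned latent array that cross-attends to the inputs, so the cost scales with latent size times input length rather than input length squared. Below is a bare-bones, single-head sketch with made-up sizes; the real model uses learned query/key/value projections, multiple heads, and a deep latent transformer.

```python
import numpy as np

def cross_attention(latents: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    """Latents (N, d) attend over inputs (M, d): cost O(N*M) instead of O(M*M)."""
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])       # (N, M)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax over inputs
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ inputs                                        # (N, d)

rng = np.random.default_rng(0)
byte_embeddings = rng.normal(size=(2048, 64))  # one embedding per input byte
latent_array = rng.normal(size=(256, 64))      # small, learned latent bottleneck
print(cross_attention(latent_array, byte_embeddings).shape)        # (256, 64)
```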
Charformer consists of two parts: a dynamic, fast, and flexible method to learn subword representations automatically from n-grams, and a model that incorporates it. By grouping n characters together (an n-gram) there is an increased opportunity to learn multiple representations of a word, some of which may be more advantageous. Instead of committing to a single-character subword representation, the model can select the most informative representation of a word by weighting multiple representations from the different n-grams. These are then downsampled in groups of 2 with mean pooling to get a shorter sequence.
This module is called Gradient-Based Subword Tokenization (GBST) and is the token-free module used by Charformer. Since all components in the module are pre-defined, except for how to weight/score each n-gram representation, it can run efficiently. And since the scoring is done with the Softmax function, it is also differentiable and learnable. This means the learned text representation can adapt dynamically to new vocabulary or languages.
Creating n-grams of characters shortens the text by a factor of n. For instance, the text “Successfully” split into 4-grams becomes “Succ”, “essf”, “ully”; that is 4 times shorter than the original 12 characters. Therefore, the n-gram blocks are mean pooled and each block embedding is repeated n times so the sequence regains its original length. The pooled and duplicated embedding for “Succ” would then be C4,1, for “essf” C4,2, and for “ully” C4,3. These candidates are scored via a learned weighting and mean pooled into a shorter representation. Since pooling removes the position of tokens, position embeddings are added to the tokens at each pooling step.
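To make the GBST mechanics concrete, here is a toy numpy sketch of the steps described above: block embeddings are mean pooled per n-gram size, repeated back to the original length, softly weighted with a softmax, and finally downsampled. The block sizes, the random scoring projection, and the omission of position embeddings are simplifications of the real, learned module.

```python
import numpy as np

rng = np.random.default_rng(0)

def gbst_sketch(char_emb, block_sizes=(1, 2, 3, 4), downsample_rate=2):
    """Toy Gradient-Based Subword Tokenization (GBST) over (length, dim) embeddings.

    Assumes the sequence length is divisible by every block size and the rate.
    """
    length, dim = char_emb.shape
    candidates = []
    for n in block_sizes:
        # Mean-pool consecutive blocks of n characters (e.g. "Succ", "essf", "ully"),
        # then repeat each block embedding n times so all candidates share the length.
        blocks = char_emb.reshape(length // n, n, dim).mean(axis=1)
        candidates.append(np.repeat(blocks, n, axis=0))          # (length, dim)
    candidates = np.stack(candidates, axis=1)                    # (length, n_sizes, dim)

    # Score each block-size candidate per position and softmax over the candidates;
    # the real model learns this scoring, here we use a fixed random projection.
    scores = candidates @ rng.normal(size=(dim,))                # (length, n_sizes)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    mixed = (weights[..., None] * candidates).sum(axis=1)        # (length, dim)

    # Finally downsample the weighted sequence with mean pooling (groups of 2).
    return mixed.reshape(length // downsample_rate, downsample_rate, dim).mean(axis=1)

chars = rng.normal(size=(12, 16))        # e.g. embeddings for the 12 characters of "Successfully"
print(gbst_sketch(chars).shape)          # (6, 16)
```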
Charformer performs on par with or outperforms the regular T5 on multiple English tasks, and outperforms both ByT5 and CANINE while being smaller, faster, and working with shorter sequences. Unlike CANINE, a model using GBST, such as Charformer, is interpretable in how its tokens are represented. As of this writing, Charformer is the state-of-the-art (SOTA) among token-free models. For those interested in learning more about the model, I highly recommend this short and pedagogical video.
06/ Delving further
Improving multilingual models for low-resource languages
Extending BERT-like Language Models for Multiple Languages and Longer Documents