Text embedding (beta)

The Text embedding block is used to turn text into a vector of real numbers. The name comes from the fact that this process creates an embedding space for the text.

The Text embedding block is closely connected with text encoding in the Datasets view. They are part of the same workflow when you want to do sentiment analysis.

A Text embedding block can only be used immediately after an Input block, where a text encoded feature must be selected. Make sure that the Language model you select matches the Language model selected when setting text encoding.

Use the Text embedding block for sentiment analysis

Use the Text embedding block when you do sentiment analysis, that is, when you want to determine the emotional state of a text. We define sentiment analysis as the same thing as text classification.

Example: Find out whether a tweet is positive or negative, or predict the rating a review gives.

How does text embedding work?

Text encoding transforms text into tokens. This process breaks a stream of text up into words using the selected Language model.

Example: The sentence "I like jazz" is transformed to the tokens "I", "like", and "jazz".

When you select a feature that should be text encoded in the Datasets view, this feature will be tokenized based on the selected Language model. This means that the language model translates the raw text string into a sequence of tokens, where each token represents a specific word.
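Conceptually, this translation is a lookup from each word to its index in the language model’s vocabulary. Here is a minimal Python sketch with a made-up vocabulary; the real token indices depend on the selected Language model:

    # Made-up vocabulary; real token indices depend on the selected Language model.
    vocabulary = {"I": 2, "like": 3, "jazz": 4}

    def encode(words):
        """Translate a list of words into a sequence of token indices."""
        return [vocabulary[word] for word in words]

    print(encode(["I", "like", "jazz"]))  # [2, 3, 4]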

Text tokenization

The tokenizer used by the Peltarion Platform simulates the europarl tokenizer. This means that:

  • Punctuation is tokenized. E.g. ‘.’ and ‘;’ are individual tokens that have their own embedding vectors.

  • Text is treated as case sensitive. E.g. ‘The’ and ‘the’ are not the same token.

  • When the sample text has fewer words than the set sequence length, the token vector is padded with index 0. Padding is done at the end of the token vector.

  • Out-of-vocabulary (OOV) words get index 1.

  • Numbers and special characters (like €, &, etc.) are also tokenized.

Example:

Tokenization of a text string
Figure 1. Each word gets a token. Note that ‘The’ and ‘the’ have different tokens and that ‘från’ doesn’t exist in the English dictionary and therefore gets token 1. The vector is padded with 0’s at the end since the sequence length is set to 25.
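The rules above can be sketched in a few lines of Python. The vocabulary, example sentence, and indices are made up for illustration; the platform’s actual tokenizer is not exposed as code:

    import re

    # Made-up vocabulary; index 0 is reserved for padding, index 1 for OOV words.
    vocabulary = {"The": 2, "man": 3, "walked": 4, "the": 5, "dog": 6, ".": 7}

    def tokenize(text, sequence_length=25):
        # Split off punctuation so '.' and ';' become their own tokens,
        # and keep the original casing ('The' and 'the' are different tokens).
        words = re.findall(r"\w+|[^\w\s]", text)
        # Look each word up in the vocabulary; out-of-vocabulary words get index 1.
        indices = [vocabulary.get(word, 1) for word in words]
        # Pad with index 0 at the end, up to the set sequence length.
        indices += [0] * (sequence_length - len(indices))
        return indices[:sequence_length]

    print(tokenize("The man walked the dog från."))
    # [2, 3, 4, 5, 6, 1, 7, 0, 0, ...]  ('från' is out of vocabulary)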

Tokens to vectors (word embeddings)

In the Text embedding block, the tokens are transformed into vectors called word embeddings, which are numerical representations of each word’s semantic meaning. That is, the word embedding inherits the meaning of the word. In a language, one topic can be expressed in many ways, and these synonyms will have similar word embeddings. Word embeddings capture hidden information about a language, such as word analogies and semantic meaning.

Example: Words like king and queen have related meanings, in this case that both words refer to royalty.

Word embedding
Figure 2. The word embedding for King shows that it means a royal male. The word embedding for Boy shows that it means a young male who probably isn’t royal.

The vectors created by the Text embedding block preserve these similarities, so words that regularly occur nearby in text will also be in close proximity in vector space. If two words or documents have similar word embeddings, they are semantically similar.

Plot of word embeddings
Figure 3. A plot of the word embeddings of King, Queen, Male, and Female, and how they are related.
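A common way to measure how similar two word embeddings are is the cosine similarity between their vectors. Here is a small sketch with made-up 3-dimensional vectors; real embeddings are learned and much higher-dimensional, e.g. 300 for fastText:

    import numpy as np

    # Made-up 3-dimensional embeddings with dimensions (royal, male, adult).
    embeddings = {
        "king":  np.array([0.9, 0.9, 0.8]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "boy":   np.array([0.1, 0.9, 0.2]),
    }

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # higher: both royal adults
    print(cosine_similarity(embeddings["king"], embeddings["boy"]))    # lower: only 'male' in common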

Transfer learning with prebuilt embeddings

If you select Prebuilt embedding as the Embedding type, you can select the pretrained fastText embedding. This is a kind of transfer learning where you transfer fastText’s learned embeddings to your model.

fastText
fastText is a library with word embeddings for many words in each language. It is built for efficient learning of word representations and sentence classification. The job of creating word embeddings has already been done; fastText provides the vectors for the words. When choosing fastText embeddings, we currently limit the word vector dictionary size to the 50 000 most frequent tokens.
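Outside the platform, the same kind of prebuilt fastText vectors can be loaded with, for example, the gensim library. The sketch below only illustrates what a prebuilt embedding is; the file path is a placeholder, and the platform handles all of this for you when you select Prebuilt embedding:

    from gensim.models import KeyedVectors

    # Load pretrained fastText vectors (text format from fasttext.cc).
    # The path is a placeholder; limit=50_000 mirrors the platform's
    # restriction to the 50 000 most frequent tokens.
    vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", limit=50_000)

    print(vectors["king"].shape)              # (300,) — fastText vectors are 300-dimensional
    print(vectors.most_similar("king", topn=3))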

Parameters

Language model: A list of available language models. These are pretrained word embeddings trained on Common Crawl and Wikipedia using fastText.
Make sure that the Language model you select matches the Language model selected when setting text encoding.

Embedding type: Select Randomly initialized if you want to train the embedding on your own data, or Prebuilt embedding to use word embeddings provided by, e.g., fastText.

Output dimension: The dimension of the embedding vectors. The output dimension for pretrained fastText embeddings is 300. Default: 64
(Available if Randomly initialized is selected as Embedding type.)

Embedding: Pretrained embeddings. Default: fastText
(Available if Prebuilt embedding is selected as Embedding type.)

Trainable: Whether the training algorithm should change the values of the weights during training. Check or clear the box to enable or disable this.
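As a rough illustration of how these parameters fit together, here is a sketch of an embedding layer in Keras. This is not the platform’s implementation; it only shows what Output dimension, a prebuilt weight matrix, and Trainable mean in practice:

    import numpy as np
    import tensorflow as tf

    VOCABULARY_SIZE = 50_000   # e.g. the 50 000 most frequent tokens
    OUTPUT_DIMENSION = 64      # default when Randomly initialized is selected

    # Randomly initialized embedding, trained together with the rest of the model.
    random_embedding = tf.keras.layers.Embedding(
        input_dim=VOCABULARY_SIZE,
        output_dim=OUTPUT_DIMENSION,
        trainable=True,
    )

    # Prebuilt embedding: the weights would come from e.g. fastText (300-dimensional).
    # A random matrix stands in for the real vectors to keep the sketch self-contained.
    pretrained_weights = np.random.rand(VOCABULARY_SIZE, 300).astype("float32")
    prebuilt_embedding = tf.keras.layers.Embedding(
        input_dim=VOCABULARY_SIZE,
        output_dim=300,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_weights),
        trainable=False,  # uncheck Trainable to keep the prebuilt vectors fixed
    )

    token_indices = tf.constant([[2, 3, 4, 0, 0]])  # a padded token sequence
    print(random_embedding(token_indices).shape)    # (1, 5, 64)
    print(prebuilt_embedding(token_indices).shape)  # (1, 5, 300)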