Text embedding (beta)
The Text embedding block is used to turn a text into a vector of real numbers. The name comes from the fact that this process creates an embedding space for the text.
The Text embedding block is closely connected with text encoding in the Datasets view. Both are part of the same workflow when you want to do sentiment analysis.
A Text embedding block can only be used immediately after an Input block, where a text-encoded feature must be selected. Make sure that the Language model you select matches the Language model selected when setting text encoding.
Use the Text embedding block for sentiment analysis
Use the Text embedding block when you do sentiment analysis, that is, when you want to determine the emotional state expressed in a text. We define sentiment analysis as the same thing as text classification.
Example: Find out if a tweet is positive or negative, or predict the rating that a review gives.
How does text embedding work
Text encoding transforms text into tokens. This process breaks a stream of text up into words using the selected Language model.
Example: The sentence "I like jazz" is transformed to the tokens "I", "like", and "jazz".
When you select a feature that should be text encoded in the Datasets view, this feature will be tokenized based on the selected Language model. This means that the language model translates the raw text string into a sequence of tokens, where each token represents a specific word.
The tokenizer used by the Peltarion Platform simulates the Europarl tokenizer. This means that:
Punctuation is tokenized. E.g. ‘.’ and ‘;’ are individual tokens that have their own embedding vectors.
Text is treated as case sensitive. E.g. The and the are not the same token.
When the sample text has fewer words than the set sequence length, the token vector is padded with index 0. Padding is done at the end of the token vector.
Out-of-vocabulary (OOV) words get the index 1.
Numbers and special characters (like €, & etc) are also tokenized.
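The rules above can be illustrated with a minimal Python sketch. This is not the tokenizer used by the Peltarion Platform. The vocabulary, the regular expression, and the function names tokenize and encode are hypothetical, and cutting off text that is longer than the sequence length is an assumption made only for this example.

```python
import re

PAD_INDEX = 0  # index used for padding, as described above
OOV_INDEX = 1  # index used for out-of-vocabulary words, as described above

# Hypothetical toy vocabulary. A real language model has its own, much larger one.
VOCAB = {"I": 2, "like": 3, "jazz": 4, ".": 5, "The": 6, "the": 7}


def tokenize(text):
    """Split text into word tokens and single-character punctuation/symbol tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def encode(text, sequence_length):
    """Map tokens to indices and pad at the end up to sequence_length."""
    indices = [VOCAB.get(token, OOV_INDEX) for token in tokenize(text)]
    indices = indices[:sequence_length]  # assumption: longer text is cut off
    return indices + [PAD_INDEX] * (sequence_length - len(indices))


print(tokenize("I like jazz."))       # ['I', 'like', 'jazz', '.']
print(encode("I like jazz.", 6))      # [2, 3, 4, 5, 0, 0]
print(encode("the The blues", 4))     # [7, 6, 1, 0]
```

Note that "the" and "The" get different indices, the unknown word "blues" gets index 1, and the padding zeros are added at the end.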
Tokens to vectors (word embeddings)
In the Text embedding block, the tokens are transformed into vectors, called word embeddings, that represent the semantic meaning of each word numerically. That is, the word embedding inherits the meaning of the word. In a language, one topic can also be expressed in many ways, and such synonyms will have similar word embeddings. Word embeddings capture hidden information about a language, like word analogies or semantic meaning.
Example: Words like king and queen have somewhat similar meanings, in this case because both words refer to royalty.
The vectors created by text embedding preserve these similarities, so words that regularly occur near each other in text will also be close to each other in vector space. If two words or documents have similar word embeddings, they are semantically similar.
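One common way to measure how close two word embeddings are is cosine similarity. The sketch below assumes that measure and uses made-up three-dimensional vectors. Real embeddings have many more dimensions, and their values are learned rather than written by hand.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "jazz": [0.1, 0.2, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
unrelated = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["jazz"])

# With these toy vectors, king is closer to queen than to jazz.
print(royal > unrelated)  # True
```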
Transfer learning with prebuilt embeddings
If you select Prebuilt embedding as the Embedding type you can select the pretrained fastText embedding. This is a kind of transfer learning where you transfer fastText’s learnt embeddings to your model.
fastText is a library with word embeddings for many words in each language. Use fastText for efficient learning of word representations and sentence classification. The job of creating word embeddings has already been done, so fastText already has the vectors for the words. When you choose fastText embeddings, we currently limit the size of the word vector dictionary to the 50 000 most frequent tokens.
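To show what limiting a dictionary to the most frequent tokens means, here is a small sketch. It is not how the platform or fastText builds its dictionary. The function name build_vocabulary, the toy corpus, and the choice to start numbering at 2 (leaving 0 for padding and 1 for out-of-vocabulary words) are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(tokens, max_size):
    """Keep only the max_size most frequent tokens.

    Indices start at 2 because this sketch reserves 0 for padding and
    1 for out-of-vocabulary words.
    """
    counts = Counter(tokens)
    most_common = [token for token, _ in counts.most_common(max_size)]
    return {token: index for index, token in enumerate(most_common, start=2)}


corpus = "the cat sat on the mat and the dog sat".split()
vocab = build_vocabulary(corpus, max_size=2)

print(vocab)                # {'the': 2, 'sat': 3}
print(vocab.get("cat", 1))  # 1, since "cat" did not make the cut
```

Any token outside the kept dictionary is treated as an out-of-vocabulary word, so a larger limit means fewer words collapse to the same index.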
Language model: A list of available language models. These are pretrained word embeddings trained on Common Crawl and Wikipedia using fastText.
Make sure that the Language model you select matches the Language model selected when setting text encoding.
Embedding type: Select Randomly initialized if you want to train the embeddings on your own data, or Prebuilt embedding if you want to use word embeddings provided by, e.g., fastText.
Output dimension: The output dimension for pretrained fastText embeddings is 300. Default: 64
(Available if Randomly initialized is selected as Embedding type.)
Embedding: Pretrained embeddings. Default: fastText
(Available if Prebuilt embedding is selected as Embedding type.)
Trainable: Whether the training algorithm should change the values of the weights during training. The box can be checked or unchecked.
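The parameters above can be pictured as a lookup table with one row of weights per token index. The sketch below is a simplified illustration, not the platform's implementation. The class EmbeddingTable, its methods, the initialization range, and the learning rate are all hypothetical.

```python
import random


class EmbeddingTable:
    """Toy embedding table: one weight vector of length output_dim per token index."""

    def __init__(self, vocab_size, output_dim=64, trainable=True, seed=0):
        rng = random.Random(seed)
        self.trainable = trainable
        # Randomly initialized weights (the range is an arbitrary choice).
        self.weights = [
            [rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(vocab_size)
        ]

    def lookup(self, indices):
        """Return the embedding vector for each token index."""
        return [self.weights[i] for i in indices]

    def apply_gradient(self, index, gradient, learning_rate=0.1):
        """Update one row, but only if the table is trainable."""
        if not self.trainable:
            return
        row = self.weights[index]
        self.weights[index] = [w - learning_rate * g for w, g in zip(row, gradient)]


table = EmbeddingTable(vocab_size=5, output_dim=64, trainable=False)
vectors = table.lookup([2, 3, 4, 0])
print(len(vectors), len(vectors[0]))  # 4 64

before = list(table.weights[2])
table.apply_gradient(2, [1.0] * 64)
print(table.weights[2] == before)  # True, because trainable is False
```

With trainable set to True, the same call to apply_gradient would change the row, which corresponds to checking the Trainable box.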