# Glossary

Peltarion’s AI glossary includes short descriptions of relevant terms. Learn the meaning of these AI definitions and you’ll be an AI professional.


### Accuracy

Accuracy is a performance metric that allows you to evaluate how good your model is. It’s used in classification models and is the ratio of:

$Accuracy = \frac{Number\;of\;correct\;predictions}{Total\;number\;of\;examples}$

or equivalently:

$Accuracy = \frac{True\;positive\;+\;True\;negative}{True\;positive\;+\;True\;negative\;+\;False\;positive\;+\;False\;negative}$

Note that accuracy is highly sensitive to class imbalances in your data.
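
As a minimal sketch of the class-imbalance caveat, the snippet below computes accuracy for a model that always predicts the majority class. The `accuracy` helper and the toy labels are illustrative assumptions, not part of the Peltarion Platform.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy imbalanced dataset: 90 negatives (0) and 10 positives (1).
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.9, although no positive example is ever found
```

The score looks good even though the model has learned nothing about the positive class, which is why accuracy should be read together with other metrics on imbalanced data.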

### Activation function

An activation function is a non-linear function which takes the weighted sum of all the inputs to a node and maps it to a value in a given range, for example 0 to 1 (sigmoid), 0 to ∞ (ReLU) or -1 to 1 (tanh).

Its non-linear nature is what allows neural networks to model any kind of function (it makes them universal function approximators).

Use an activation function in every node of a network. On the Peltarion Platform, all nodes are automatically assigned an activation function, which can be changed as one of the configurable parameters of the blocks that have nodes.
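
The three example functions and their output ranges can be sketched in plain Python. This is only an illustration; the function names are chosen here and do not refer to Platform settings.

```python
import math

def sigmoid(x):
    """Maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Maps negative inputs to 0 and leaves positive inputs unchanged."""
    return max(0.0, x)

def tanh(x):
    """Maps any real number into the open interval (-1, 1)."""
    return math.tanh(x)

# Apply each function to the same weighted sums to compare their ranges.
for weighted_sum in (-5.0, 0.0, 5.0):
    print(weighted_sum, sigmoid(weighted_sum), relu(weighted_sum), tanh(weighted_sum))
```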

### Adam

It is recommended to leave the parameters of this optimizer at their default values.

Adam is the optimizer of choice for training deep learning models.

Adam can be viewed as a combination of RMSprop and momentum.
RMSprop contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients.
Adam adds bias-correction and momentum to RMSprop. Momentum can be seen as a ball rolling down a slope, whereas Adam behaves like a heavy ball with friction, which thus prefers flat minima in the loss function landscape. The authors of the original paper empirically show that Adam works well in practice and compares favorably to other adaptive learning-rate methods.
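
The combination of the two decaying averages with bias correction can be sketched for a single parameter as follows. This is a simplified illustration on a toy loss, not the Platform's implementation. The function name, the toy loss and the step count are assumptions made for this example; the `beta1`, `beta2` and `eps` defaults follow the values suggested in the original Adam paper.

```python
import math

def adam_minimize(grad_fn, w, steps=2000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function with Adam-style updates."""
    m = 0.0  # decaying average of past gradients (momentum part)
    v = 0.0  # decaying average of past squared gradients (RMSprop part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_final)  # close to 3
```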

### Add

The Add block can take any number of inputs, all with the same shape, and returns a single tensor of the same shape, containing the element-wise sum over all inputs.

This is useful when building residual networks, where a layer’s input is added to its output with an Add block.

### AMSgrad

AMSgrad is a variant of Adam based on the algorithm from the paper "On the Convergence of Adam and Beyond".

### Area under the ROC Curve (AUC)

The AUC is the area under the ROC curve and is a performance measure that tells you how well your model can distinguish between classes. The higher the AUC, the better the model.

Figure 1. Area under curve

### Autoencoder

An autoencoder is a special kind of neural network that aims to copy its input to its output. What makes this a non-trivial task is that rather than just directly copying the input to the output, the network first tries to encode the input into a small number of parameters (the latent-space representation) and then tries to reconstruct the original data at the output from it.

The latent space can be thought of as an efficient re-coding of the input data. This re-coding of the input data can itself be interpreted in different ways, each one of which gives us insights into the applications of autoencoders:

• Data compression

• Dimensionality reduction / feature extraction - the smaller encoding can be thought of as the autoencoder having performed dimensionality reduction (using non-linear transformations!) on the original data.

• Data denoising - the latent space’s learned ability to distinguish the noise and discard it can be used to remove noise from the input data (e.g. removing noise from an image)

• Generative modeling - by forcing the latent-space representation to roughly follow a unit Gaussian distribution, it is possible to sample a new latent vector from the unit Gaussian and pass it to the decoder. This process effectively ‘generates’ a completely new data output, one that the network had never seen before.

### Backpropagation

Backpropagation is shorthand for “the backward propagation of errors” and is the main algorithm used to calculate the partial derivative of the error for each parameter in a neural network.

In other words, the algorithm computes the gradient over all parameters which is used by gradient descent to determine how to update the weights of the network, in order to achieve a lower loss. Backpropagation can be thought of as an implementation of the chain rule of derivatives for computation graphs.
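
To make the chain-rule view concrete, here is a small sketch for a single node with one weight and one bias and a squared-error loss. The model, the values and the function names are assumptions chosen for illustration. The analytic gradient is checked against a numerical estimate.

```python
def forward(w, b, x, target):
    """Loss of a one-node model: (w * x + b - target)^2."""
    y = w * x + b
    return (y - target) ** 2

def backward(w, b, x, target):
    """Gradient of the loss with respect to w and b, via the chain rule."""
    y = w * x + b
    dloss_dy = 2 * (y - target)   # derivative of the squared error
    dy_dw, dy_db = x, 1.0         # derivatives of y = w * x + b
    return dloss_dy * dy_dw, dloss_dy * dy_db

w, b, x, target = 0.5, 0.1, 2.0, 3.0
grad_w, grad_b = backward(w, b, x, target)

# Numerical check with a central finite difference.
h = 1e-6
numeric_w = (forward(w + h, b, x, target) - forward(w - h, b, x, target)) / (2 * h)
print(grad_w, numeric_w)  # both close to -7.6
```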

### Batch

A batch is a fixed number of examples used in one training iteration during the model training phase.

### Batch gradient descent

Batch gradient descent is an implementation of gradient descent which computes the real gradient of the loss function by taking into account all the training examples.

In practice, batch gradient descent is rarely used for deep learning applications, because calculating the real gradient from all the training examples 1) requires storing the entire training set in memory (which is often not feasible) and 2) is slow. Instead, methods that approximate the real gradient, like stochastic gradient descent or mini-batch (stochastic) gradient descent, are used.

### Batch normalization

Batch normalization standardizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.

This helps speed up the training of the network. It also helps the network learn to generalize, as it introduces a small, controlled amount of noise in the inputs of the subsequent layer. The effect of this noise on the network is similar to that of dropout.

Use batch normalization with convolutional neural networks after a convolutional layer (or after a convolutional + pooling layer).
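
The standardization step can be sketched for a batch of scalar activations as below. This is a simplified illustration: real batch normalization layers typically also learn a scale and a shift parameter and track running statistics for inference, all of which are left out here. The small `eps` constant is an assumption added to avoid division by zero.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Subtract the batch mean and divide by the batch standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(batch)
print(normalized)  # values centered on 0 with a standard deviation close to 1
```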

### Bias

Bias can have three meanings:

• Biased data (ethics) - data that inherently favors and/or is detrimental to things or (groups of) people. This bias can be introduced, intentionally or not, during the data creation, collection or usage stage.

• Bias model - see separate entry of the same name

• Bias term - in a mathematical formula, the bias term is its offset from the origin and is often referred to as $b$ or $w_0$ in deep learning terminology.

### Binary crossentropy

Binary crossentropy is a loss function used on problems involving yes/no (binary) decisions.

$L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N}(y_i * \log({\hat{y}}_i) + (1-y_i) * \log(1-{\hat{y}}_i))$

where ŷ is the predicted expected value and y is the observed value.

Binary crossentropy measures how far away from the true value (which is either 0 or 1) the prediction is for each of the classes and then averages these class-wise errors to obtain the final loss.

Use binary crossentropy on binary and multi-label classification problems.
Use binary crossentropy together with the sigmoid activation function.
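
A minimal sketch of the loss, with the logarithm applied to both the positive and the negative term, could look like this. The function name and the clipping constant `eps` (used to avoid taking the logarithm of 0) are assumptions for this example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean binary crossentropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
confident_and_right = binary_crossentropy(y_true, [0.9, 0.1, 0.8, 0.2])
confident_and_wrong = binary_crossentropy(y_true, [0.1, 0.9, 0.2, 0.8])
print(confident_and_right, confident_and_wrong)  # the second loss is much larger
```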

### Block

A block is the basic building unit in the Peltarion Platform. Blocks represent the basic components of a neural network and/or the actions that can be carried out on them.

### Categorical crossentropy

Categorical crossentropy is a loss function that is used for single label classification. This is when only one category is applicable for each data point. In other words, an example can belong to one class only.

$L(y, \hat{y}) = -\sum_{j=1}^{M}\sum_{i=1}^{N}(y_{ij} * \log({\hat{y}}_{ij}))$

where ŷ is the predicted expected value and y is the observed value.

Categorical crossentropy will compare the distribution of the predictions (the activations in the output layer, one for each class) with the true distribution, where the probability of the true class is set to 1 and 0 for the other classes. To put it in a different way, the true class is represented as a one-hot encoded vector, and the closer the model’s outputs are to that vector, the lower the loss.

Use categorical crossentropy in classification problems where only one result can be correct.
Use categorical crossentropy together with the softmax activation function.
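
The comparison between a softmax output and a one-hot target can be sketched for a single example as follows. The helper names, the example scores and the `eps` clipping constant are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the max improves numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """Crossentropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])  # the model favors the first class
loss_if_first_is_true = categorical_crossentropy([1, 0, 0], probs)
loss_if_last_is_true = categorical_crossentropy([0, 0, 1], probs)
print(loss_if_first_is_true, loss_if_last_is_true)  # the first loss is lower
```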

### Categorical feature

A categorical feature is an input variable that has a discrete set of possible values.

Example: If your variable is season, the possible values it can take are Winter, Spring, Summer and Autumn.

### Class

A class is a group to which a specific example can belong. For example, in the multi-class classification model we built in the 'Classifying images of clothes' tutorial, the classes are ‘Ankle boot’, ‘T-shirt’, ‘Dress’, ‘Pullover’, etc. In a classification model, a class is your target, i.e., what you want your model to predict. Also, classes appear in the dataset as a categorical feature.

### Class weighting

Class weighting is the inclusion of a coefficient in the loss function calculation, to improve single-label classification results on imbalanced datasets. This coefficient scales the error of each training example inversely to the frequency of its target category.

Similarly to oversampling and undersampling, class weighting prevents models from achieving artificially high accuracy by only learning which class is the most frequent, i.e., the most likely to be presented to the model.

For instance, on an imbalanced dataset containing 900 examples of class A and 100 examples of class B, a classification model that always predicts A would achieve a relatively low loss and a 90% accuracy. By scaling up the error of misclassified B examples, class weighting pushes models to learn a better representation of each class.
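
One common way to derive such coefficients is to weight each class inversely to its frequency. The sketch below uses that heuristic on the 900/100 split from the example; it is an assumption for illustration, and the exact weighting scheme used by the Platform may differ.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_examples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_examples, n_classes = len(labels), len(counts)
    return {cls: n_examples / (n_classes * count) for cls, count in counts.items()}

labels = ["A"] * 900 + ["B"] * 100
weights = inverse_frequency_weights(labels)
print(weights)  # class B gets a weight of 5.0, class A a weight below 1
```

With these weights, each class contributes the same total weight to the loss (900 × 0.556 ≈ 100 × 5.0), so always predicting A is no longer a cheap solution.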

### Classification

Classification is the process through which a trained model predicts (or assigns) one or several classes for one example. If the model is constrained to predict precisely one class for each example, it is a single-label classification model. If the model can assign each example to several classes, it is a multi-label classification model.

To train the model, a training set needs to have multiple labeled examples for each of the desired prediction classes.

#### Multi-label classification

Multi-label classification is a variant of classification where multiple classes (labels) may be assigned to each example.
Example: You want to know which objects, out of a number of possible objects, are present in an image, where there can be a dog, a person and a car in the same image.

#### Single-label classification

Single-label classification is a variant of classification where precisely one class (label) is assigned to each example.
Example: Images of skin lesions are classified as either benign or malignant, since a lesion cannot be both at the same time.

### Clustering

Clustering is the process through which an algorithm tries to group data points into different clusters based on some similarity between them.

Unlike classification, no model is trained beforehand to predict predefined classes (in other words, no training examples are provided). Instead, a clustering algorithm discovers by itself what relationships are found in the data and groups the individual data points based on this. This is an example of unsupervised learning.

### Concatenate

Concatenate takes any number of inputs that have the same shape (except possibly along the concatenation axis), and merges them along a specified axis. Along that axis, the output size is the sum of the input sizes; all other axes, if any, keep the same shape.

Note that the input order is important.
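
The effect of the axis and of the input order can be sketched with nested lists standing in for 2D tensors. The `concatenate` helper below is a hypothetical stand-in written for this example, not the Platform's block.

```python
def concatenate(arrays, axis):
    """Concatenate 2D nested lists along axis 0 (rows) or axis 1 (columns)."""
    if axis == 0:
        return [row for array in arrays for row in array]
    if axis == 1:
        n_rows = len(arrays[0])
        return [sum((array[i] for array in arrays), []) for i in range(n_rows)]
    raise ValueError("axis must be 0 or 1")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

print(concatenate([a, b], axis=0))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(concatenate([a, b], axis=1))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(concatenate([b, a], axis=1))  # [[5, 6, 1, 2], [7, 8, 3, 4]]
```

Swapping the inputs changes the result, which is why the order in which inputs are connected matters.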

Concatenating can be useful to merge features coming from different parts of the model towards the end (for instance merging image features with tabular data when estimating the price of a house), or joining together multiple images that are tiles of a bigger map.

Figure 2. Concatenate block with 5 inputs.

Read more about how you can use the Concatenate block on the Platform here.

### Confusion matrix

A confusion matrix helps to illustrate what kinds of errors a classification model is making.

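If you have a binary classifier model that distinguishes between a positive and a negative class, each prediction falls into one of 4 cases depending on the actual vs. predicted class. The resulting matrix has 4 fields known as:

• True positives (TP) - the actual class is positive and the predicted class is positive

• False positives (FP) - the actual class is negative and the predicted class is positive

• False negatives (FN) - the actual class is positive and the predicted class is negative

• True negatives (TN) - the actual class is negative and the predicted class is negative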

Different combinations of these fields result in a number of key metrics, including: accuracy, precision, recall, specificity and F1-score.

The confusion matrix is a compact but very informative representation of how your classification model is performing. It is the go-to tool for evaluating classification models.

The confusion matrix is used for binary or multiclass single-label classification problems.
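
For the binary case, the four fields can be counted directly from the actual and predicted classes. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the function name and the toy data are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four fields of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```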

### Convolution operation (convolve)

In the context of deep learning, a convolution operation is a mathematical operation between a matrix (which usually encodes an image) and another, smaller matrix called the filter. The filter slides over the matrix and, at each position, the overlapping elements are multiplied element-wise and summed to give one element of the output.¹ For a visualization of how a filter affects a matrix, we recommend this excellent demo.

For a detailed explanation of convolutions in the context of deep learning, see Christopher Olah’s excellent article or Dumoulin, V., Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285v2.

¹ Technically this is actually called cross-correlation, which is very closely related to convolution.
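
A minimal sketch of the operation as it is used in deep learning (that is, cross-correlation, without padding and with a stride of 1) is shown below. The function name and the example matrices are assumptions made for illustration.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    output = []
    for r in range(image_h - kernel_h + 1):
        row = []
        for c in range(image_w - kernel_w + 1):
            acc = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(cross_correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
```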

### Convolutional layer

A convolutional layer applies the convolution operation between its filters and its input. On the Peltarion Platform, you can find the 2 dimensional implementation of a convolutional layer under the name of '2D Convolution.'

This type of layer helps you take advantage of the spatial information in your data e.g., the relationships between adjacent pixels in an image (this reflects the intuition that features in an image are often not dependent on the position). By applying the same filter to all positions of an image, they can help reduce the number of parameters of your model compared to fully connected layers.

Use convolutional layers on data where the spatial relationships of the data are important, like images. In theory, you can also try using convolutional layers on sequential or time series data, but LSTM units are more tailored for this data.

### Convolutional neural network

A convolutional neural network is a type of neural network that makes use of convolutional, pooling and dense layers.

They are tailored to take advantage of the spatial information in your data e.g., the relationships between adjacent pixels in an image (this reflects the intuition that features in an image are often not dependent on the position). By convolving the input data with the filters of the convolutional layers and subsequently pooling their outputs with pooling layers, convolutional neural networks are able to reduce the number of parameters that your model needs to learn from the data, when compared to fully connected layers.

Use convolutional neural networks when you’re working with image data.

### Dataset

A dataset is a collection of data. A machine learning dataset will contain features that are used to predict the desired targets. Each entry in the dataset is an example that is used to either train, validate or test the model.

### Deconvolution

Deconvolution performs an opposite operation to the convolution operation. This is also referred to as transposed convolution, which better reflects the actual mathematical operation.

The deconvolution operation is used when we transform the output of a convolution back into a tensor that has the same shape as the input of that convolution, which is useful if we want a model that transforms images into other images, instead of just giving categorical or scalar predictions. Convolution is not itself an invertible operation, which means we cannot simply go from the output back to the input. Deconvolution layers instead have to learn weights in the same way as convolution layers.

Autoencoders and image segmentation models are examples of models that make use of deconvolutions.

### Deconvolutional layer

A deconvolutional layer applies the deconvolution operation (or, more correctly, the transposed convolution operation) between its filters and its input. On the Peltarion Platform, you can find the 2 dimensional implementation of a deconvolutional layer under the name of '2D Deconvolution.'

This type of layer helps you transform the output of a convolution back into a tensor that has the same shape as the input of that convolution, which is useful if we want a model that transforms images into other images, instead of just giving categorical or scalar predictions.

Autoencoders and image segmentation models are examples of models that make use of deconvolutional layers.

### Dense (or fully connected) layer

In a dense layer every node of the layer is connected to every node in the subsequent layer. Thus, it feeds all outputs from the previous layer to all its nodes, each of which provides one output to the subsequent layer. On the Peltarion Platform this layer is represented by the dense block.

A dense layer is the most basic layer of a neural network. Combining more than one of these layers results in the classical type of neural network, the multilayer perceptron (MLP). They are very flexible in nature and can in general learn almost any mapping from inputs to outputs.

Use dense layers to build complete networks that are used for classification or prediction problems on tabular data. You can also use dense layers as the last layer(s) after convolutional layers or recurrent neural networks.

### Dimensionality reduction

Dimensionality reduction is the process of converting your data into a low(er) dimensional representation, while retaining as much information as possible. In practice this implies identifying the features in your dataset that are the most important for the model to achieve the desired objective (a.k.a., the principal variables), and then either discarding or summarizing the remaining features in terms of the principal variables.

Dimensionality reduction is useful to "reduce the amount of unnecessary information" in your data, which can help speed up model training and in some cases help reduce overfitting.

Use dimensionality reduction on datasets that have a large number of features as part of your data preprocessing step. Dimensionality reduction is also used in various embedding techniques.

### Dropout

Dropout is a regularization technique which randomly zeros out (i.e. “drops out”) some of the node outputs in a given layer during each training iteration. The effects of this can be interpreted as training different neural networks during each training iteration. The final trained network is thus analogous to the ‘average’ of all the networks seen during training.
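
A minimal sketch of dropout applied to a layer's outputs during training is shown below. The rescaling of the surviving values by `1 / (1 - rate)`, often called inverted dropout, is an implementation choice assumed here so that the expected output stays the same; the function name and the seeded generator are also illustrative.

```python
import random

def dropout(activations, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    keep_probability = 1.0 - rate
    return [a / keep_probability if rng.random() < keep_probability else 0.0
            for a in activations]

rng = random.Random(0)  # seeded so the example is reproducible
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.5, rng=rng)

print(sum(1 for d in dropped if d == 0.0))  # roughly half of the values are zeroed
print(sum(dropped) / len(dropped))          # the mean stays close to 1.0
```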

### Embedding

Embeddings allow you to turn features into fixed-size vectors of real numbers. The key aspects of embeddings are that they:

• Map data to a vector of a given dimension

• Are trained so that numerical relationships between embeddings (e.g., Euclidean distance, cosine similarity) express meaningful relationships inside the data

Use an Embedding block to learn embeddings for categorical feature inputs.

Embeddings can be created for image and text data by using an appropriate processing model, like MobileNetV2 or USE Embedding, and extracting an intermediate output.
Such embeddings contain semantic information about the data content, and can be used for similarity search and data clustering.

### Epoch

An epoch represents a full pass over the entire training set, meaning that the model has seen each example once. An epoch thus consists of (total number of examples / batch size) training iterations.
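
As a small worked example, the number of training iterations in one epoch can be computed as follows. Rounding up for a final, smaller batch is an assumption made here; some training setups drop the incomplete batch instead.

```python
import math

def iterations_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)

print(iterations_per_epoch(1000, 32))  # 32 (31 full batches plus one smaller batch)
print(iterations_per_epoch(1000, 100))  # 10
```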

### Error

Error is the numeric value that represents how different the predicted output of your model is when compared to the expected output. It is the essential building block of a loss function.

### Example

An example is an entry in your dataset which holds values for each of the features of your dataset and possibly a label as well. Usually an example is a row in your dataset and is represented as a vector of features. An example specifies what the output layer should do, given the features. The examples in your dataset should be representative of the data you expect to have available when you deploy the model and which you will use to predict the desired output.

### Exploding gradient problem

The exploding gradient problem is the phenomenon of the gradients calculated during training progressively accumulating with each training iteration.

This means that the weights of the nodes are changed by exponentially growing amounts, making the network unstable and thus impossible to train. The gradient values can become so large that they can no longer be computed, at which point the model crashes.

### F1-score

In theory a good model (one that makes the right predictions) will be one that has both high precision, as well as high recall. In practice however, a model has to make compromises between both metrics. Thus, it can be hard to compare the performance between a model with high recall and low precision versus a model with low recall and high precision.

F1-score is a metric that summarizes both precision and recall in a single value, by calculating their harmonic mean, which allows it to be used to compare the performance across different models. It’s defined as:

$F_1 \text{score} = \frac{2 * (\text{Precision} * \text{Recall})}{\text{Precision} + \text{Recall}}$
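
The harmonic mean can be sketched from the confusion matrix counts as follows. The standard definitions of precision (TP / (TP + FP)) and recall (TP / (TP + FN)) are assumed, and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1-score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, f1)  # 0.8, 0.5 and roughly 0.615
```

The F1-score is lower than the arithmetic mean of the two metrics (0.65 here), because the harmonic mean is pulled towards the smaller value.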

### Fall-out

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

Fall-out is the proportion of examples of class 'B' that were predicted as class 'A', with respect to all examples of class 'B'. In other words, the higher the value of fall-out, the more examples of class 'B' will be misclassified. It’s defined as:

$\text{Fall-out} = \frac{\text{False positive}}{\text{False positive} + \text{True Negative}}$

### False Negatives (FN)

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

False negatives is a field in the confusion matrix which shows the cases when the actual class of the example was 'A' and the predicted class for the same example was 'B'.

### False Positives (FP)

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

False positives is a field in the confusion matrix which shows the cases when the actual class of the example was 'B' and the predicted class for the same example was 'A'.

See fall-out.

### Feature

A feature is an input variable. It can be numeric or categorical. For example, a house can have the following features: number of rooms (numeric), year built (numeric), neighbourhood (categorical), street name (categorical), etc. Multiple features are usually grouped in a Combined feature.

### Feature set

A feature set is a group of features bundled together. A feature set is used as input or output in a model. This is a Peltarion specific concept.

### Filter (convolution)

Filters are the main component of a convolutional layer. They are the 'windows' that slide over the input data to perform the convolution operation with the data coming into the convolutional layer.

Mathematically, the filters are nothing more than a set of weights. What values these weights should take is what the convolutional layer learns during training.

### Generalization

Generalization is the ability of a model to perform well on data / inputs it has never seen before. The goal of any model is to learn good generalization from the training examples, in order to later on perform well on examples it has never seen before. Generalization can thus be thought of as the notion of how well a model has characterized and encoded the signal found in the training examples.

A good indication that a model generalizes well is when the training loss is as small as possible, while at the same time keeping the gap between the training and validation loss as small as possible. The final confirmation that the model has generalized well comes from when the test loss is also low, meaning that the model performs well on never before seen examples.

### Glorot uniform initialization

Glorot uniform initialization (also known as Xavier initialization) is a technique used to initialize the weights of your model by assigning them values from the following uniform distribution:

$W \sim U \Bigg[- \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\Bigg]$

where $n_{in}$ is the number of input units and $n_{out}$ is the number of output units of the layer.

Glorot uniform initialization helps keep the variance of the inputs, outputs, activations and gradients the same across all the layers of a network, which helps prevent activations and gradient updates from vanishing or exploding in size.

Use Glorot uniform initialization with dense layers, convolution layers and LSTM units.
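
As an illustration, the sampling can be sketched for a dense layer with `n_in` inputs and `n_out` outputs. The function names, the layer sizes and the seeded generator are assumptions for this example.

```python
import math
import random

def glorot_limit(n_in, n_out):
    """Half-width of the uniform interval used by Glorot initialization."""
    return math.sqrt(6.0) / math.sqrt(n_in + n_out)

def glorot_uniform(n_in, n_out, rng):
    """Sample an n_in x n_out weight matrix from U[-limit, limit]."""
    limit = glorot_limit(n_in, n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

rng = random.Random(42)  # seeded so the example is reproducible
weights = glorot_uniform(n_in=64, n_out=32, rng=rng)
print(glorot_limit(64, 32))  # approximately 0.25
```

Larger layers get a smaller interval, which keeps the scale of each node's weighted sum roughly constant as layer sizes change.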

### Gradient descent

Gradient descent is the basic algorithm used to minimize the loss based on the training set. It’s an iterative process through which the parameters of your model are adjusted, thereby gradually finding the best combination to minimize the loss. It does this by computing the gradient (or the ‘slope’) of the loss function and then ‘descending’ down it (or taking a step down the ‘slope’) towards a lower loss value.
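
The iterative process can be sketched for a single parameter as follows. The toy loss, the step count and the function name are assumptions chosen for illustration.

```python
def gradient_descent(grad_fn, w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= learning_rate * grad_fn(w)
    return w

# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
print(w_min)  # very close to 3
```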

### Gradient norm

Gradient norm indicates how much your model’s weights are adjusted during training. If it is high, the weights are being adjusted a lot. If it is low, the model might have reached a local optimum.

### Hidden layer

A hidden layer is any layer in a neural network between the input layer and the output layer.

### Hyperparameter

A hyperparameter is any parameter of a model or training process that has to be set / fixed before starting the training process, i.e., it is not automatically adjusted during training. Examples of hyperparameters include: drop rate (dropout), batch size, learning rate, number of layers, number of filters, etc.

### Imbalanced dataset

An imbalanced dataset is a dataset that contains a very different number of examples for each of its classes.
Training on such datasets generally leads to biased models, since each class affects the loss proportionally to its frequency.

A common way to improve results on infrequent classes is to create a new dataset and balance it by oversampling or undersampling examples.
Class weighting can also be used if the imbalance occurs in a categorical feature which is the model’s target; for instance, when learning to identify normal and anomalous classes from a dataset having many more normal examples than abnormal ones.

### Input

Input is the series of examples fed to a layer.

### Input layer

The input layer is the first layer of a neural network and is the one which takes the individual examples of your training set as input to the model. On the Peltarion Platform this layer is represented by the Input block. You can think of the Input block as a 'placeholder' for the data that is going to be fed into the model.

All your models will need an input layer (i.e., Input block).

### Kernel (convolution)

See filter (convolution)

### Label

A label is a class (in classification) or a target (in regression) assigned to examples in a dataset. When training a model, labels are how you say what you want it to predict. For example, in image classification: "See an apple, say 'apple'"

### Learning

Learning, in the context of deep learning, is the process of automatically setting (i.e. learning) a model for a specific dataset through the use of statistical and mathematical optimization techniques.

### Learning rate

The learning rate is a hyperparameter of gradient descent. It’s a scalar which controls the size of the update steps along the gradient.

Choosing the right learning rate is crucial for optimal gradient descent and thus, for optimal training. Too small and gradient descent will only take small steps in each iteration, meaning model training will be slow. Too big and gradient descent will take too large of a step, potentially ending up 'bouncing around' the loss surface, making training unstable.
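
The three regimes can be illustrated on a toy loss. The loss function, the specific learning rates and the step count below are assumptions picked to make the effect visible, not recommended values.

```python
def final_error(learning_rate, steps=20, w=0.0):
    """Run gradient descent on f(w) = (w - 3)^2 and return |w - 3| at the end."""
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)
    return abs(w - 3)

too_small = final_error(0.01)   # small steps: still far from the minimum
reasonable = final_error(0.4)   # converges quickly
too_big = final_error(1.1)      # every step overshoots and the error grows

print(too_small, reasonable, too_big)
```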

### Linear activation function

The linear activation function is a straight line function where the node’s activation is proportional to the input.

Use in the last layer of a regression model, if the output variable does not have known upper and lower bounds.

Range: -∞ to +∞

$f(x) = c * x$

### Loss

Loss is a measure of how well your algorithm models your dataset. It’s a numeric value which is computed by the loss function. The lower the loss, the better the performance of your model.

### Loss function

A loss function is a function that determines how errors are penalized. The goal of the loss function is to capture in a single number the total amount of errors across all training examples.

### LSTM unit

A long short-term memory (LSTM) unit is a special kind of recurrent neural network building block that has a built-in ability to 'remember' or 'forget' parts of sequential data. This ability allows an RNN using LSTM units to learn very long range connections in sequential data, by keeping relevant information 'stored' in the unit.

### Mean absolute error (MAE)

Mean absolute error (MAE) is a loss function used for regression. The loss is the mean over seen data of the absolute differences between true and predicted values, or writing it as a formula:

$L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N}|y_i - {\hat{y}}_i|$

where ŷ is the predicted expected value and y is the observed value.

MAE is not sensitive towards outliers and, given several examples with the same input feature values, the optimal prediction will be their median target value. This should be compared with mean squared error, where the optimal prediction is the mean. A disadvantage of MAE is that the gradient magnitude does not depend on the error size, only on the sign of $y_i - \hat{y}_i$. As a result, the gradient magnitude will be large even when the error is small, which in turn can lead to convergence problems.

Use mean absolute error when you are doing regression and don’t want outliers to play a big role. It can also be useful if you know that your distribution is multimodal, and it’s desirable to have predictions at one of the modes, rather than at the mean of them. MAE can also be used as a performance metric, since it’s easy to interpret.
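
The median-versus-mean behavior can be checked on a handful of target values that share the same input and include one outlier. The helpers and the data below are assumptions made for illustration; each helper scores a single constant prediction against all targets.

```python
from statistics import mean, median

def mae(targets, prediction):
    """Mean absolute error of one constant prediction."""
    return sum(abs(t - prediction) for t in targets) / len(targets)

def mse(targets, prediction):
    """Mean squared error of one constant prediction."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
med, avg = median(targets), mean(targets)  # 3.0 and 22.0

print(mae(targets, med), mae(targets, avg))  # MAE is lower at the median
print(mse(targets, med), mse(targets, avg))  # MSE is lower at the mean
```

The outlier drags the MSE-optimal prediction up to 22.0, while the MAE-optimal prediction stays at 3.0 with the bulk of the data.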

### Mean squared error (MSE)

Mean squared error (MSE) is the most commonly used loss function for regression. The loss is the mean over seen data of the squared differences between true and predicted values, or writing it as a formula:

$L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N}(y_i - {\hat{y}}_i)^2$

where ŷ is the predicted expected value and y is the observed value.

Minimizing MSE is equivalent to maximizing the likelihood of the data under the assumption that the target comes from a normal distribution, conditioned on the input.

MSE is sensitive towards outliers and, given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with mean absolute error, where the optimal prediction is the median. MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it’s important to penalize outliers heavily.

Use MSE when doing regression, believing that your target, conditioned on the input, is normally distributed, and want large errors to be significantly (quadratically) more penalized than small ones.

### Mean squared logarithmic error (MSLE)

Mean squared logarithmic error (MSLE) is, as the name suggests, a variation of the Mean Squared Error. The loss is the mean over the seen data of the squared differences between the log-transformed true and predicted values, or writing it as a formula:

$L(y, \hat{y}) = \frac{1}{N} \sum_{i=0}^{N}(\log(y_i + 1) - \log({\hat{y}}_i + 1))^2$

where ŷ is the predicted expected value and y is the observed value.

This loss can be interpreted as a measure of the ratio between the true and predicted values, since:

$\log(y_i + 1) - \log({\hat{y}}_i + 1) = \log\Bigg(\frac{y_i + 1}{{\hat{y}}_i + 1}\Bigg)$

The introduction of the logarithm makes MSLE care only about the relative difference between the true and the predicted value, or in other words, about the percentage difference between them. This means that MSLE treats small differences between small true and predicted values approximately the same as big differences between large true and predicted values. MSLE also penalizes underestimates more than overestimates, introducing an asymmetry in the error curve.

Use MSLE when you are doing regression, believe that your target, conditioned on the input, is normally distributed, and don’t want large errors to be penalized significantly more than small ones, in cases where the range of the target value is large.
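Both properties, the focus on relative error and the asymmetry, can be seen in a short sketch. The function name `msle` and the example numbers are assumptions chosen for illustration.

```python
import math


def msle(y_true, y_pred):
    """Mean squared logarithmic error over paired lists of non-negative values."""
    terms = [(math.log(t + 1) - math.log(p + 1)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


# Same ratio (y + 1) / (y_hat + 1), very different absolute errors.
small = msle([9.0], [10.0])
large = msle([9999.0], [10999.0])

# Same absolute error of 50, once as an underestimate and once as an overestimate.
under = msle([100.0], [50.0])
over = msle([100.0], [150.0])

print(small, large)
print(under, over)
```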

### Min-Max Normalization

Min-max normalization performs a linear rescaling of your data. It transforms the lowest value of an input feature to 0 and the highest value to 1. Every value in between the min and max values is transformed to a decimal between 0 and 1.

$z_i = \frac{x_i - min(x)}{max(x) - min(x)}$

Applied across your Combined feature, it will guarantee that all values are between 0 and 1, which makes it easier for a network to train since all the values are within the same range. It’s also a computationally efficient way to normalize your features when they have a bounded range of values and when they don’t have large outliers.

Use min-max normalization to normalize your data when it’s bounded (e.g., pixel values in an image) and when it doesn’t contain any large outliers. (If your dataset has large outliers use standardization instead).
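A minimal sketch of the formula in plain Python follows. The function name and the handling of a constant feature are assumptions, not something this glossary specifies.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


pixels = min_max_normalize([0, 64, 128, 255])
print(pixels)

# A single large outlier squeezes the remaining values towards 0.
squeezed = min_max_normalize([1, 2, 3, 1000])
print(squeezed)
```

The second call illustrates why standardization is suggested for data with large outliers.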

### Mini-batch

A batch can be subdivided into smaller mini-batches.

Mini-batches are used as a way to speed up gradient descent through mini-batch gradient descent.

Mini-batch gradient descent is an implementation of gradient descent that approximates the real gradient of the loss function. The real gradient is computed by taking all the training examples into account. The approximated gradient is instead calculated from the training examples in one mini-batch at a time, iterating until all mini-batches have been processed.

Mini-batch gradient descent tries to take the best of batch gradient descent and stochastic gradient descent. It’s much faster than batch gradient descent, but avoids the extreme fluctuations in the parameter updates of SGD (i.e., it’s more stable than SGD), thus making model training faster without any significant downsides.

Mini-batch gradient descent is the algorithm of choice for training deep learning models.

Note: The term SGD is usually used to describe both the single-example and the mini-batch method, since the only difference between them is the batch size.
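As a rough illustration, the following sketch fits a one-parameter model y ≈ w · x with mini-batch gradient descent on the MSE loss. The function name, the toy data and all default values are assumptions chosen so that the example converges quickly.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=4, learning_rate=0.01, epochs=50, seed=0):
    """Fit y ≈ w * x with mini-batch gradient descent on the MSE loss."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # form new mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mini-batch MSE with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad  # one training iteration
    return w


xs = [float(x) for x in range(1, 9)]
ys = [3.0 * x for x in xs]
w = minibatch_gradient_descent(xs, ys)
print(w)
```

Setting `batch_size=1` gives the single-example variant, and `batch_size=len(xs)` gives batch gradient descent.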

### Model

Model can have two meanings:

• It’s the combination of your neural network architecture and your specific hyperparameter settings. On the Peltarion Platform, a model is a sequence of blocks that have been strung together to achieve the desired model architecture.

• It’s the (mathematical) representation that the neural network has learned after having been trained on the training set.

### Momentum

Momentum compares the gradient of the previous iteration with the gradient of the current iteration, and then takes bigger steps along the dimensions for which the gradients point in the same direction and smaller steps along the dimensions for which they don’t. In other words, SGD 'gains momentum' in those directions where the gradient (or the 'slope') points in the same direction in subsequent iterations. In the analogy of gradient descent as a ball rolling down a hill, momentum corresponds to adding 'inertia' to the ball, or similarly to using a heavier ball to go down the same hill.

Momentum helps dampen the fluctuations of SGD and helps it accelerate towards the relevant gradient direction.
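One common way to write momentum is the classical (heavy-ball) update, sketched below for a one-dimensional function. The function name and default values are assumptions for illustration, and a specific framework may formulate the update slightly differently.

```python
def sgd_momentum(grad_fn, start, learning_rate=0.1, momentum=0.9, steps=200):
    """Minimize a 1-D function, given its gradient, using classical momentum."""
    position, velocity = start, 0.0
    for _ in range(steps):
        # The velocity is an exponentially decaying sum of past gradients, so it
        # grows while successive gradients agree and shrinks when they disagree.
        velocity = momentum * velocity + learning_rate * grad_fn(position)
        position -= velocity
    return position


# f(x) = x ** 2 has gradient 2 * x and its minimum at x = 0.
momentum_result = sgd_momentum(lambda x: 2 * x, start=5.0)
print(momentum_result)
```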

### Multilayer Perceptron

A multilayer perceptron (also known as a feedforward neural network) is a type of neural network that only makes use of dense layers. It is the classical type of neural network.

MLPs are very flexible in nature and work well on a wide range of prediction or classification problems that involve tabular data.

Use MLPs when you’re working with tabular data.

### Natural language processing (NLP)

Natural language processing (NLP) techniques aim to automatically process, analyze and manipulate large amounts of language data like speech and text.

### Nesterov accelerated gradient (NAG)

Nesterov accelerated gradient is a modification of momentum which introduces an additional term that 'looks ahead' at the upcoming iteration and approximates the expected parameter update. It uses this approximation to tune the amount of 'momentum' that momentum imparts on SGD. If the parameters of the upcoming iteration show that the gradient (or 'slope') is increasing, NAG will reduce the amount of 'momentum' in anticipation.

NAG helps dampen the fluctuations of SGD and helps it accelerate towards the relevant gradient direction. Its anticipatory nature results in increased performance of SGD on complex loss surfaces.
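The 'look ahead' step can be sketched as follows, using one common formulation in which the gradient is evaluated at the position the current velocity would lead to. The function name and default values are assumptions for illustration.

```python
def nesterov(grad_fn, start, learning_rate=0.1, momentum=0.9, steps=200):
    """Minimize a 1-D function, given its gradient, using Nesterov momentum."""
    position, velocity = start, 0.0
    for _ in range(steps):
        # Evaluate the gradient where the accumulated velocity is about to take us.
        lookahead = position - momentum * velocity
        velocity = momentum * velocity + learning_rate * grad_fn(lookahead)
        position -= velocity
    return position


# f(x) = x ** 2 has gradient 2 * x and its minimum at x = 0.
nag_result = nesterov(lambda x: 2 * x, start=5.0)
print(nag_result)
```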

### Neural network

A neural network is composed of a large number of processing units called nodes that are highly interconnected with each other and arranged in structures called layers. These networks of nodes process information in a way that lets them solve problems such as pattern recognition and data classification through a learning process.

See Node

### Node

A node is the basic computation unit of a neural network. It takes the weighted sum of all of its inputs and feeds the result to an activation function to produce a single output.
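In code, a single node could look like the sketch below. The bias term and the default choice of tanh are assumptions added for illustration; the definition above only mentions the weighted sum and the activation function.

```python
import math


def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs, passed through an activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


print(node_output([1.0, 2.0], [0.5, -0.25]))
print(node_output([1.0, 2.0], [1.0, 1.0], activation=lambda z: max(z, 0.0)))
```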

### Noise

Noise is anything that is not the signal. Thus, noise in this sense doesn’t refer to the everyday notion of noise, like "noise in a photo caused due to poor lighting conditions". Instead it’s the abstract notion of any information contained in the data that is not relevant to modelling the relationships between the input data and the target one wishes to learn.

### Normalization (concept)

Normalization is the process of ‘resizing’ values (e.g., the outputs of a layer) from their actual numeric range into a standard range of values.

This process makes all the values across features be more consistent with each other, which can be interpreted as making all the values across features of equal importance. This helps speed up the training of the network.

You should always normalize your data before you start training your network.

### One-hot encoding

One-hot encoding is a method which allows you to convert a categorical feature into a binary vector. For example, if your variable is ‘apple_color’ and the possible values it can take are ‘Red’, ‘Yellow’ and ‘Green’, each value can be encoded as a vector with one position per possible value, containing a 1 in the position of that value and 0 everywhere else (e.g., ‘Red’ as 1 0 0, ‘Yellow’ as 0 1 0 and ‘Green’ as 0 0 1).

One-hot encoding is a simple form of embedding. Categorical encoding in the Datasets view is the same thing as one-hot encoding.

Many deep / machine learning algorithms require their input and output data to be numeric. By transforming a categorical feature into a numeric value using one-hot encoding, it can be used by a deep / machine learning algorithm, either as a feature or target value, to train a model.

Use categorical encoding with categorical features in your dataset (e.g., labels) that have a relatively small number of categories.
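A minimal sketch of the encoding for the ‘apple_color’ example is shown here. The function name and the order of the categories are assumptions; any fixed order works as long as it is used consistently.

```python
def one_hot_encode(values, categories):
    """Encode each value as a binary vector with a single 1."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(vector)
    return encoded


categories = ["Red", "Yellow", "Green"]
encoded = one_hot_encode(["Green", "Red"], categories)
print(encoded)  # [[0, 0, 1], [1, 0, 0]]
```

The vector length grows with the number of categories, which is why the method suits features with relatively few categories.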

### Optimizer

An optimizer is a specific implementation of gradient descent which improves the performance of the basic gradient descent algorithm.

Optimizers aim to mitigate some of the challenges that are characteristic of gradient descent, such as convergence to suboptimal local minima and the choice of the starting learning rate and its decay.

In practice, using a gradient descent optimizer is the go-to choice for training deep learning models.

### Output layer

The output layer is the last layer of a neural network and is the one which returns the results of your model (e.g., the class in a classification model or a value in a prediction model). On the Peltarion Platform this layer is represented by the output block.

All your models will need an output layer (i.e., output block).

### Overfitting

Overfitting is the phenomenon of a model not performing well, i.e., not making good predictions, because it captured the noise as well as the signal in the training set. In other words, the model is generalizing too little and instead of just characterizing and encoding the signal it’s encoding too much of the noise found in the training set as well. (Another way to think about this is that the model is trying to fit 'too much' to the training data).

This means that the model performs well when it’s shown a training example (resulting in a low training loss), but badly when it’s shown a new example it hasn’t seen before (resulting in a high validation loss).

### Oversampling

Oversampling is the process of balancing a dataset by reusing examples of the underrepresented classes so that every class of the dataset has an equal amount of examples.

A balanced dataset allows a model to learn equal amounts of characteristics from each of the classes represented in the dataset, as opposed to one class dominating what the model learns.

Use when you have an imbalanced dataset. Oversampling is usually preferred to undersampling as data is rarely overabundant.
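One simple way to do this is random oversampling, sketched below: examples of the smaller classes are drawn again at random until every class is as large as the largest one. The function name and the toy data are assumptions, and other oversampling strategies exist.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate random examples of smaller classes until all classes match."""
    rng = random.Random(seed)
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)
    target = max(len(group) for group in by_class.values())
    out_examples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for example in group + extra:
            out_examples.append(example)
            out_labels.append(label)
    return out_examples, out_labels


examples = list(range(8))
labels = ["a"] * 6 + ["b"] * 2
balanced_examples, balanced_labels = random_oversample(examples, labels)
print(Counter(balanced_labels))
```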

### Padding

Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size.

Convolutional layers return by default a smaller image than the input. If a lot of convolutional layers are strung together, the output image is progressively reduced in size until, eventually, it might become unusable. By padding an image (i.e., "increasing" its size) before a convolutional layer, this effect can be mitigated. The relationship between padding and the output of a convolutional layer is given by:

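$o = \bigg(\frac{i + 2p - k}{s}\bigg) + 1$

where o is the output size, i is the input size, p is the padding, k is the kernel size and s is the stride.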
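The formula can be evaluated with a small helper. The function and parameter names are assumptions, and the integer division reflects the usual convention of rounding down when the stride does not divide evenly, which the formula as written does not state.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one dimension of a convolutional layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(28, 3))             # 26: the image shrinks
print(conv_output_size(28, 3, padding=1))  # 28: padding preserves the size
```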

### Parameter

A parameter is any internal variable of your model that is automatically adjusted during training in order to minimize the loss function. An example of parameters are each of the weights in a neural network.

### Poisson loss

The Poisson loss is a loss function used for regression when modelling count data. The loss takes the form of:

$L(y, \hat{y}) = \frac{1}{N} \sum_{i=0}^{N}({\hat{y}}_i - y_i \log {\hat{y}}_i)$

where ŷ is the predicted expected value, and y is the observed value.

Minimizing the Poisson loss is equivalent to maximizing the likelihood of the data under the assumption that the target comes from a Poisson distribution, conditioned on the input.

The Poisson loss is specifically tailored for data that follows the Poisson distribution. Examples of this are the number of customers that will enter a store on a given day, the number of emails that will arrive within the next hour, or the number of customers that will churn next week.

Use the Poisson loss when you believe that the target value comes from a Poisson distribution and want to model the rate parameter conditioned on some input.
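As a small sanity check of the loss, the sketch below searches for the constant predicted rate that minimizes it on a handful of made-up counts. The function name, the data and the search grid are assumptions for illustration.

```python
import math


def poisson_loss(y_true, y_pred):
    """Mean of (prediction - target * log(prediction)); predictions must be > 0."""
    terms = [p - t * math.log(p) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


counts = [2.0, 2.0, 5.0]  # observed counts, mean 3.0

# Candidate constant rates 0.1, 0.2, ..., 10.0.
candidates = [c / 10 for c in range(1, 101)]
best_rate = min(candidates, key=lambda r: poisson_loss(counts, [r] * len(counts)))
print(best_rate)
```

The best constant rate coincides with the mean of the observed counts, which is the maximum-likelihood estimate of a Poisson rate.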

### Pooling

Pooling is the process of summarizing or aggregating sections of a given data sample (usually the matrix resulting from a convolution operation) into a single number. This is usually done by either taking the maximum or the average value of said sections.

This operation helps you reduce the number of parameters of your model and introduces translational invariance in the features extracted by your model (i.e., your model will be less sensitive to small translations of the input data).
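A minimal sketch of max pooling over non-overlapping 2×2 sections of a matrix follows. The function name and the example values are made up, and real pooling layers also support other window sizes and strides.

```python
def max_pool_2x2(matrix):
    """Summarize each non-overlapping 2x2 section by its maximum value."""
    pooled = []
    for r in range(0, len(matrix) - 1, 2):
        row = []
        for c in range(0, len(matrix[0]) - 1, 2):
            window = [
                matrix[r][c], matrix[r][c + 1],
                matrix[r + 1][c], matrix[r + 1][c + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

Replacing `max(window)` with `sum(window) / 4` would give average pooling instead.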

### Pooling layer

A pooling layer applies the pooling operation to its input (usually the output of a convolutional layer). On the Peltarion Platform, you can find three kinds of pooling layers: '2D Max pooling', '2D Average pooling' and 'Global Average pooling'.

This type of layer helps you reduce the number of parameters of your model and introduces translational invariance in the features extracted by your model (i.e., your model will be less sensitive to small translations of the input data).

Use after a convolutional layer.

See precision.

### Precision

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

This metric is the proportion of examples of class 'A' that are correctly predicted as class 'A', with respect to all examples predicted as class 'A'. In other words, the higher the precision, the fewer examples of class 'B' will be wrongly predicted as class 'A'. It’s defined as:

$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$

### Random uniform initialization

Random uniform initialization is a technique used to initialize the weights of your model by assigning them values from a uniform distribution with zero mean and unit variance.

Random uniform initialization helps generate values for your weights that are simple to understand intuitively.

This used to be the standard procedure to initialize weights, but it has been superseded by other weight initialization techniques (like Glorot uniform initialization). Use random uniform initialization with dense layers, convolutional layers and LSTM units, but be aware that it may cause your network not to train as effectively as with more modern choices.

### Recall

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

This metric is the proportion of examples of class 'A' that are correctly predicted as class 'A', with respect to all examples of class 'A'. In other words, the higher recall, the fewer examples of class 'A' we will miss. It’s defined as:

$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$

### Recurrent neural network (RNN)

A recurrent neural network is a type of neural network that makes use of LSTM units and dense layers.

They are tailored to take advantage of sequential information e.g., the relationships between words in a sentence or notes in music.

Use recurrent neural networks when you’re working with sequential data.

### Regression predictions

Regression prediction is the process through which a trained model predicts a value or the probability of a target.

To train the model, a training set needs to have multiple labeled examples for the desired prediction target.

### Regularization

Regularization discourages a model from becoming too complex, which in turn prevents the model from overfitting the training set.

Regularization artificially constrains and ultimately reduces the absolute value of the parameters of a model, by affecting the range of the values each can take and the total number of active (i.e., non-zero) parameters.

### ReLU (rectified linear unit) activation function

The ReLU (rectified linear unit) is a non-linear function that gives the same output as input if the input is above 0, otherwise the output will be 0.

It is cheap to compute and works well for many applications. It also helps prevent the vanishing gradients problem.

It is the go-to activation function for many neural networks.

$f(x) = \max(x, 0)$

Range: 0 to +∞

### Reshape

Reshape takes a data structure of any shape as input and changes its shape to a user-defined one. The only parameter is the desired shape, expressed as a list of comma-separated integers, one for each of the dimensions of the new shape. The product of all dimensions in the list must equal the product of all dimensions in the input.

Reshape is used to transform the shape of your data structure, into the one expected by parts of your model, as these sometimes do not match.
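The constraint on the dimensions is easy to check in code. The helper name `can_reshape` is hypothetical and only illustrates the rule that both shapes must contain the same number of elements.

```python
from math import prod


def can_reshape(input_shape, new_shape):
    """A reshape is valid only if both shapes hold the same number of elements."""
    return prod(input_shape) == prod(new_shape)


print(can_reshape((28, 28), (784,)))      # flatten an image into a vector
print(can_reshape((28, 28, 1), (28, 28)))  # drop a dimension of size 1
print(can_reshape((2, 3), (4, 2)))         # 6 elements cannot fill 8 slots
```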

### RMSprop

RMSprop is an extension of Adagrad that deals with Adagrad’s radically diminishing learning rates. RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients. Hinton suggests setting the decay rate of this average to 0.9. A good default value for the learning rate is 0.001. This optimizer is usually a good choice for recurrent neural networks (RNN).
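A minimal one-dimensional sketch of the update is given below. The function and parameter names are assumptions, and `epsilon` here is only a small constant that avoids division by zero. The defaults follow the values mentioned above: a decay rate of 0.9 and a learning rate of 0.001.

```python
import math


def rmsprop(grad_fn, start, learning_rate=0.001, decay=0.9, epsilon=1e-8, steps=1000):
    """Minimize a 1-D function, given its gradient, with an RMSprop-style update."""
    position, avg_sq_grad = start, 0.0
    for _ in range(steps):
        grad = grad_fn(position)
        # Exponentially decaying average of squared gradients.
        avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
        # The effective step is the learning rate divided by the root of that average.
        position -= learning_rate * grad / (math.sqrt(avg_sq_grad) + epsilon)
    return position


# f(x) = x ** 2 has gradient 2 * x and its minimum at x = 0.
minimum = rmsprop(lambda x: 2 * x, start=0.1)
print(minimum)
```

Because the step is normalized by the recent gradient magnitude, the result ends up within a few learning rates of the minimum rather than exactly on it.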

### Semantic image segmentation

Semantic image segmentation is a deep learning technique that assigns a label to every pixel of an image with the goal of associating each pixel with a class. It’s semantic because it tries to classify each pixel based on its relationships to other pixels (e.g., neighboring pixels or pixels of similar color are likely to belong to the same class).

### Sequence length

Sequence length determines how many tokens are included from a text feature in each sequence. The sequence length can be between 3 and 512 tokens.

The unit of length is the token, whose precise definition depends on the language model selected in the text feature encoding.

Input that is shorter than the sequence length is padded. Text features longer than the sequence length are cut off. This may discard significant information if the sequence length is much smaller than the average text feature.

The larger the sequence length, the bigger the example size – which may result in slower training time, or your model exceeding hardware memory limits.

### Sigmoid activation function

The sigmoid activation function generates a smooth non-linear curve that maps the incoming values between 0 and 1.

The sigmoid function works well for a classifier model but it has problems with vanishing gradients for high input values, that is, y changes very slowly for high values of x. Unlike the softmax activation function, the sum of all the outputs doesn’t have to be 1 when sigmoid is used as an activation function in the output layer. This means that each output node with a sigmoid activation function acts independently on each input, so more than one output node can fire at the same time.

The sigmoid function is often used together with the loss function binary crossentropy.

Use for binary classification or multilabel classification problems.

$f(x) = \frac{1}{1 + e^{-x}}$

Range: 0 to 1 - Two classes

### Signal

Signal is the true, underlying trend or structure that one wants to capture from the data, or equivalently the relationship between the input data and the target values that one wishes to learn.

### Softmax activation function

The softmax activation function will calculate the relative probability of each target class over all possible target classes in the dataset given the inputs it receives. In other words it normalizes the outputs so that they sum to 1, so that they can be directly treated as probabilities over the output.

This is useful for multiclass classification models, as the target class with the highest probability is going to be the output of the model.

It is often used in the final layer in a classification model with the categorical crossentropy as loss function.

Range: 0 to 1 - Multiple classes

$\sigma(x_j) = \frac{e^{x_j}}{\sum_{k=0}^{K} e^{x_k}}$
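The normalization can be sketched in a few lines. The function names and the example inputs are assumptions, and subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result.

```python
import math


def softmax(logits):
    """Turn a list of raw outputs into probabilities that sum to 1."""
    shift = max(logits)  # avoids overflow in exp without changing the output
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
independent = [sigmoid(z) for z in logits]

print(probs, sum(probs))              # sums to 1: one class is picked
print(independent, sum(independent))  # no such constraint with sigmoid
```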

### Specificity

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

This metric is the proportion of examples of class 'B' that are correctly predicted as class 'B', with respect to all examples of class 'B'. In other words, the higher specificity, the fewer examples of class 'B' we will miss. It’s defined as:

$\text{Specificity} = \frac{\text{True negative}}{\text{True negative} + \text{False positive}}$
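Specificity, like precision and recall defined earlier in this glossary, can be computed directly from the four fields of the confusion matrix. The sketch below uses made-up labels, and the function name is an assumption.

```python
def confusion_counts(actual, predicted, positive="A"):
    """Return (tp, fp, fn, tn) with `positive` as the class being evaluated."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


actual = ["A", "A", "A", "A", "B", "B", "B", "B"]
predicted = ["A", "A", "B", "B", "A", "B", "B", "B"]
tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
print(precision, recall, specificity)
```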

### Squared hinge loss

Squared hinge loss is a loss function used for “maximum margin” binary classification problems. Mathematically it is defined as:

$L(y, \hat{y}) = \sum_{i=0}^{N}\Big(\max(0, 1 - y_i \cdot {\hat{y}}_i)^2\Big)$

where ŷ is the predicted value and y is either 1 or -1.

Thus, the squared hinge loss is:

• 0 when the true and predicted labels are the same and |ŷi|≥1 (which is an indication that the classifier is sure that it’s the correct label)

• greater than 0, and growing quadratically with the error, when the true and predicted labels are not the same, or when |ŷi|<1 even though the true and predicted labels are the same (which is an indication that the classifier is not sure that it’s the correct label)

Note: ŷ should be the actual numerical output of the classifier and not the predicted label.

The hinge loss guarantees that, during training, the classifier will find the classification boundary that is as far as possible from the data points of each of the different classes. In other words, it finds the classification boundary that guarantees the maximum margin between the data points of the different classes.

A sample use case would be when you want to classify email into ‘spam’ and ‘not spam’ and you’re only interested in the classification accuracy.
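The cases described above can be verified with a direct implementation of the formula. The function name and the example scores are assumptions for illustration.

```python
def squared_hinge(y_true, y_score):
    """Sum of max(0, 1 - y * score) ** 2, with labels y in {-1, 1}."""
    return sum(max(0.0, 1.0 - y * s) ** 2 for y, s in zip(y_true, y_score))


confident_correct = squared_hinge([1], [2.0])  # correct and sure
unsure_correct = squared_hinge([1], [0.5])     # correct side, inside the margin
wrong = squared_hinge([1], [-1.0])             # wrong side of the boundary
print(confident_correct, unsure_correct, wrong)
```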

### Standardization

Standardization (also known as Z-score normalization) performs a rescaling of your data so that it has a zero mean and a unit standard deviation. Values above the feature’s mean value will get positive scores, and those below the mean will get negative scores.

$z = \frac{x - \mu}{\sigma}$

Applied across your Combined feature, it guarantees that your data has mean zero and standard deviation one. This is desirable as:

1. it allows the comparison of features with different units or scales, and

2. the zero centered data offers favorable numeric conditions for training a model.

Note that unlike min-max normalization, standardized data will not have the exact same scale (standardized values can in principle lie anywhere between -∞ and +∞), but it will handle outliers in your data well.

Use standardization when your data has different units / scales, is not restricted to a range of values and / or when it has large outliers. Note that when you perform a standardization, you make the implicit assumption that the input data are normally distributed. However, standardization typically works well even when this is not the case.
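A minimal sketch using Python's `statistics` module is shown below. The function name, the use of the population standard deviation and the handling of a constant feature are assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
z_scores = standardize(data)
print(z_scores)
```

Values below the mean receive negative scores and values above it receive positive ones, with no fixed lower or upper bound.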

### Stochastic gradient descent (SGD)

Stochastic gradient descent (SGD) is an implementation of gradient descent that approximates the real gradient of the loss function. The real gradient is computed by taking all the training examples into account. The approximated gradient is instead calculated from a single training example at a time, iterating until all training examples have been processed.

This method is much faster than batch gradient descent, but it doesn’t calculate the real gradient of the loss function, since it only uses one training example at a time. This has the effect that the parameter updates made by SGD can fluctuate significantly (i.e., updates can be unstable and not necessarily correspond to a global pattern in the data), which translates to a model potentially being hard to train. This can be somewhat mitigated by slowly decreasing the learning rate throughout training. SGD is also more computationally expensive than batch gradient descent, since it’s calculating the gradient much more often.

Use SGD for online learning applications.

Note: The term SGD is usually used to describe both the single-example and the mini-batch method, since the only difference between them is the batch size.

### Subset

A subset is a smaller set of your dataset. For the purpose of training a model, you usually subdivide your dataset into three subsets: training set, validation set and test set.

### Supervised learning

Supervised learning is the task of learning a model from a dataset that has labeled examples. In other words, it is the task of learning a model / function that maps the input to the target. Examples include learning models for classification and prediction tasks.

### Tanh activation function

Tanh is a scaled sigmoid activation function. The gradient is stronger for tanh than sigmoid, that is, the derivatives are steeper.

Unlike the sigmoid function, the tanh function is zero-centered, which means that it doesn’t introduce a bias in the gradients, making it easier to train a network. The downside is that tanh is computationally more expensive than the sigmoid function.

Whether to use sigmoid or tanh depends on your requirement for gradient strength. Tanh resembles a linear function more closely as long as the activations of the network can be kept small. This makes the tanh network easier to compute.
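The description of tanh as a scaled sigmoid can be checked numerically: scaling the sigmoid's input and output by 2 and shifting it down by 1 reproduces tanh. The helper names are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh written as a rescaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_from_sigmoid(x), math.tanh(x))
```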

$f(x) = \frac{2}{1 + e^{-2x}} - 1$

Range: -1 to 1 - Two classes

### Target

Target represents the desired output that we want our model to learn. In the case of a classification problem, the targets would be the labels of each of the examples in the training set.

### Test set

A test set is a subset of your dataset that is used to check the performance of the model that was learned during training. It consists of a set of examples that the model has never seen, which help confirm its prediction accuracy.

The test set is only used after training is completed and is used to provide a final assessment of the performance of the model. Note that the validation set is not able to do this, since it was used during training to adjust the hyperparameters and/or the architecture of the model.

In short: the test set is used to assess the model’s performance (i.e., generalization and predictive power)

### Trained model

A trained model is a model that has undergone training.

### Training

Training is the process of building a model by setting the ideal parameters through the use of gradient descent applied on the training set.

### Training example

A training example is an example that is included in your training set.

### Training iteration

A training iteration is the process of the model training on a (mini) batch, i.e., a single update of a model’s weights during training.

### Training loss

Training loss is the average loss per training example of your model based on your training set.

### Training set

A training set is a subset of your dataset which contains all the examples available to a neural network to create a model during training. It’s the data that gradient descent runs on, in order to adjust the parameters of the model.

In short: the training set is used to fit the model parameters (i.e., weights).

### True negatives (TN)

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

True negatives is a field in the confusion matrix which shows the cases when the actual class of the example was 'B' and the predicted class for the same example was also 'B'.

See recall.

### True positives (TP)

Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.

True positives is a field in the confusion matrix which shows the cases when the actual class of the example was 'A' and the predicted class for the same example was also 'A'.

### Underfitting

Underfitting is the phenomenon of a model not performing well, i.e., not making good predictions, because it wasn’t able to correctly or completely capture the signal in the training set. In other words, the model is generalizing too much, to the point that it’s actually missing the signal.

This means that the model doesn’t perform well on training examples (resulting in a high training loss), nor on examples it hasn’t seen before (resulting in a high validation loss).

### Undersampling

Undersampling is the process of balancing a dataset by discarding examples of one or more overrepresented classes so that each class has the same amount of examples.

A balanced dataset allows a model to learn equal amounts of characteristics from each one of the classes represented in the dataset, as opposed to one class dominating what the model learns.

Use when you have an imbalanced dataset. Note that oversampling is usually preferred to undersampling as data is rarely overabundant.

### Unsupervised learning

Unsupervised learning is the task of learning a model from a dataset that doesn’t have labeled examples. In other words, it is the task of learning a model that captures the underlying (or hidden / latent) structures and patterns in the dataset. Examples include: clustering and dimensionality reduction.

### Validation example

A validation example is an example that is included in your validation set.

### Validation loss

Validation loss is the average loss per validation example of your model based on your validation set.

### Validation set

A validation set is a subset of your dataset which contains examples available to a neural network to adjust the hyperparameters or the model architecture based on the validation loss.

The validation set is used during training to run validation examples through the model after each epoch, in order to compute the validation loss. A good model (one that generalizes well) is one where the training loss is as small as possible, while at the same time keeping the gap between the training and validation loss as small as possible.

If the validation loss is high or starts increasing during early training, training can be stopped to adjust the hyperparameters or the model architecture, in order to improve the model’s performance. Alternatively, if the validation loss starts increasing after being at a comparatively low level, training can be stopped to prevent the model from overfitting.

In short: the validation set is used to tune the model’s hyperparameters or the model architecture (e.g., learning rate, number of layers, etc.)
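Stopping when the validation loss starts increasing can be sketched as a simple rule over the per-epoch validation losses. The function name and the `patience` parameter (how many epochs without improvement to tolerate) are assumptions for illustration, not a description of how any particular platform decides when to stop.

```python
def stop_epoch(validation_losses, patience=2):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


losses = [0.9, 0.6, 0.5, 0.52, 0.55, 0.6]
stopped_at = stop_epoch(losses)
print(stopped_at)
```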

### Vanishing gradient problem

The vanishing gradient problem is the phenomenon of the gradients calculated by gradient descent getting progressively smaller when moving backward in the network from the output layer to the input layer.

This means that the weights of the nodes in early layers only change slowly (compared to later layers in a network), which means that they train and hence, learn, very slowly or not at all.
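A deliberately simplified sketch shows why this happens with sigmoid activations. It ignores the weights and assumes every node sits at the input where the sigmoid derivative is largest (0.25), so the real effect is usually at least as strong. The function names are assumptions for illustration.

```python
import math


def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def gradient_scale(num_layers, pre_activation=0.0):
    """Product of sigmoid derivatives across layers, with weights ignored."""
    return sigmoid_derivative(pre_activation) ** num_layers


for depth in (1, 5, 10):
    print(depth, gradient_scale(depth))
```

Each additional layer multiplies the gradient reaching the early layers by at most 0.25 in this simplified picture.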

### Weight initialization

Weight initialization is the process of assigning some starting values to the weights of your model, before starting training.

The starting values of the weights have a significant impact on the training of your model. Naïve initialization strategies, like making the initial value of all weights equal to 0, can result in your model not learning anything at all (or in other words, gradient descent is unable to converge). A good weight initialization strategy can also help prevent the vanishing / exploding gradient problem.

Always use a weight initialization strategy with dense layers, convolutional layers and LSTM units.

### Weights

The weights and the operations performed on them by a neural network are the way your model is encoded in the network. A weight is a trainable parameter and its value can be thought of as the strength of the connection between different nodes. The higher the weight, the stronger the connection between two nodes or alternatively, the more important that connection is for the model to predict the target.