Glossary
A
Accuracy
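Accuracy is a performance metric that tells you how often your model makes the right prediction. It’s used in classification models and is the ratio of:
\[ \text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \]
or equivalently, in terms of the fields of the confusion matrix:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]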
Activation function
An activation function is a nonlinear function which takes the weighted sum of all the inputs to a node and maps it to a value in a range such as 0 to 1 (e.g., Sigmoid), 0 to ∞ (e.g., ReLU) or -1 to 1 (e.g., TanH).
Its nonlinear nature is what allows neural networks to model any kind of function (it makes them a universal function approximator).
Use an activation function in every node of a network. On the Peltarion Platform, all nodes are automatically assigned an activation function, which can be changed as one of the configurable parameters of the blocks that have nodes.
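As an illustration of the ranges mentioned above, here is a minimal NumPy sketch of the three activation functions applied to the weighted sum of one node. The input values, weights and bias are made-up numbers, and this is not the Platform's implementation.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values and clip negative values to 0."""
    return np.maximum(0.0, z)

def tanh(z):
    """Squash values into the open interval (-1, 1)."""
    return np.tanh(z)

# Weighted sum of the inputs to a single node (hypothetical numbers).
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, -0.2])
bias = 0.1
z = np.dot(weights, inputs) + bias

activations = {"sigmoid": sigmoid(z), "relu": relu(z), "tanh": tanh(z)}
print(z, activations)
```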
Adadelta optimizer
Adadelta is a more robust extension of Adagrad that seeks to reduce Adagrad’s aggressive, monotonically decreasing learning rate. Instead of accumulating all past gradients, it restricts the accumulation to a fixed-size moving window of gradient updates.
It is recommended to leave the parameters of this optimizer at their default values.
Adagrad optimizer
Adagrad is an optimizer that adapts the learning rate to the parameters. Adagrad performs:

Smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features

Larger updates (i.e., high learning rates) for parameters associated with infrequent features.
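The effect described in the two points above can be sketched with the standard Adagrad update rule. The learning rate, the epsilon term and the toy gradient history are assumptions chosen for illustration only.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter has its own effective learning rate."""
    accum = accum + grads ** 2                       # running sum of squared gradients
    params = params - lr * grads / (np.sqrt(accum) + eps)
    return params, accum

params = np.zeros(2)
accum = np.zeros(2)

# Parameter 0 receives a gradient at every step (frequent feature);
# parameter 1 only receives one at the last step (infrequent feature).
grad_history = [np.array([1.0, 0.0])] * 9 + [np.array([1.0, 1.0])]

step_sizes = []
for g in grad_history:
    new_params, accum = adagrad_step(params, g, accum)
    step_sizes.append(np.abs(new_params - params))
    params = new_params

last = step_sizes[-1]
print(last)  # the infrequent parameter takes the larger step
```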
Adam optimizer
Adam can be viewed as a combination of RMSprop and momentum.
RMSprop contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients.
Adam adds bias correction and momentum to RMSprop.
If momentum can be seen as a ball rolling down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the loss function landscape. The authors of the original paper show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
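To make the combination of the two averages concrete, the following is a minimal sketch of the commonly published Adam update, applied to the toy function f(x) = x². The hyperparameter values and the toy problem are assumptions, not Platform defaults.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of past gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of past squared gradients (RMSprop)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2, whose gradient is 2x.
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
trajectory = [x.copy()]
for t in range(1, 2001):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
    trajectory.append(x.copy())

print(trajectory[1], trajectory[-1])
```

Thanks to the bias correction, the very first step has a size close to the learning rate, regardless of the gradient's magnitude.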
Adamax optimizer
Adamax is an extension of Adam based on the infinity norm.
Add
The Add block can take any number of inputs, all with the same shape, and returns a single tensor of the same shape, containing the elementwise sum over all inputs.
Useful when building residual networks, where the layer’s input is added with an Add node to its output.
Amsgrad optimizer
A variant of Adam based on the algorithm from the paper "On the Convergence of Adam and Beyond".
Area under the ROC Curve (AUC)
The AUC is the area under the ROC curve and is a performance measure that tells you how well your model can classify different classes. The higher the AUC the better the model.
Autoencoder
An autoencoder is a special kind of neural network that aims to copy its input to its output. What makes this a non-trivial task is that rather than just directly copying the input to the output, the network first tries to encode the input into a small number of parameters (the latent-space representation) and then tries to reconstruct the original data at the output from it.
The latent space can be thought of as an efficient recoding of the input data. This recoding of the input data can itself be interpreted in different ways, each one of which gives us insights into the applications of autoencoders:

Data compression

Dimensionality reduction / feature extraction: the smaller encoding can be thought of as the autoencoder having performed dimensionality reduction (using nonlinear transformations!) on the original data.

Data denoising: the latent space’s learned ability to distinguish the noise and discard it can be used to remove noise from the input data (e.g., removing noise from an image).

Generative modeling: by forcing the latent-space representation to roughly follow a unit Gaussian distribution, it is possible to sample a new latent vector from the unit Gaussian and pass it to the decoder. This process effectively ‘generates’ a completely new data output, one that the network had never seen before.
B
Backpropagation
Backpropagation is shorthand for “the backward propagation of errors” and is the main algorithm used to calculate the partial derivative of the error for each parameter in a neural network.
In other words, the algorithm computes the gradient over all parameters which is used by gradient descent to determine how to update the weights of the network, in order to achieve a lower loss. Backpropagation can be thought of as an implementation of the chain rule of derivatives for computation graphs.
Batch
A batch is a fixed number of examples used in one training iteration during the model training phase.
Batch gradient descent
Batch gradient descent is an implementation of gradient descent which computes the real gradient of the loss function by taking into account all the training examples.
In practice, batch gradient descent is rarely used for deep learning applications, because calculating the real gradient from all the training examples 1) requires storing the entire training set in memory (which is often not feasible) and 2) is slow. Instead, methods that approximate the real gradient, like stochastic gradient descent or mini-batch (stochastic) gradient descent, are used.
Batch normalization
Batch normalization standardizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.
This helps speed up the training of the network. It also helps the network learn to generalize, as it introduces a small, controlled amount of noise in the inputs of the subsequent layer. The effect of this noise on the network is similar to that of dropout.
Use batch normalization with convolutional neural networks after a convolutional layer (or after a convolutional + pooling layer).
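The standardization step can be sketched in NumPy as follows. The `gamma` and `beta` parameters stand for the learnable scale and shift of the standard batch normalization formulation, and `eps` is a small constant added for numerical stability; all three are assumptions beyond what is described above, and the batch values are made up.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) using the statistics of the batch."""
    mean = x.mean(axis=0)                      # batch mean per feature
    var = x.var(axis=0)                        # batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)    # subtract mean, divide by std
    return gamma * x_hat + beta                # optional learnable scale and shift

# A batch of 3 examples with 2 features on very different scales.
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
normalized = batch_norm(batch)
print(normalized)
```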
Bias
Bias can have three meanings:

Biased data (ethics): data that inherently favors and/or is detrimental to things or (groups of) people. This bias can be introduced, intentionally or not, during the data creation, collection or usage stage.

Bias model: see separate entry of the same name

Bias term: in a mathematical formula, the bias term is its offset from the origin and is often referred to as b or w_{0} in deep learning terminology.
Binary cross-entropy
Binary cross-entropy is a loss function used on problems involving yes/no (binary) decisions. It’s defined as:
\[ L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right] \]
where ŷ is the predicted expected value and y is the observed value.
Binary cross-entropy measures how far away from the true value (which is either 0 or 1) the prediction is for each of the classes and then averages these class-wise errors to obtain the final loss.
Use binary cross-entropy on multi-label problems.
Use binary cross-entropy together with the sigmoid activation function.
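A minimal NumPy sketch of this loss on a small multi-label problem is shown below. The labels and logits are made up, and clipping with `eps` is an assumption added to avoid taking the logarithm of 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average the per-class binary errors over all classes and examples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_class = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_class))

# Two examples, three independent yes/no labels each (multi-label).
y_true = np.array([[1.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0]])
logits = np.array([[2.0, -1.0, 0.5],
                   [-2.0, -3.0, 3.0]])

good = binary_crossentropy(y_true, sigmoid(logits))   # predictions agree with labels
bad = binary_crossentropy(y_true, sigmoid(-logits))   # predictions flipped
print(good, bad)
```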
Block
A block is the basic building unit in the Peltarion Platform. Blocks represent the basic components of a neural network and/or the actions that can be carried out on them.
C
Categorical cross-entropy
Categorical cross-entropy is a loss function that is used for single-label classification. This is when only one category is applicable for each data point. In other words, an example can belong to one class only. It’s defined as:
\[ L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(\hat{y}_{ij}) \]
where ŷ is the predicted expected value, y is the observed value, N is the number of examples and M is the number of classes.
Categorical cross-entropy will compare the distribution of the predictions (the activations in the output layer, one for each class) with the true distribution, where the probability of the true class is set to 1 and 0 for the other classes. To put it in a different way, the true class is represented as a one-hot encoded vector, and the closer the model’s outputs are to that vector, the lower the loss.
Use categorical cross-entropy in classification problems where only one result can be correct.
Use categorical cross-entropy together with the softmax activation function.
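The pairing of softmax with categorical cross-entropy can be sketched as follows. The one-hot targets and logits are made-up values, and the `eps` clipping is an assumption added to avoid taking the logarithm of 0.

```python
import numpy as np

def softmax(logits):
    """Turn raw outputs into a probability distribution over the classes."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over examples of -sum_j y_j * log(y_hat_j)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

# Two examples, three classes, one-hot encoded targets.
y_true = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
logits = np.array([[0.1, 2.0, -1.0],
                   [3.0, 0.2, 0.1]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
uniform_loss = categorical_crossentropy(y_true, np.full((2, 3), 1 / 3))
print(loss, uniform_loss)
```

A model that spreads its probability evenly over the three classes gets a loss of log(3), so any useful model should score below that.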
Categorical feature
A categorical feature is an input variable that has a discrete set of possible values.
Example: If your variable is season, the possible values it can take are Winter, Spring, Summer and Autumn.
Class
A class is a group to which a specific example can belong. For example, in the multi-class classification model we built in the 'Classifying images of clothes' tutorial, the classes are ‘Ankle boot’, ‘T-shirt’, ‘Dress’, ‘Pullover’, etc. In a classification model, a class is your target, i.e., what you want your model to predict. Also, classes appear in the dataset as a categorical feature.
Class weighting
Class weighting is the inclusion of a coefficient in the loss function calculation, to improve single-label classification results on imbalanced datasets. This coefficient scales the error of each training example inversely to the frequency of its target category.
Similarly to oversampling and undersampling, class weighting prevents models from achieving artificially high accuracy by only learning which class is the most frequent, i.e., the most likely to be presented to the model.
For instance, on an imbalanced dataset containing 900 examples of class A and 100 examples of class B, a classification model that always predicts A would achieve a relatively low loss and a 90% accuracy.
By scaling up the error of misclassified B examples, class weighting pushes models to learn a better representation of each class.
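For the 900/100 example above, one common heuristic is to weight each class by n_examples / (n_classes × class count). This particular formula is an assumption for illustration; the Platform may compute its coefficients differently.

```python
import numpy as np

labels = np.array(["A"] * 900 + ["B"] * 100)
classes, counts = np.unique(labels, return_counts=True)

# Inverse-frequency weights: rare classes get proportionally larger weights.
weights = len(labels) / (len(classes) * counts)
class_weight = {str(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)

# With these weights both classes contribute equally to the total loss.
total_a = 900 * class_weight["A"]
total_b = 100 * class_weight["B"]
```

With this choice an error on a B example counts nine times as much as an error on an A example, so always predicting A is no longer a cheap strategy.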
Classification
Classification is the process through which a trained model predicts (or assigns) one or several classes for one example. If the model is constrained to predict precisely one class for each example, it is a single-label classification model. If the model can assign each example to several classes, it is a multi-label classification model.
To train the model, a training set needs to have multiple labeled examples for each of the desired prediction classes.
Multi-label classification
Multi-label classification is a variant of classification where multiple classes (labels) may be assigned to each example.
Example: You want to know which of a number of objects are present in an image, where there can be a dog, a person and a car in the same image.
Single-label classification
Single-label classification is a variant of classification where precisely one class (label) is assigned to each example.
Example: Images of skin lesions are classified as either benign or malignant, since a lesion cannot be both at the same time.
Clustering
Clustering is the process through which an algorithm tries to group data points into different clusters based on some similarity between them.
Unlike classification, no model is trained beforehand to predict predefined classes (in other words, no training examples are provided). Instead, a clustering algorithm discovers by itself what relationships are found in the data and groups the individual data points based on this. This is an example of unsupervised learning.
Concatenate
Concatenate takes any number of inputs that have the same shape on all axes except the one to concatenate, and merges them along that specified axis. Along the concatenation axis, the size of the output is the sum of the sizes of the inputs; along all other axes, if any, the output has the same size as the inputs.
Note that the input order is important.
Concatenating can be useful to merge features coming from different parts of the model towards the end (for instance merging image features with tabular data when estimating the price of a house), or joining together multiple images that are tiles of a bigger map.
Read more about how you can use the Concatenate block on the Platform here.
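The shape rule and the importance of input order can be illustrated with NumPy's `np.concatenate`, used here as a stand-in for the block. The feature sizes (128 image features, 10 tabular features) are made-up values.

```python
import numpy as np

# A batch of 4 examples with features from two different parts of a model.
image_features = np.ones((4, 128))
tabular_features = np.zeros((4, 10))

# Merge along the feature axis: 128 + 10 = 138 features per example.
merged = np.concatenate([image_features, tabular_features], axis=1)

# Swapping the inputs gives the same shape but a different feature order.
swapped = np.concatenate([tabular_features, image_features], axis=1)

print(merged.shape, swapped.shape)
```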
Confusion matrix
A confusion matrix helps to illustrate what kinds of errors a classification model is making.
If you have a binary classifier model that distinguishes between a positive and a negative class, you can define the following 4 values depending on the actual vs predicted class.
The resulting matrix has 4 fields known as:

True Positives (TP)

True Negatives (TN)

False Positives (FP)

False Negatives (FN)
Different combinations of these fields result in a number of key metrics, including: accuracy, precision, recall, specificity and F1-score.
The confusion matrix is a compact but very informative representation of how your classification model is performing. It is the go-to tool for evaluating classification models.
The confusion matrix is used for binary or multi-class single-label classification problems.
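As a sketch of how the four fields combine into the metrics listed above, the following plain-Python example counts them for a small binary problem. The labels and predictions are made up, with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four fields of a binary confusion matrix."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
fallout = fp / (fp + tn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn, accuracy, precision, recall, specificity, fallout, f1)
```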
Convolution operation (convolve)
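In the context of deep learning, a convolution operation is a mathematical operation between a matrix (which usually encodes an image) and another, smaller matrix called the filter. The filter slides over the matrix, and at each position the overlapping elements are multiplied element by element and summed to give one element of the output, as follows ^{1}:
\[ (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n) \, K(m, n) \]
where I is the input matrix and K is the filter.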
For a detailed explanation of convolutions in the context of Deep Learning see Christopher Olah’s excellent article or Dumoulin, V., Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285v2, a fantastic paper.
^{1} Technically this is actually called cross-correlation, which is very closely related to convolution.
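The sliding-window operation can be sketched in NumPy as below. The sketch assumes no padding and a stride of 1, and the image and filter values are made up; it is written for readability, not speed.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the filter over the image; multiply element by element and sum."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))  # no padding, stride 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # image[i, j] = 4 * i + j
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # difference along the diagonal

feature_map = cross_correlate2d(image, kernel)
print(feature_map)
```

Because the same small filter is reused at every position, the 4 x 4 input is processed with only 4 filter weights.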
Convolutional layer
A convolutional layer applies the convolution operation between its filters and its input. On the Peltarion Platform, you can find the 2 dimensional implementation of a convolutional layer under the name of '2D Convolution.'
This type of layer helps you take advantage of the spatial information in your data e.g., the relationships between adjacent pixels in an image (this reflects the intuition that features in an image are often not dependent on the position). By applying the same filter to all positions of an image, they can help reduce the number of parameters of your model compared to fully connected layers.
Use convolutional layers on data where the spatial relationships of the data are important, like images. In theory, you can also try using convolutional layers on sequential or time series data, but LSTM units are more tailored for this data.
Convolutional neural network
A convolutional neural network is a type of neural network that makes use of convolutional, pooling and dense layers.
They are tailored to take advantage of the spatial information in your data, e.g., the relationships between adjacent pixels in an image (this reflects the intuition that features in an image are often not dependent on the position). By convolving the input data with the filters of the convolutional layers and subsequently pooling their outputs with pooling layers, convolutional neural networks are able to reduce the number of parameters that your model needs to learn from the data, when compared to fully connected layers.
Use convolutional neural networks when you’re working with image data.
D
Dataset
Deconvolution
Deconvolution performs an opposite operation to the convolution operation. This is also referred to as transposed convolution, which better reflects the actual mathematical operation.
The deconvolution operation is used when we transform the output of a convolution back into a tensor that has the same shape as the input of that convolution, which is useful if we want a model that transforms images into other images, instead of just giving categorical or scalar predictions. Convolution is not itself an invertible operation, which means we cannot simply go from the output back to the input. Deconvolution layers instead have to learn weights in the same way as convolution layers.
Deconvolutional layer
A deconvolutional layer applies the deconvolution (or, more correctly, the transposed convolution) operation between its filters and its input. On the Peltarion Platform, you can find the 2-dimensional implementation of a deconvolutional layer under the name of '2D Deconvolution.'
This type of layer helps you transform the output of a convolution back into a tensor that has the same shape as the input of that convolution, which is useful if we want a model that transforms images into other images, instead of just giving categorical or scalar predictions.
Dense (or fully connected) layer
In a dense layer every node of the layer is connected to every node in the subsequent layer. Thus, it feeds all outputs from the previous layer to all its nodes, each of which provides one output to the subsequent layer. On the Peltarion Platform this layer is represented by the dense block.
A dense layer is the most basic layer of a neural network. The combination of more than one of these layers results in the classical type of neural network, the multilayer perceptron (MLP). They are very flexible in nature and can in general learn almost any mapping from inputs to outputs.
Use dense layers to build complete networks that are used for classification or prediction problems on tabular data. You can also use dense layers as the last layer(s) after the convolutional layers or recurrent neural networks.
Dimensionality reduction
Dimensionality reduction is the process of converting your data into a low(er) dimensional representation, while retaining as much information as possible. In practice this implies identifying the features in your dataset that are the most important for the model to achieve the desired objective (a.k.a., the principal variables), and then either discarding or summarizing the remaining features in terms of the principal variables.
Dimensionality reduction is useful to "reduce the amount of unnecessary information" in your data, which can help speed up model training and in some cases help reduce overfitting.
Use dimensionality reduction on datasets that have a large number of features as part of your data preprocessing step. Dimensionality reduction is also used in various embedding techniques.
Dropout
Dropout is a regularization technique which randomly zeros out (i.e., “drops out”) some of the units (node outputs) in a given layer during each training iteration. The effects of this can be interpreted as training different neural networks during each training iteration. The final trained network is thus analogous to the ‘average’ of all the networks seen during training.
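A minimal sketch of dropout in its common "inverted" form is shown below, where a random subset of a layer's outputs is zeroed during training and the surviving outputs are rescaled so that their expected value is unchanged. The rescaling, the drop rate and the array sizes are assumptions for illustration.

```python
import numpy as np

def dropout(activations, drop_rate, rng, training=True):
    """Randomly zero a fraction `drop_rate` of the activations during training."""
    if not training or drop_rate == 0.0:
        return activations                 # dropout is disabled at prediction time
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob  # rescale to keep the expected value

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))

dropped = dropout(layer_output, drop_rate=0.5, rng=rng)
unchanged = dropout(layer_output, drop_rate=0.5, rng=rng, training=False)
print(dropped.mean())
```

A new random mask is drawn on every call, which is what makes each training iteration see a slightly different network.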
E
Embedding
Embeddings allow you to turn a categorical variable (e.g., words in a book) into a fixed size vector of real numbers. The key features of embeddings are that they:

map high-dimensional data into a lower-dimensional space

can be trained to discover relationships between the data points (i.e., the vectors).
By transforming a categorical feature into a numeric value through embeddings, it can be used by your model either as a feature or target value.
Use embeddings to encode categorical features with a large number of categories (e.g., words or sentences) and/or when it’s important to understand how different categories of your categorical feature relate to each other.
Epoch
An epoch represents a full pass over the entire training set, meaning that the model has seen each example once. An epoch thus consists of (total number of examples / batch size) training iterations.
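As a small worked example with made-up numbers: when the number of examples is not an exact multiple of the batch size, the sketch below assumes that the last, smaller batch is still used, so the result is rounded up.

```python
import math

num_examples = 50_000   # hypothetical training set size
batch_size = 128        # hypothetical batch size

# A final, smaller batch is assumed to be kept, hence the ceiling.
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)  # 391
```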
Error
Error is the numeric value that represents how different the predicted output of your model is when compared to the expected output. It is the essential building block of a loss function.
Example
An example is an entry in your dataset which holds values for each of the features of your dataset and possibly a label as well. Usually an example is a row in your dataset and is represented as a vector of features. An example specifies what the output layer should do, given the features. The examples in your dataset should be representative of the data you expect to have available when you deploy the model and which you will use to predict the desired output.
Exploding gradient problem
The exploding gradient problem is the phenomenon of the gradients used by gradient descent growing progressively larger during training, which results in very large parameter updates and makes training unstable.
F
F1-score
In theory a good model (one that makes the right predictions) will be one that has both high precision, as well as high recall. In practice however, a model has to make compromises between both metrics. Thus, it can be hard to compare the performance between a model with high recall and low precision versus a model with low recall and high precision.
F1-score is a metric that summarizes both precision and recall in a single value, by calculating their harmonic mean, which allows it to be used to compare the performance across different models. It’s defined as:
\[ F_{1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
Fallout
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
Fallout is the proportion of examples of class 'B' that were predicted as class 'A', with respect to all examples of class 'B'. In other words, the higher the value of fallout, the more examples of class 'B' will be misclassified. It’s defined as:
\[ \text{Fallout} = \frac{FP}{FP + TN} \]
False Negatives (FN)
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
False negatives is a field in the confusion matrix which shows the cases when the actual class of the example was 'A' and the predicted class for the same example was 'B'.
False Positives (FP)
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
False positives is a field in the confusion matrix which shows the cases when the actual class of the example was 'B' and the predicted class for the same example was 'A'.
False positive rate (FPR)
See fallout.
Feature
A feature is an input variable. It can be numeric or categorical. For example, a house can have the following features: number of rooms (numeric), year built (numeric), neighbourhood (categorical), street name (categorical), etc. Multiple features are usually grouped in a Combined feature.
Feature set
Filter (convolution)
Filters are the main component of a convolutional layer. They are the 'windows' that slide over the input data to perform the convolution operation with the data coming into the convolutional layer.
G
Generalization
Generalization is the ability of a model to perform well on data / inputs it has never seen before. The goal of any model is to learn good generalization from the training examples, in order to later on perform well on examples it has never seen before. Generalization can thus be thought of as the notion of how well a model has characterized and encoded the signal found in the training examples.
A good indication that a model generalizes well is when the training loss is as small as possible, while at the same time keeping the gap between the training and validation loss as small as possible. The final confirmation that the model has generalized well comes from when the test loss is also low, meaning that the model performs well on never before seen examples.
Glorot uniform initialization
Glorot uniform initialization (also known as Xavier initialization) is a technique used to initialize the weights of your model by assigning them values from the following uniform distribution:
\[ W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_{i} + n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i} + n_{i+1}}} \right] \]
where n_{i} is the number of units in layer i.
Glorot uniform initialization helps keep the variance of the inputs, outputs, activations and gradients the same across all the layers of a network, which helps prevent the activations and the gradient updates from vanishing or exploding in size.
Use Glorot uniform initialization with dense layers, convolution layers and LSTM units.
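A minimal NumPy sketch of this initialization for a dense layer follows, assuming the usual limit of sqrt(6 / (n_in + n_out)). The layer sizes and the function name `glorot_uniform` are chosen for illustration.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix from U[-limit, limit]."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(42)
weights = glorot_uniform(256, 128, rng)

limit = np.sqrt(6.0 / (256 + 128))   # 0.125 for these layer sizes
print(limit, weights.var())          # the variance is close to 2 / (n_in + n_out)
```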
Gradient descent
Gradient descent is the basic algorithm used to minimize the loss based on the training set. It’s an iterative process through which the parameters of your model are adjusted, gradually finding the best combination to minimize the loss. It does this by computing the gradient (or the ‘slope’) of the loss function and then ‘descending’ down it (or taking a step down the ‘slope’) towards a lower loss value.
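The iterative process can be sketched on a tiny linear model trained with mean squared error. The data, the learning rate and the number of iterations are made-up values, and the gradient is computed from all examples at every step.

```python
import numpy as np

# Toy data generated from y = 3x + 1, so the best parameters are known.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0

w, b = 0.0, 0.0          # model parameters, starting from zero
learning_rate = 0.1
losses = []

for _ in range(500):
    error = (w * x + b) - y
    losses.append(float(np.mean(error ** 2)))   # current loss (MSE)
    grad_w = 2 * np.mean(error * x)              # slope of the loss w.r.t. w
    grad_b = 2 * np.mean(error)                  # slope of the loss w.r.t. b
    w -= learning_rate * grad_w                  # step down the slope
    b -= learning_rate * grad_b

print(w, b, losses[0], losses[-1])
```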
Gradient norm
H
Hidden layer
A hidden layer is any layer in a neural network between the input layer and the output layer.
Hyperparameter
A hyperparameter is any parameter of a model or training process that has to be set / fixed before starting the training process, i.e., they are not automatically adjusted during the training. Examples of hyperparameters include: drop rate (dropout), batch size, learning rate, number of layers, number of filters, etc.
I
Imbalanced dataset
An imbalanced dataset is a dataset that contains a very different number of examples for each of its classes.
Training on such datasets generally leads to biased models, since each class affects the loss proportionally to its frequency.
A common way to improve results on infrequent classes is to create a new dataset and balance it by oversampling or undersampling examples.
Class weighting can also be used if the imbalance occurs in a categorical feature which is the model’s target; for instance, when learning to identify normal and anomalous classes from a dataset having many more normal examples than abnormal ones.
Initializer
Input
Input is the series of examples fed to a layer.
Input layer
The input layer is the first layer of a neural network and is the one which takes the individual examples of your training set as input to the model. On the Peltarion Platform this layer is represented by the Input block. You can think of the Input block as a 'placeholder' for the data that is going to be fed into the model.
J
K
Kernel (convolution)
See filter (convolution)
L
Label
A label is a class (in classification) or a target (in regression) assigned to examples in a dataset. When training a model, labels are how you tell the model what you want it to predict. For example, in image classification: "See an apple, say 'apple'"
Learning
Learning rate
The learning rate is a hyperparameter of gradient descent. It’s a scalar which controls the size of the update steps along the gradient.
Choosing the right learning rate is crucial for optimal gradient descent and thus, for optimal training. Too small and gradient descent will only take small steps in each iteration, meaning model training will be slow. Too big and gradient descent will take too large of a step, potentially ending up 'bouncing around' the loss surface, making training unstable.
Linear activation function
The linear activation function is a straight-line function where the node’s activation is proportional to the input.
Use in the last layer of a regression model, if the output variable does not have known upper and lower bounds.
Range: -∞ to +∞
Loss
Loss is a measure of how well your algorithm models your dataset. It’s a numeric value which is computed by the loss function. The lower the loss, the better the performance of your model.
Loss function
A loss function is a function that determines how errors are penalized. The goal of the loss function is to capture in a single number the total amount of errors across all training examples.
LSTM unit
A long short-term memory (LSTM) unit is a special kind of recurrent neural network building block that has a built-in ability to 'remember' or 'forget' parts of sequential data. This ability allows an RNN using LSTM units to learn very long-range connections in sequential data, by keeping relevant information 'stored' in the unit.
Use in recurrent neural networks.
M
Mean absolute error (MAE)
Mean absolute error (MAE) is a loss function used for regression. The loss is the mean over seen data of the absolute differences between true and predicted values, or writing it as a formula:
\[ L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}| \]
where ŷ is the predicted expected value and y is the observed value.
MAE is not sensitive towards outliers and given several examples with the same input feature values, the optimal prediction will be their median target value. This should be compared with mean squared error, where the optimal prediction is the mean. A disadvantage of MAE is that the gradient magnitude is not dependent on the error size, only on the sign of y_{i} - ŷ_{i}. As a result, the gradient magnitude stays large even when the error is small, which in turn can lead to convergence problems.
Use mean absolute error when you are doing regression and don’t want outliers to play a big role. It can also be useful if you know that your distribution is multimodal, and it’s desirable to have predictions at one of the modes, rather than at the mean of them. MAE can also be used as a performance metric, since it’s easy to interpret.
Mean squared error (MSE)
Mean squared error (MSE) is the most commonly used loss function for regression. The loss is the mean over seen data of the squared differences between true and predicted values, or writing it as a formula:
\[ L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2} \]
where ŷ is the predicted expected value and y is the observed value.
Minimizing MSE is equivalent to maximizing the likelihood of the data under the assumption that the target comes from a normal distribution, conditioned on the input.
MSE is sensitive towards outliers and given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with mean absolute error, where the optimal prediction is the median. MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it’s important to penalize outliers heavily.
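The median-versus-mean behavior of the two losses can be checked numerically. In the sketch below, the target values (with one large outlier) and the grid of candidate predictions are made up; a brute-force search finds the single constant prediction that minimizes each loss.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Targets for examples that share the same input features; 100 is an outlier.
targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Try every constant prediction on a fine grid and keep the best one per loss.
candidates = np.linspace(0.0, 100.0, 10001)
best_mae = candidates[np.argmin([mean_absolute_error(targets, c) for c in candidates])]
best_mse = candidates[np.argmin([mean_squared_error(targets, c) for c in candidates])]

print(best_mae, np.median(targets))  # both close to 3
print(best_mse, targets.mean())      # both close to 22
```

The outlier drags the MSE-optimal prediction far away from the bulk of the data, while the MAE-optimal prediction stays at the median.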
Mean squared logarithmic error (MSLE)
Mean squared logarithmic error (MSLE) is, as the name suggests, a variation of the mean squared error. The loss is the mean over the seen data of the squared differences between the log-transformed true and predicted values, or writing it as a formula:
\[ L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \left( \log(y_{i} + 1) - \log(\hat{y}_{i} + 1) \right)^{2} \]
where ŷ is the predicted expected value and y is the observed value.
This loss can be interpreted as a measure of the ratio between the true and predicted values, since:
\[ \log(y_{i} + 1) - \log(\hat{y}_{i} + 1) = \log\left( \frac{y_{i} + 1}{\hat{y}_{i} + 1} \right) \]
The introduction of the logarithm makes MSLE only care about the relative difference between the real and the predicted value, or in other words, it only cares about the percentage difference between them. This means that MSLE will treat small differences between small true and predicted values approximately the same as big differences between large true and predicted values. MSLE also penalizes underestimates more than overestimates, introducing an asymmetry in the error curve.
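Both properties, the focus on relative differences and the asymmetry, can be checked with a few made-up values. The sketch assumes the common log(y + 1) form of the loss, which keeps the logarithm defined when a value is 0.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared difference of log(1 + value); assumes non-negative values."""
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Same ratio (y + 1) / (y_hat + 1) = 0.5 at two very different scales.
small = msle(np.array([9.0]), np.array([19.0]))
large = msle(np.array([999.0]), np.array([1999.0]))

# Same absolute error of 50, once as an underestimate and once as an overestimate.
under = msle(np.array([100.0]), np.array([50.0]))
over = msle(np.array([100.0]), np.array([150.0]))

print(small, large)  # equal: only the ratio matters
print(under, over)   # the underestimate is penalized more
```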
Min-max normalization
Min-max normalization performs a linear rescaling of your data. It transforms the lowest value of an input feature to 0, and the highest value to 1. Every other value in between the min and max values is transformed to a decimal between 0 and 1.
Applied across your Combined feature, it will guarantee that all values are between 0 and 1, which makes it easier for a network to train since all the values are within the same range. It’s also a computationally efficient way to normalize your features when they have a bounded range of values and when they don’t have large outliers.
Use min-max normalization to normalize your data when it’s bounded (e.g., pixel values in an image) and when it doesn’t contain any large outliers. (If your dataset has large outliers, use standardization instead.)
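The linear rescaling amounts to (x - min) / (max - min), sketched below on made-up pixel values. The sketch assumes the feature is not constant, since a constant feature would lead to a division by zero.

```python
import numpy as np

def min_max_normalize(x):
    """Linearly rescale each feature so its minimum maps to 0 and its maximum to 1."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)   # assumes x_max > x_min

pixels = np.array([0.0, 64.0, 128.0, 255.0])
scaled = min_max_normalize(pixels)
print(scaled)
```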
Mini-batch
A batch can be subdivided into smaller mini-batches.
Mini-batches are used as a way to speed up gradient descent through mini-batch gradient descent.
Mini-batch (stochastic) gradient descent
Mini-batch gradient descent is an implementation of gradient descent which approximates the real gradient of the loss function (computed from all the training examples) with a gradient calculated from the training examples in a single mini-batch. It updates the parameters after each mini-batch and repeats this until it has gone through all mini-batches.
Mini-batch gradient descent tries to take the best of batch gradient descent and stochastic gradient descent. It’s much faster than batch gradient descent, but avoids the extreme fluctuations in the parameter updates of SGD (i.e., it’s more stable than SGD), thus making model training faster without any significant downsides.
Note

The term SGD is usually used to both describe the single example and the minibatch method, since the difference between both is only in the batch size. 
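The procedure can be sketched on a toy problem. The example below fits a single weight w in y = w * x with a mean squared error loss; the function name, the data and the default settings are all assumptions chosen for illustration.

```python
import random

def minibatch_gradient_descent(xs, ys, batch_size=2, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by updating w once per minibatch."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new minibatch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this minibatch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

w = minibatch_gradient_descent([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Setting `batch_size=1` in this sketch gives single-example SGD, and setting it to the size of the training set gives batch gradient descent, which is the sense in which the methods differ only in batch size.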
Model
Model can have two meanings:

It’s the combination of your neural network architecture and your specific hyperparameter settings. On the Peltarion Platform, a model is a sequence of blocks that have been strung together to achieve the desired model architecture.

It’s the (mathematical) representation that the neural network has learned after having been trained on the training set.
Momentum
Momentum compares the gradient of the previous iteration with the gradient of the current iteration and then takes bigger steps for the dimensions in which the gradients point in the same direction and smaller steps for the dimensions in which they don’t. In other words, SGD 'gains momentum' in those directions where the gradient (or the 'slope') is pointing in the same direction in subsequent iterations. In the analogy of gradient descent being a ball rolling down a hill, momentum is equivalent to adding 'inertia' to the ball, or to using a heavier ball to go down the same hill.
Momentum helps dampen the fluctuations of SGD and helps it accelerate towards the relevant gradient direction.
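One common way to write the momentum update is sketched below for a single parameter. The names `velocity` and `gamma` and the default values are assumptions, not settings taken from this glossary.

```python
def momentum_step(param, velocity, grad, lr=0.1, gamma=0.9):
    """One SGD-with-momentum update for a single parameter."""
    # The velocity is a decaying sum of past gradient steps.
    velocity = gamma * velocity + lr * grad
    return param - velocity, velocity

# Two consecutive gradients pointing the same way: the second step is bigger.
p1, v1 = momentum_step(1.0, 0.0, grad=1.0)
p2, v2 = momentum_step(p1, v1, grad=1.0)

# A gradient that flips direction: the step shrinks instead.
p3, v3 = momentum_step(p1, v1, grad=-1.0)
```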
Multilayer Perceptron
A multilayer perceptron (also known as a feedforward neural network) is a type of neural network that only makes use of dense layers. It is the classical type of neural network.
MLPs are very flexible in nature and work well on any number of prediction or classification problems that involve tabular data.
Use MLPs when you’re working with tabular data.
N
Natural language processing (NLP)
Natural language processing (NLP) techniques aim to automatically process, analyze and manipulate (large amounts of) language data like speech and text.
Nesterov accelerated gradient (NAG)
Nesterov accelerated gradient is a modification of momentum which introduces an additional term that 'looks ahead' at the upcoming iteration and approximates the expected parameter update. It uses this approximation to tune the amount of 'momentum' that momentum imparts on SGD. If the parameters of the upcoming iteration show that the gradient (or 'slope') is increasing, NAG will reduce the amount of 'momentum' in anticipation.
Neural network
A neural network is composed of a large number of processing units called nodes that are highly interconnected with each other and arranged in structures called layers. Through a learning process, these networks of nodes can process information in a way that lets them solve problems such as pattern recognition and data classification.
Neuron
See Node
Node
A node is the basic computation unit of a neural network. It takes the weighted sum of all of its inputs and feeds the result to an activation function to produce a single output.
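The computation of a single node can be sketched in a few lines. The optional `bias` term and the choice of tanh as the default activation are assumptions added for illustration.

```python
import math

def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs, passed through an activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

def relu(z):
    return max(0.0, z)

out_tanh = node_output([1.0, 2.0], [0.5, -0.25])
out_relu = node_output([1.0, 2.0], [1.0, 1.0], activation=relu)
```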
Noise
Noise is anything that is not the signal. Thus, noise in this sense doesn’t refer to the everyday notion of noise, like "noise in a photo caused due to poor lighting conditions". Instead it’s the abstract notion of any information contained in the data that is not relevant to modelling the relationships between the input data and the target one wishes to learn.
Normalization (concept)
Normalization is the process of ‘resizing’ values (e.g., the outputs of a layer) from their actual numeric range into a standard range of values.
This process makes all the values across features be more consistent with each other, which can be interpreted as making all the values across features of equal importance. This helps speed up the training of the network.
You should always normalize your data before you start training your network.
O
One-hot encoding
One-hot encoding is a method which allows you to convert a categorical feature into a binary vector. For example, if your variable is ‘apple_color’ and the possible values it can take are ‘Red’, ‘Yellow’ and ‘Green’, the feature values can be encoded as follows:

Red: 1 0 0

Yellow: 0 1 0

Green: 0 0 1
One-hot encoding is a simple form of embedding. Categorical encoding in the Datasets view is the same thing as one-hot encoding.
Many deep / machine learning algorithms require their input and output data to be numeric. By transforming a categorical feature into a numeric value using one-hot encoding, it can be used by a deep / machine learning algorithm, either as a feature or target value, to train a model.
Use categorical encoding with categorical features in your dataset (e.g., labels) that have a relatively small number of categories.
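A minimal sketch of the encoding for the ‘apple_color’ example follows. The function name and the order of the categories are assumptions; any fixed order works as long as it is used consistently.

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

apple_colors = ["Red", "Yellow", "Green"]
encoded = [one_hot(color, apple_colors) for color in ["Green", "Red", "Yellow"]]
```

The vector length equals the number of categories, which is why the method suits features with a relatively small number of categories.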
Optimizer (gradient descent)
An optimizer is a specific implementation of gradient descent which improves the performance of the basic gradient descent algorithm.
Optimizers aim to mitigate some of the challenges that are characteristic of gradient descent, such as convergence to suboptimal local minima and the choice of the starting learning rate and its decay.
In practice, using a gradient descent optimizer is the go-to choice for training deep learning models.
Output layer
The output layer is the last layer of a neural network and is the one which returns the results of your model (e.g., the class in a classification model or a value in a prediction model). On the Peltarion Platform this layer is represented by the target block.
Overfitting
Overfitting is the phenomenon of a model not performing well, i.e., not making good predictions, because it captured the noise as well as the signal in the training set. In other words, the model is generalizing too little and instead of just characterizing and encoding the signal it’s encoding too much of the noise found in the training set as well. (Another way to think about this is that the model is trying to fit 'too much' to the training data).
This means that the model performs well when it’s shown a training example (resulting in a low training loss), but badly when it’s shown a new example it hasn’t seen before (resulting in a high validation loss).
Oversampling
It’s the process of balancing a dataset by reusing examples of the underrepresented classes so that every class of the dataset has an equal amount of examples.
A balanced dataset allows a model to learn equal amounts of characteristics from each of the classes represented in the dataset, as opposed to one class dominating what the model learns.
Use when you have an imbalanced dataset. Oversampling is usually preferred to undersampling as data is rarely overabundant.
P
Padding
Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size.
Convolutional layers return by default a smaller image than the input. If a lot of convolutional layers are strung together, the output image is progressively reduced in size until, eventually, it might become unusable. By padding an image (i.e., "increasing" its size) before a convolutional layer, this effect can be mitigated. The relationship between padding and the output of a convolutional layer is given by:
O = (W − K + 2P) / S + 1
where O is the output size, W is the input size, K is the kernel (filter) size, P is the padding and S is the stride, all measured along the same dimension.
Use padding in convolutional neural networks.
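The effect of padding on the output size can be sketched with the standard output-size relation for a convolution along one dimension. The function and parameter names are assumptions, and integer division is used on the assumption that positions that do not fit are dropped.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# A 3x3 kernel without padding shrinks a 28-pixel side to 26 pixels.
without_padding = conv_output_size(28, 3)

# One pixel of zero padding on each side keeps the size at 28 pixels.
with_padding = conv_output_size(28, 3, padding=1)
```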
Parameter
A parameter is any internal variable of your model that is automatically adjusted during training in order to minimize the loss function. An example of parameters are each of the weights in a neural network.
Poisson loss
The Poisson loss is a loss function used for regression when modelling count data. The loss takes the form of:
L(y, ŷ) = (1/N) Σ (ŷ − y · log(ŷ))
where ŷ is the predicted expected value, y is the observed value, and the sum runs over the N examples.
Minimizing the Poisson loss is equivalent to maximizing the likelihood of the data under the assumption that the target comes from a Poisson distribution, conditioned on the input.
The Poisson loss is specifically tailored for data that follows the Poisson distribution. Examples of this are the number of customers that will enter a store on a given day, the number of emails that will arrive within the next hour, or the number of customers that will churn next week.
Use the Poisson loss when you believe that the target value comes from a Poisson distribution and want to model the rate parameter conditioned on some input.
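A small sketch of this loss, averaged over the examples, is shown below. The `eps` term is an assumption added only to keep the logarithm defined when a prediction is exactly 0.

```python
import math

def poisson_loss(y_true, y_pred, eps=1e-7):
    """Mean of (prediction - observation * log(prediction)) over all examples."""
    return sum(p - t * math.log(p + eps)
               for t, p in zip(y_true, y_pred)) / len(y_true)

# For an observed count of 3, the loss is lowest when the predicted rate is 3.
at_target = poisson_loss([3.0], [3.0])
too_low = poisson_loss([3.0], [2.0])
too_high = poisson_loss([3.0], [4.0])
```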
Pooling
Pooling is the process of summarizing or aggregating sections of a given data sample (usually the matrix resulting from a convolution operation) into a single number. This is usually done by either taking the maximum or the average value of said sections.
This operation helps you reduce the number of parameters of your model and introduces translational invariance in the features extracted by your model (i.e., your model will be less sensitive to small translations of the input data).
Use pooling in convolutional neural networks.
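As an illustration, max pooling over non-overlapping 2x2 sections of a matrix can be sketched as follows. The function name is hypothetical, and dropping a trailing odd row or column is an assumption.

```python
def max_pool_2x2(matrix):
    """Summarize each non-overlapping 2x2 section by its maximum value."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [max(matrix[r][c], matrix[r][c + 1],
             matrix[r + 1][c], matrix[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

feature_map = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 1, 2, 3],
    [0, 1, 4, 0],
]
pooled = max_pool_2x2(feature_map)
```

Replacing `max` with an average of the four values would give average pooling instead.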
Pooling layer
A pooling layer applies the pooling operation to its input (usually the output of a convolutional layer). On the Peltarion Platform, you can find three kinds of pooling layers: '2D Max pooling', '2D Average pooling' and 'Global Average pooling'.
This type of layer helps you reduce the number of parameters of your model and introduces translational invariance in the features extracted by your model (i.e., your model will be less sensitive to small translations of the input data).
Use after a convolutional layer.
Positive predictive value (PPV)
See precision.
Precision
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
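This metric is the proportion of examples of class 'A' that are correctly predicted as class 'A', with respect to all examples predicted as class 'A'. In other words, the higher the precision, the fewer examples of class 'B' will be wrongly predicted as class 'A'. It’s defined as:
Precision = TP / (TP + FP)
where TP is the number of true positives and FP is the number of false positives.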
Q
R
Random uniform initialization
Random uniform initialization is a technique used to initialize the weights of your model by assigning them values from a uniform distribution with zero mean and unit variance.
Random uniform initialization helps generate values for your weights that are simple to understand intuitively.
This used to be the standard procedure to initialize weights, but it has been superseded by other weight initialization techniques (like Glorot uniform initialization). Use random uniform initialization with dense layers, convolutional layers and LSTM units, but be aware that it may cause your network not to train as effectively as with more modern choices.
Recall
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
This metric is the proportion of examples of class 'A' that are correctly predicted as class 'A', with respect to all examples of class 'A'. In other words, the higher the recall, the fewer examples of class 'A' we will miss. It’s defined as:
Recall = TP / (TP + FN)
where TP is the number of true positives and FN is the number of false negatives.
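Precision and recall for class 'A' can both be computed from lists of actual and predicted classes, as in the sketch below. The function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall(actual, predicted, positive="A"):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual = ["A", "A", "A", "A", "B", "B"]
predicted = ["A", "A", "B", "B", "A", "B"]
precision, recall = precision_recall(actual, predicted)
```

In this example two of the three 'A' predictions are correct, while only two of the four actual 'A' examples are found.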
Recurrent neural network (RNN)
A recurrent neural network is a type of neural network that makes use of LSTM units and dense layers.
They are tailored to take advantage of sequential information, e.g., the relationships between words in a sentence or notes in music.
Use recurrent neural networks when you’re working with sequential data.
Regression predictions
Regression prediction is the process through which a trained model predicts a value or the probability of a target.
To train the model, a training set needs to have multiple labeled examples for the desired prediction target.
Regularization
Regularization discourages a model from becoming too complex, which in turn prevents the model from overfitting the training set.
Regularization artificially constrains and ultimately reduces the absolute value of the parameters of a model, by affecting the range of the values each can take and the total number of active (i.e., nonzero) parameters.
ReLU (rectified linear unit) activation function
The ReLU (rectified linear unit) is a nonlinear function that gives the same output as input if the input is above 0, otherwise the output will be 0.
It is cheap to compute and works well for many applications. It also helps prevent the vanishing gradients problem.
It is the go-to activation function for many neural networks.
Reshape
Reshape takes a data structure of any shape as input and changes its shape to a user-defined one. The only parameter is the desired shape, expressed as a list of comma-separated integers, one for each of the dimensions of the new shape. The product of all dimensions in the list must equal the product of all dimensions in the input.
Reshape is used to transform the shape of your data structure, into the one expected by parts of your model, as these sometimes do not match.
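The constraint on the product of the dimensions can be illustrated with a sketch that reshapes a flat list into nested lists. The function name is hypothetical and all dimensions are assumed to be positive integers.

```python
from math import prod

def reshape(flat_values, shape):
    """Reshape a flat list into nested lists with the given shape."""
    if prod(shape) != len(flat_values):
        raise ValueError("product of the new dimensions must match the input size")
    if len(shape) == 1:
        return list(flat_values)
    step = len(flat_values) // shape[0]  # number of values per outermost slice
    return [reshape(flat_values[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

values = list(range(6))
as_2x3 = reshape(values, (2, 3))
as_3x2 = reshape(values, (3, 2))
```

Asking for a shape such as (4, 2) would raise an error here, because 4 * 2 does not equal the 6 values in the input.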
RMSprop
RMSprop is an extension of Adagrad that deals with Adagrad’s radically diminishing learning rates. RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients. Hinton suggests setting the decay rate of this average to 0.9. A good default value for the learning rate is 0.001. This optimizer is usually a good choice for recurrent neural networks (RNN).
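The update for a single parameter can be sketched as follows. The names `decay` and `eps` are assumptions: here `decay` is the factor of the exponentially decaying average and `eps` is a small constant that only prevents division by zero.

```python
import math

def rmsprop_step(param, avg_sq_grad, grad, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update for a single parameter."""
    # Exponentially decaying average of squared gradients.
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
    # The effective learning rate is scaled by the root of that average.
    param = param - lr * grad / (math.sqrt(avg_sq_grad) + eps)
    return param, avg_sq_grad

param, avg_sq_grad = rmsprop_step(1.0, 0.0, grad=2.0)
```

Because the average decays, old gradients are gradually forgotten and the effective learning rate does not shrink towards zero the way it does when all past squared gradients are accumulated.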
S
Semantic image segmentation
Semantic image segmentation is a deep learning technique that assigns a label to every pixel of an image with the goal of associating each pixel to a class. It’s semantic because it tries to classify each pixel based on its relationships to other pixels (e.g., neighboring pixels or pixels of similar color are likely to belong to the same class).
Sequence length
Sequence length determines how many tokens are included from a text feature in each sequence. The sequence length can be between 3 and 512 tokens.
The unit of length is the token, whose precise definition depends on the language model selected in the text feature encoding.
Input that is shorter than the sequence length is padded. Text features longer than the sequence length are cut off. This may discard significant information if the sequence length is much smaller than the average text feature.
The larger the sequence length, the bigger the example size – which may result in slower training time, or your model exceeding hardware memory limits.
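The padding and cut-off behaviour can be sketched on a list of tokens. The function name and the `[PAD]` token are assumptions for illustration; the actual padding token depends on the language model.

```python
def fit_to_sequence_length(tokens, sequence_length, pad_token="[PAD]"):
    """Cut off or pad a token list so it has exactly `sequence_length` tokens."""
    if len(tokens) >= sequence_length:
        return tokens[:sequence_length]  # everything after the limit is discarded
    return tokens + [pad_token] * (sequence_length - len(tokens))

short_text = fit_to_sequence_length(["good", "movie"], 4)
long_text = fit_to_sequence_length(["a", "very", "long", "review", "indeed"], 4)
```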
Sigmoid activation function
The sigmoid activation function generates a smooth nonlinear curve that maps the incoming values between 0 and 1.
The sigmoid function works well for a classifier model but it has problems with vanishing gradients for high input values, that is, y changes very slowly for high values of x. Unlike the softmax activation function, the sum of all the outputs doesn’t have to be 1 when sigmoid is used as an activation function in the output layer. This means that each output node with a sigmoid activation function acts independently on each input, so more than one output node can fire at the same time.
The sigmoid function is often used together with the loss function binary crossentropy.
Use for binary classification or multilabel classification problems.
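A short sketch shows that sigmoid outputs are computed independently and need not sum to 1. The input values are arbitrary, and this plain formula can overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    """Map any real value to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Three output nodes, each with its own sigmoid.
outputs = [sigmoid(z) for z in (2.0, 1.0, -3.0)]
total = sum(outputs)
```

The first two outputs are both above 0.5, so in a multilabel setting both labels would be considered active at the same time.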
Signal
Softmax activation function
The softmax activation function will calculate the relative probability of each target class over all possible target classes in the dataset given the inputs it receives. In other words it normalizes the outputs so that they sum to 1, so that they can be directly treated as probabilities over the output.
This is useful for multiclass classification models, as the target class with the highest probability is going to be the output of the model.
It is often used in the final layer in a classification model with the categorical crossentropy as loss function.
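The normalization can be sketched as follows. Subtracting the maximum input before exponentiating is a common numerical-stability trick that is assumed here; it does not change the result.

```python
import math

def softmax(values):
    """Turn a list of real values into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
```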
Specificity
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
This metric is the proportion of examples of class 'B' that are correctly predicted as class 'B', with respect to all examples of class 'B'. In other words, the higher the specificity, the fewer examples of class 'B' we will miss. It’s defined as:
Specificity = TN / (TN + FP)
where TN is the number of true negatives and FP is the number of false positives.
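Specificity with respect to class 'A' can be sketched from lists of actual and predicted classes. The function name and returning 0.0 when there are no 'B' examples are assumptions.

```python
def specificity(actual, predicted, positive="A"):
    """Share of non-positive examples that are correctly predicted as such."""
    pairs = list(zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return tn / (tn + fp) if tn + fp else 0.0

actual = ["A", "B", "B", "B", "B"]
predicted = ["A", "B", "B", "B", "A"]
score = specificity(actual, predicted)
```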
Squared hinge loss
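Squared hinge loss is a loss function used for “maximum margin” binary classification problems. Mathematically it is defined as:
L(y, ŷ) = Σ (max(0, 1 − y · ŷ))²
where ŷ is the predicted value and y is either 1 or −1.
Thus, the squared hinge loss is:

0 when y · ŷ ≥ 1, i.e., when the prediction has the same sign as the true label and lies on or beyond the margin

quadratically increasing with the error when y · ŷ < 1, i.e., when the prediction has the wrong sign or lies inside the margin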
Note

ŷ should be the actual numerical output of the classifier and not the predicted label. 
The hinge loss guarantees that, during training, the classifier will find the classification boundary that is as far as possible from the data points of each of the different classes. In other words, it finds the classification boundary that guarantees the maximum margin between the data points of the different classes.
A sample use case would be when you want to classify email into ‘spam’ and ‘not spam’ and you’re only interested in the classification accuracy.
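The two regimes of the squared hinge loss can be sketched as follows, summing over the examples. The function name is hypothetical, and note that some libraries average over the examples instead of summing.

```python
def squared_hinge_loss(y_true, y_pred):
    """Sum of max(0, 1 - y * y_hat)^2, with labels y in {-1, 1}."""
    return sum(max(0.0, 1.0 - t * p) ** 2 for t, p in zip(y_true, y_pred))

# Correct side of the boundary and beyond the margin: no loss.
confident_correct = squared_hinge_loss([1], [2.0])

# Correct side but inside the margin: small loss.
inside_margin = squared_hinge_loss([1], [0.5])

# Wrong side of the boundary: the loss grows quadratically.
wrong_side = squared_hinge_loss([-1], [0.5])
```

The predictions passed in are raw numerical outputs of the classifier, not predicted labels.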
Standardization
Standardization (also known as Z-score normalization) performs a rescaling of your data so that it has a zero mean and a unit standard deviation. Values above the feature’s mean value will get positive scores, and those below the mean will get a negative score.
Applied across your Combined feature, it will guarantee that your data has a mean of zero and a standard deviation of one. This is desirable as:
Note that unlike min-max normalization, standardized data will not have the exact same scale (because standardized values are unbounded and can in principle lie anywhere between −∞ and +∞), but it will handle outliers in your data well.
Use standardization when your data has different units / scales, is not restricted to a range of values and / or when it has large outliers. Note that when you perform a standardization, you make the implicit assumption that the input data are normally distributed. However, standardization typically works well even when this is not the case.
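A minimal sketch of the rescaling is shown below. It uses the population standard deviation, and mapping a constant feature to all zeros is an assumption made to avoid a division by zero.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Assumption: a constant feature is mapped to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

Values above the mean (5 in this example) get positive scores and values below it get negative scores, but the scores are not confined to a fixed range.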
Stochastic gradient descent (SGD)
Stochastic gradient descent (SGD) is an implementation of gradient descent that approximates the real gradient of the loss function (which would be computed from all the training examples) with a gradient computed from a single training example. The parameters are updated after each example, and this is repeated until all training examples have been processed.
This method is much faster than batch gradient descent, but it doesn’t calculate the real gradient of the loss function, since it only uses one training example at a time. This has the effect that the parameter updates made by SGD can fluctuate significantly (i.e., updates can be unstable and not necessarily correspond to a global pattern in the data), which translates to a model potentially being hard to train. This can be somewhat mitigated by slowly decreasing the learning rate throughout training. SGD is also more computationally expensive than batch gradient descent, since it’s calculating the gradient much more often.
Use SGD for online learning applications.
Note

The term SGD is usually used to both describe the single example and the minibatch method, since the difference between both is only in the batch size. 
Subset
A subset is a smaller set of your dataset. For the purpose of training a model, you usually subdivide your dataset into three subsets: training set, validation set and test set.
Supervised learning
T
Tanh activation function
Tanh is a scaled sigmoid activation function. The gradient is stronger for tanh than sigmoid, that is, the derivatives are steeper.
Unlike the sigmoid function, the tanh function is zero-centered, which means that it doesn’t introduce a bias in the gradients, making training a network easier. The downside is that tanh is computationally more expensive than the sigmoid function.
Whether to use sigmoid or tanh depends on how strong you need the gradients to be. Tanh resembles a linear function more as long as the activations of the network can be kept small. This makes the tanh network easier to compute.
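The sense in which tanh is a scaled sigmoid can be checked with the identity tanh(x) = 2 · sigmoid(2x) − 1. The sketch below is only a numerical check; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """Tanh written as a rescaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

points = (-2.0, -0.5, 0.0, 0.5, 2.0)
max_difference = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in points)
```

The rescaling stretches the output range from (0, 1) to (−1, 1) and centers it on zero, which is also why the slope around zero is steeper than that of the sigmoid.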
Target
Target represents the desired output that we want our model to learn. In the case of a classification problem, the targets would be the labels of each of the examples in the training set.
Test set
A test set is a subset of your dataset that is used to check the performance of the model that was learned during training. It consists of a set of examples that the model has never seen, which help confirm its prediction accuracy.
The test set is only used after training is completed and is used to provide a final assessment of the performance of the model. Note that the validation set is not able to do this, since it was used during training to adjust the hyperparameters and/or the architecture of the model.
In short: the test set is used to assess the model’s performance (i.e., generalization and predictive power)
Training
Training is the process of building a model by setting the ideal parameters through the use of gradient descent applied on the training set.
Training example
A training example is an example that is included in your training set.
Training iteration
Training loss
Training loss is the average loss per training example of your model based on your training set.
Training set
A training set is a subset of your dataset which contains all the examples available to a neural network to create a model during training. It’s the data that gradient descent runs on, in order to adjust the parameters of the model.
In short: the training set is used to fit the model parameters (i.e., weights).
True negative rate (TNR)
See specificity.
True negatives (TN)
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
True negatives is a field in the confusion matrix which shows the cases when the actual class of the example was 'B' and the predicted class for the same example was also 'B'.
True positive rate (TPR)
See recall.
True positives (TP)
Assume a dataset that includes examples of a class 'A' and examples of class 'B' (where 'B' can stand in for a single or multiple classes). Assume further, that you’re evaluating your model’s performance to predict examples of class 'A'.
True positives is a field in the confusion matrix which shows the cases when the actual class of the example was 'A' and the predicted class for the same example was also 'A'.
U
Underfitting
Underfitting is the phenomenon of a model not performing well, i.e., not making good predictions, because it wasn’t able to correctly or completely capture the signal in the training set. In other words, the model is generalizing too much, to the point that it’s actually missing the signal.
This means that the model doesn’t perform well on training examples (resulting in a high training loss), nor on examples it hasn’t seen before (resulting in a high validation loss).
Undersampling
It’s the process of balancing a dataset by discarding examples of one or more overrepresented classes so that each has the same amount of examples.
A balanced dataset allows a model to learn equal amounts of characteristics from each one of the classes represented in the dataset, as opposed to one class dominating what the model learns.
Use when you have an imbalanced dataset. Note that oversampling is usually preferred to undersampling as data is rarely overabundant.
Unsupervised learning
Unsupervised learning is the task of learning a model from a dataset that doesn’t have labeled examples. In other words, it is the task of learning a model that captures the underlying (or hidden / latent) structures and patterns in the dataset. Examples include: clustering and dimensionality reduction.
V
Validation example
A validation example is an example that is included in your validation set.
Validation loss
Validation loss is the average loss per validation example of your model based on your validation set.
Validation set
A validation set is a subset of your dataset which contains examples available to a neural network to adjust the hyperparameters or the model architecture based on the validation loss.
The validation set is used during training to run validation examples through the model after each epoch, in order to compute the validation loss. A good model (one that generalizes well) is one where the training loss is as small as possible, while at the same time keeping the gap between the training and validation loss as small as possible.
If the validation loss is high or starts increasing during early training, training can be stopped to adjust the hyperparameters or the model architecture, in order to improve the model’s performance. Alternatively, if the validation loss starts increasing after being at a comparatively low level, training can be stopped to prevent the model from overfitting.
In short: the validation set is used to tune the model’s hyperparameters or the model architecture (e.g., learning rate, number of layers, etc.).
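Stopping training once the validation loss starts increasing can be sketched as a simple rule over the validation loss recorded after each epoch. The `patience` setting (how many epochs without improvement are tolerated) and the function name are assumptions, not settings taken from this glossary.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch index at which training would stop, or None."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# The validation loss bottoms out at epoch 2 and then rises.
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8])
```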
Vanishing gradient problem
The vanishing gradient problem is the phenomenon of the gradients calculated by gradient descent getting progressively smaller when moving backward in the network from the output layer to the input layer.
W
Weight initialization
Weight initialization is the process of assigning some starting values to the weights of your model, before starting training.
The starting values of the weights have a significant impact on the training of your model. Naïve initialization strategies, like making the initial value of all weights equal to 0, can result in your model not learning anything at all (or in other words, gradient descent is unable to converge). A good weight initialization strategy can also help prevent the vanishing / exploding gradient problem.
Always use a weight initialization strategy with dense layers, convolutional layers and LSTM units.
Weights
The weights and the operations performed on them by a neural network are the way your model is encoded in the network. A weight is a trainable parameter and its value can be thought of as the strength of the connection between different nodes. The higher the weight, the stronger the connection between two nodes or alternatively, the more important that connection is for the model to predict the target.
X
Xavier initialization
Y
Z
Z-score normalization
See standardization.