A concise overview of some crucial concepts behind Deep Learning techniques.
Prerequisites: differential calculus, linear algebra, optimization problems
Let’s define at a high level what a learning algorithm is:
An algorithm is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
From the above definition, three entities stand out:
T = {classification, regression, clustering, object detection, image recognition, question-answering, grammar correction, …}
P = {accuracy, F-beta score, cross-entropy, precision-recall curve, …}
E = {supervised, unsupervised, reinforcement learning, …}
The above definition is very general and, even if it allows one to get a general idea, it says nothing about how to build a learning algorithm.
It can be useful to start with a baseline model, in order to highlight the main differences a deep learning model has with respect to a generic machine learning one.
Baseline example: Linear Regression
In the linear regression algorithm, the final output is a predictor function of the form:

    y(x; w) = w^T x + w_0
The optimal parameters for the problem are obtained by solving, in analytic form, the equation:

    ∇_w L(w) = 0
where the loss function can be, for instance, the sum of squared errors:

    L(w) = (1/2) Σ_n ( y(x_n; w) − t_n )^2
Because of the linear form of the predictor function, the gradient equation has an analytic solution for the weights, as follows:

    w = (X^T X)^{-1} X^T t

where X is the design matrix whose n-th row is the input x_n (augmented with a leading 1 to account for the bias w_0), and t is the vector of targets.
It is known that the obtained optimal weights are the best possible ones, because the loss function above is convex: its Hessian (the matrix of its second derivatives) does not depend on the weights, so it is constant and positive semi-definite. This implies that the function has a unique global minimum.
The above algorithm is called Linear Least Squares.
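The closed-form solution above can be sketched in a few lines of NumPy. The dataset and its true coefficients here are made up for illustration; the fitted weights should land close to them:

```python
import numpy as np

# Hypothetical 1-D dataset: t is roughly linear in x, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
t = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=100)

# Augment X with a column of ones so the bias w_0 is learned jointly.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal-equation solution: w = (X^T X)^{-1} X^T t,
# computed via a linear solve rather than an explicit inverse.
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)

print(w)  # approximately [0.5, 3.0]
```

Using `np.linalg.solve` instead of inverting X^T X explicitly is the standard, numerically safer way to evaluate the normal equation.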
Of course, not all problems can be modeled using a linear function; very often, non-linearity is observed. Therefore the above algorithm, even though it returns an optimal analytic form of the model's parameters, is not always applicable.
Deep Learning approach:
In the DL approach, the final goal remains the same: obtaining an approximation function f* of the real function f. A Feedforward neural network (FFN) defines a mapping

    y = f(x; θ)

where θ denotes the model parameters.
However, the final form of the mapping is not imposed; instead, it is found automatically by the network itself.
The term feedforward derives from the fact that information flows through the function being evaluated from x, through the intermediate computations used to define f, to the output. The term network comes from the fact that FFNs are typically represented by composing together many different functions; indeed, the model is associated with a Directed Acyclic Graph (DAG) describing how the functions are composed. The term neural derives from the fact that the smallest model unit is a mathematical representation (the perceptron model) of the functional behavior of a brain neuron, as reported by neuroscience.
The final output of an FFN has the form of a recursively nested composition of functions of the inputs:

    f(x) = f^(L)( f^(L−1)( … f^(1)(x) … ) )

where L is the number of layers.
The non-linearity of the final function is not given by the recursively nested composition of several functions alone: of course, stacking together multiple linear functions yields nothing more than another linear function. Instead, each composed function is itself non-linear. The non-linearity is given by the activation function, as explained below.
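The collapse of stacked linear maps into a single linear map can be checked numerically. In this sketch (shapes and values are arbitrary), two weight matrices applied in sequence give exactly the same result as their single product matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # first "layer" weight matrix
W2 = rng.normal(size=(2, 4))   # second "layer" weight matrix
x = rng.normal(size=3)

# Two stacked linear layers (no activation function in between)...
two_layers = W2 @ (W1 @ x)
# ...collapse to a single linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```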
The FFN model can be described as a series of functional transformations. The k-th layer first constructs M linear combinations of its input variables:

    a_j^(k) = Σ_i w_ji^(k) x_i + w_j0^(k),    j = 1, …, M

where k refers to the selected layer, i runs over the dimensions of the input vector, w_ji are the weights, and w_j0 is the bias. The term a_j^(k) is the j-th input activation of the k-th layer. Each input activation is then transformed using a differentiable, non-linear function (the activation function) to give:

    z_j^(k) = h(a_j^(k))
Typical non-linear activation functions h used in applications include the logistic sigmoid, h(a) = 1/(1 + e^(−a)), which is differentiable and non-linear.
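A single layer's forward pass, as described above, can be sketched as follows; the logistic sigmoid is used as the activation h, and the layer sizes (3 inputs, M = 4 units) are illustrative:

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid: differentiable, non-linear, maps R -> (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))    # weights w_ji of one layer
b = rng.normal(size=4)         # biases w_j0
x = rng.normal(size=3)         # input vector

a = W @ x + b        # input activations a_j (linear combinations)
z = sigmoid(a)       # transformed activations z_j = h(a_j)
print(z)             # four values, each in (0, 1)
```

In a full FFN, the vector z would then serve as the input to the next layer's linear combinations.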
In the DL framework, the optimal parameters are obtained by following the same mathematical concept as in Least Squares above: a loss function is defined, and its gradient is computed with respect to the model parameters. However, differently from Linear Least Squares, the loss function is no longer convex, therefore the existence of a unique global minimum is no longer guaranteed.
Stochastic Gradient Descent (SGD)
The DL training procedure requires solving an optimization problem. The model weights are learned by moving them in the direction of decreasing values of the loss function:

    w ← w − α ∇_w L(w)
The procedure is as follows. First, the gradient of the loss function with respect to the weight variables is computed. Then, the weights are updated so as to move the loss function toward lower values. The alpha parameter is called the learning rate; it controls the speed of the learning process (it is an important parameter which needs to be tuned during the training phase).
The term stochastic comes from the fact that the gradient is computed on a batch of M input values such that M < N, where N is the total number of inputs. The small batch of M elements is extracted at random from the total set of N elements, and the process is repeated for a number E of epochs.
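Mini-batch SGD can be sketched on the same linear-regression problem used earlier, so the result is directly comparable to the closed-form solution. The dataset, learning rate, batch size, and epoch count are all illustrative choices:

```python
import numpy as np

# Hypothetical dataset: t is roughly linear in x, plus noise.
rng = np.random.default_rng(3)
N = 200
X = rng.uniform(-1, 1, size=(N, 1))
t = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=N)
X_aug = np.hstack([np.ones((N, 1)), X])   # bias column

w = np.zeros(2)          # initial weights
alpha = 0.1              # learning rate
M = 32                   # mini-batch size, M < N
epochs = 100             # number E of epochs

for _ in range(epochs):
    perm = rng.permutation(N)             # random batches each epoch
    for start in range(0, N, M):
        batch = perm[start:start + M]
        Xb, tb = X_aug[batch], t[batch]
        # Gradient of the mean squared error on the mini-batch.
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - tb)
        w -= alpha * grad                 # move toward lower loss

print(w)   # close to [0.5, 3.0]
```

On this convex problem SGD recovers (approximately) the Least Squares solution; for a deep network the same update rule applies, but only convergence to a local minimum or saddle region can be expected.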
References:
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- Aggarwal, Charu C. Linear algebra and optimization for machine learning. Springer, 2020.