A concise overview of some crucial concepts behind Deep Learning techniques.
Prerequisites: differential calculus, linear algebra, optimization problems
Let’s define at a high level what a learning algorithm is:
An algorithm is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
From the above definition, three entities stand out:
T = {classification, regression, clustering, object detection, image recognition, question-answering, grammar correction, …}
P = {accuracy, F-beta score, cross-entropy, precision-recall curve, …}
E = {supervised, unsupervised, reinforcement learning, …}
The above definition is very general and, even if it allows one to get a general idea, it says nothing about how to build a learning algorithm.
It can be useful to start with a baseline model, in order to highlight the main differences a deep learning model has with respect to a generic machine learning one.
Baseline example: Linear Regression
In the linear regression algorithm, the final output is a predictor function of the form:

    y(x; w) = w^T x + w_0
The optimal parameters for the problem are obtained by solving, in analytic form, the equation:

    ∇_w L(w) = 0
where the loss function can be, for instance, the sum of squared errors:

    L(w) = (1/2) Σ_n ( y(x_n; w) − t_n )^2
Because of the linear form of the predictor function, the gradient equation has an analytic solution for the weights, as follows:

    w = (X^T X)^{-1} X^T t

where X is the design matrix whose n-th row is the input x_n (augmented with a leading 1 to account for the bias w_0), and t is the vector of targets.
It is known that the obtained optimal weights are the best possible ones, because the loss function above is convex: its Hessian (the matrix of its second derivatives) does not depend on the weights, so it is constant and positive semi-definite. This implies that the function has a unique global minimum.
The above algorithm is called Linear Least Squares.
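The closed-form solution above can be sketched in a few lines of NumPy. The dataset and its true coefficients here are made up for illustration; the fitted weights should land close to them:

```python
import numpy as np

# Hypothetical 1-D dataset: t is roughly linear in x, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
t = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=100)

# Augment X with a column of ones so the bias w_0 is learned jointly.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal-equation solution: w = (X^T X)^{-1} X^T t,
# computed via a linear solve rather than an explicit inverse.
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)

print(w)  # approximately [0.5, 3.0]
```

Using `np.linalg.solve` instead of inverting X^T X explicitly is the standard, numerically safer way to evaluate the normal equation.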
Of course, not all problems can be modeled using a linear function; very often, non-linearity is observed. Therefore the above algorithm, even though it returns an optimal analytic form of the model's parameters, is not always applicable.
Deep Learning approach:
In the DL approach, the final goal remains the same: obtaining an approximation function f* of the real function f. A Feedforward neural network (FFN) defines a mapping

    y = f(x; θ)

where θ denotes the model parameters.
However, the final form of the mapping is not imposed; instead, it is found automatically by the network itself.
The term feedforward derives from the fact that information flows through the function being evaluated from x, through the intermediate computations used to define f, to the output. The term network comes from the fact that FFNs are typically represented by composing together many different functions; indeed, the model is associated with a Directed Acyclic Graph (DAG) describing how the functions are composed. The term neural derives from the fact that the smallest model unit is a mathematical representation (the perceptron model) of the functional behavior of a brain neuron, as reported by neuroscience.
The final output of an FFN has the form of a recursively nested composition of functions of the inputs:

    f(x) = f^(L)( f^(L−1)( … f^(1)(x) … ) )

where L is the number of layers.
The non-linearity of the final function is not given by the recursively nested composition of several functions alone: of course, stacking together multiple linear functions yields nothing more than another linear function. Instead, each composed function is itself non-linear. The non-linearity is given by the activation function, as explained below.
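The collapse of stacked linear maps into a single linear map can be checked numerically. In this sketch (shapes and values are arbitrary), two weight matrices applied in sequence give exactly the same result as their single product matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # first "layer" weight matrix
W2 = rng.normal(size=(2, 4))   # second "layer" weight matrix
x = rng.normal(size=3)

# Two stacked linear layers (no activation function in between)...
two_layers = W2 @ (W1 @ x)
# ...collapse to a single linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```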
The FFN model can be described as a series of functional transformations. The k-th layer first constructs M linear combinations of its input variables:

    a_j^(k) = Σ_i w_ji^(k) x_i + w_j0^(k),    j = 1, …, M

where k refers to the selected layer, i runs over the dimensions of the input vector, w_ji are the weights, and w_j0 is the bias. The term a_j^(k) is the j-th input activation of the k-th layer. Each input activation is then transformed using a differentiable, non-linear function (the activation function) to give:

    z_j^(k) = h(a_j^(k))
Typical non-linear activation functions h used in applications include the logistic sigmoid, h(a) = 1/(1 + e^(−a)), which is differentiable and non-linear.
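A single layer's forward pass, as described above, can be sketched as follows; the logistic sigmoid is used as the activation h, and the layer sizes (3 inputs, M = 4 units) are illustrative:

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid: differentiable, non-linear, maps R -> (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))    # weights w_ji of one layer
b = rng.normal(size=4)         # biases w_j0
x = rng.normal(size=3)         # input vector

a = W @ x + b        # input activations a_j (linear combinations)
z = sigmoid(a)       # transformed activations z_j = h(a_j)
print(z)             # four values, each in (0, 1)
```

In a full FFN, the vector z would then serve as the input to the next layer's linear combinations.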
In the DL framework, the optimal parameters are obtained by following the same mathematical concept as in Least Squares above: a loss function is defined, and its gradient is computed with respect to the model parameters. However, differently from Linear Least Squares, the loss function is no longer convex, therefore the existence of a unique global minimum is no longer guaranteed.
Stochastic Gradient Descent (SGD)
The DL training procedure requires solving an optimization problem. The model weights are learned by moving them in the direction of decreasing values of the loss function:

    w ← w − α ∇_w L(w)
The procedure is as follows. First, the gradient of the loss function with respect to the weight variables is computed. Then, the weights are updated so as to move the loss function toward lower values. The alpha parameter is called the learning rate; it controls the speed of the learning process (it is an important parameter which needs to be tuned during the training phase).
The term stochastic comes from the fact that the gradient is computed on a batch of M input values such that M < N, where N is the total number of inputs. The small batch of M elements is extracted at random from the total set of N elements, and the process is repeated for a number E of epochs.
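Mini-batch SGD can be sketched on the same linear-regression problem used earlier, so the result is directly comparable to the closed-form solution. The dataset, learning rate, batch size, and epoch count are all illustrative choices:

```python
import numpy as np

# Hypothetical dataset: t is roughly linear in x, plus noise.
rng = np.random.default_rng(3)
N = 200
X = rng.uniform(-1, 1, size=(N, 1))
t = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=N)
X_aug = np.hstack([np.ones((N, 1)), X])   # bias column

w = np.zeros(2)          # initial weights
alpha = 0.1              # learning rate
M = 32                   # mini-batch size, M < N
epochs = 100             # number E of epochs

for _ in range(epochs):
    perm = rng.permutation(N)             # random batches each epoch
    for start in range(0, N, M):
        batch = perm[start:start + M]
        Xb, tb = X_aug[batch], t[batch]
        # Gradient of the mean squared error on the mini-batch.
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - tb)
        w -= alpha * grad                 # move toward lower loss

print(w)   # close to [0.5, 3.0]
```

On this convex problem SGD recovers (approximately) the Least Squares solution; for a deep network the same update rule applies, but only convergence to a local minimum or saddle region can be expected.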
References:
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- Aggarwal, Charu C. Linear algebra and optimization for machine learning. Springer, 2020.