
Demystifying the main concepts underlying CNNs: How does a convolution filter work? What is a Pool layer? How can you write a CNN using TensorFlow/Keras?
How does a convolution work? CNNs are known to work very well with images; moreover, they reduce the number of parameters of the network, which is another great characteristic. How is this possible? The convolution filter is the key. A convolution layer produces a convolved version of the initial image. In particular, a convolution layer is defined by its own filter, a matrix containing a set of weights. Each pixel of the image is scanned in order to compute a new value for it, taking the filter's weights into account. Each time a pixel is scanned, its neighbours contribute to the final value of the selected pixel.

In the figure, a 3x3 filter is considered. At the end, a new value is obtained for the considered pixel (192). Why? The values of its first neighbours are multiplied by the corresponding weights of the filter, and the results are summed.
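With hypothetical numbers (the figure's patch and weights aren't reproduced here, so these values are made up for illustration), the computation for one pixel looks like this:

```python
import numpy as np

# Hypothetical 3x3 patch of pixel values centred on the selected pixel,
# and a hypothetical 3x3 filter of weights.
patch = np.array([[0, 0, 100],
                  [0, 0, 100],
                  [0, 0, 100]])
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# New value of the central pixel: elementwise product, then sum.
new_value = int(np.sum(patch * kernel))
print(new_value)  # 400 = 100*1 + 100*2 + 100*1
```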
This procedure makes it possible to select particular features that are unique to a specific class. For example, each human face is quite different from every other one, but of course all human faces differ from dog faces. It is these general characteristics, the ones that belong to the class "human faces", that CNNs are able to capture. Of course, these techniques were already known in the literature, and you can go deeper starting from here --> Kernel (image processing) — Wikipedia
Let’s see a practical example of how a convolution filter works. We want to really figure out how an image is changed by the action of a convolution filter.

First of all, import some libraries and select a particular image from a dataset to help with the testing. It is a grayscale image of a stairwell, and it contains enough features to play around with.
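The original snippet isn't shown. Here is a self-contained sketch; the stairwell photo described above sounds like SciPy's `ascent` image, but to keep this runnable anywhere we synthesize a stand-in grayscale image with strong vertical and horizontal edges instead:

```python
import numpy as np

# Stand-in for the stairwell photo (e.g. scipy.datasets.ascent() would
# give a real 512x512 grayscale image). We synthesize a grayscale image
# with bright vertical and horizontal lines to convolve against.
image = np.zeros((64, 64))
image[:, 16::16] = 255.0  # a few bright vertical lines
image[16::16, :] = 255.0  # a few bright horizontal lines
size_x, size_y = image.shape
```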
Now, let’s define the filter and the image where all the transformations will be stored.
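A sketch of that step (the exact snippet isn't shown; `image` is assumed to be the 2D grayscale array selected earlier, and a random stand-in is used here so the snippet runs on its own):

```python
import numpy as np

# A 3x3 vertical-edge filter whose weights sum to zero, plus a copy of
# the image in which the transformed pixel values will be stored.
filter = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

image = np.random.rand(64, 64) * 255  # stand-in grayscale image
transformed = np.copy(image)
size_x, size_y = image.shape
```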

We’ve defined a 3x3 filter for this test. Note that the filter values sum up to zero. Now, let’s define the main actions to be performed on the image in order to apply the filter. They are really straightforward: select a pixel, then multiply it and all of its first neighbours by the corresponding values of the filter, in order to build a convolved version of the original image.
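The loop described above can be sketched in plain Python like this (assuming `image` is a 2D grayscale array and `kernel` a 3x3 weight matrix):

```python
import numpy as np

def convolve(image, kernel):
    """Apply a 3x3 filter to every interior pixel of a grayscale image."""
    height, width = image.shape
    out = np.copy(image)
    for x in range(1, height - 1):
        for y in range(1, width - 1):
            # Elementwise product of the pixel's 3x3 neighbourhood with
            # the filter weights, summed into the new pixel value.
            value = np.sum(image[x - 1:x + 2, y - 1:y + 2] * kernel)
            # Clip to the valid pixel range [0, 255].
            out[x, y] = min(max(value, 0), 255)
    return out

# A bright vertical stripe responds strongly to a vertical-edge filter.
img = np.zeros((5, 5))
img[:, 2] = 255.0
sobel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
edges = convolve(img, sobel)
```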

Our final image is a convolved version of the initial image, as defined by the applied filter.
Let’s see the final result of the image.

As you can see, this particular filter detects the vertical lines present in the image very well. Another filter will help detect the horizontal ones, and so on. A set of filters will help us detect the particular features that are common to all the objects of a given class.
In the definition of a convolutional layer, the number of filters (kernels) to be used defines the number of weight matrices, one for each filter, and the shape of the filter defines the shape of each weight matrix. Indeed, the convolution operation between the input image and the filters defines the output layer composed of the hidden neurons, and the activation value of the j,k-th hidden neuron will be:
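In standard form, for an f×f filter with activation function σ (the bounds on the sums are an assumption; the earlier example used a 3x3 filter, i.e. f = 3):

```latex
\sigma\left( b + \sum_{l=0}^{f-1} \sum_{m=0}^{f-1} w_{l,m}\, a_{j+l,\,k+m} \right)
```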

where b is the shared bias, w is the shared weights matrix, and a is the input activation at position x=j+l, y=k+m.
Pooling
Let’s see how another principal component of CNNs works: the Pool layer. Essentially, the effect of pooling is to throw away all the useless information and keep only the information that is really helpful. In particular, the main objective of pooling is to maintain all the detected features while discarding everything else.
There are several types of Pool layers; in this story we’re focusing on the Max Pool layer. A Max Pool layer selects only the largest pixel value from a given sub-matrix. The dimension of that sub-matrix is defined by the pool layer’s specification.

Here, as before, is the definition of the pooling actions. We define a 4x4 Max Pool with the aim of reducing each side of the image to 1/4 of its original length, while maintaining all the detected features.
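A minimal sketch of that 4x4 Max Pool (the original snippet isn't shown; `image` is assumed to be a 2D grayscale array):

```python
import numpy as np

def max_pool(image, size=4):
    """Keep only the largest pixel in each size x size window."""
    height, width = image.shape
    pooled = np.zeros((height // size, width // size))
    for x in range(0, (height // size) * size, size):
        for y in range(0, (width // size) * size, size):
            pooled[x // size, y // size] = np.max(image[x:x + size, y:y + size])
    return pooled

# Each dimension shrinks by a factor of 4: 8x8 -> 2x2.
img = np.arange(64, dtype=float).reshape(8, 8)
pooled = max_pool(img)
```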

And finally, the image as obtained after the Max Pool layer.

Nice. For me, it’s really helpful to know what the hell is happening under the hood of the neural network I’m using, in order to know how to generalise from the standard architectures.
Now we have (hopefully) a deeper understanding of how a CNN works. Therefore, we can apply these concepts to build our own CNN for the famous MNIST dataset. But this time, we’ll really know what the hell we’re doing ;).
Building a CNN
Let’s make a recap. A convolution filter returns a convolved version of an initial image in which a specific feature stands out. A Pool layer reduces the information in that image while keeping all the detected features. These two techniques are used at the beginning of a CNN, while at the end there will be a Dense layer, whose goal is to interpret a really focused version of the image: an image that has been scanned by a Convolution layer and by a Pool layer.
We’ll use the fashion_mnist dataset for this test, and we’ll build a CNN with two convolution layers at the top and two dense layers at the bottom.
Keras will help us, thanks to its very intuitive Python API on top of TensorFlow, which speeds up the building phase. Let’s go through all the steps one by one.

First of all, we get the fashion MNIST dataset from the cloud. Then we split it into a training and a validation dataset, and finally (most important for the current scope) we reshape the training and validation images into a 4D tensor containing all the information about the shape of the input data. Indeed, the images were initially saved as a list of 60,000 items of shape 28x28; now we have a single 4D tensor of shape (60000, 28, 28, 1). This is because the convolution layer at the top wants a single tensor containing everything. Then, we constrain the image values to lie in the range [0, 1].
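A sketch of the preprocessing (the original snippet isn't shown; random stand-in arrays with a reduced sample count replace the actual download, which in Keras would come from tf.keras.datasets.fashion_mnist.load_data()):

```python
import numpy as np

# Stand-in data: in the original, these arrays come from the fashion
# MNIST download; we use 1,000 random grayscale images (instead of
# 60,000) to keep this sketch lightweight and self-contained.
train_images = np.random.randint(0, 256, (1000, 28, 28), dtype=np.uint8)
val_images = np.random.randint(0, 256, (200, 28, 28), dtype=np.uint8)

# Reshape into the 4D tensor (samples, height, width, channels) that a
# Conv2D input layer expects, then scale pixel values into [0, 1].
train_images = train_images.reshape(-1, 28, 28, 1) / 255.0
val_images = val_images.reshape(-1, 28, 28, 1) / 255.0
```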

Here is our model. It is a sequential model (inputs go through the network in a sequential, ordered way). At the top we have the first Convolution layer. Its first parameter defines the number of filters the layer will use. The choice is free; indeed, it is a hyperparameter of the model, to be tuned. We’ll use 68 filters, hoping to detect 68 useful features. Then we define the dimension of each filter: they will be 3x3 matrices. Then we select the activation function to be a ReLU (it simply outputs x if x>0, otherwise 0). Finally, we define the input shape as 28x28x1 (they are grayscale images). After each convolution layer, we apply a Max Pool of dimension 2x2, meaning that from each 2x2 sub-matrix we select just one value, the maximum, as described before. Finally, after the convolution part, we use a Flatten layer in order to pass the data to a Dense layer: a 20-neuron Dense layer with a ReLU activation function, and finally a 10-neuron Dense layer with a softmax activation function, because we want to know the probability of the image belonging to each of the 10 possible classes.
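The model described above can be written in Keras roughly as follows (a sketch; the layer counts and sizes follow the text, while the optimizer and loss are assumptions, since the compile step isn't described):

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                       # 28x28 grayscale images
    tf.keras.layers.Conv2D(68, (3, 3), activation='relu'),   # 68 filters of size 3x3
    tf.keras.layers.MaxPooling2D(2, 2),                      # keep the max of each 2x2 block
    tf.keras.layers.Conv2D(68, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),                               # hand off to the dense part
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),         # one probability per class
])

# Assumed compile settings (not specified in the text).
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

Training would then be a call like `model.fit(train_images, train_labels, epochs=5, validation_data=(val_images, val_labels))`.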
Conclusions: CNNs make it possible to build useful applications in the computer vision domain. In this brief story, two of the main concepts of CNNs were illustrated: a convolution filter aims to detect particular features in the image, and a Pool layer reduces the information in the image while maintaining the detected features.