Data Science

Convolutional Neural Network in Deep Learning

Techniques to improve the performance of convolutional models

BHUPENDRA SINGH@IIT Indore

--

Advantages of Convolutional Layers in Deep Learning

When working with images, convolutional layers offer the following advantages:

  • Fewer parameters: A small set of parameters (the kernel) is used to compute outputs over the entire image, so the model has far fewer parameters than a fully connected layer (see the sketch after this list).
  • Sparsity of connections: In each layer, each output element depends on only a small number of input elements, which makes the forward and backward passes more efficient.
  • Parameter sharing and spatial invariance: The features a kernel learns in one part of an image can be used to detect similar patterns in a different part of the same image (or in other images).
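
To make the “fewer parameters” point concrete, here is a minimal PyTorch sketch comparing parameter counts; the layer sizes (a 3×3 kernel, 16 output channels, a 32×32 RGB input) are illustrative assumptions, not values from a specific model:

```python
import torch.nn as nn

def num_params(m):
    return sum(p.numel() for p in m.parameters())

# A 3x3 convolution mapping 3 input channels to 16 output channels
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# A fully connected layer mapping a flattened 32x32 RGB image to 16*32*32 outputs
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

print(f"Conv2d parameters: {num_params(conv):,}")  # 448 (16*3*3*3 weights + 16 biases)
print(f"Linear parameters: {num_params(fc):,}")    # 50,348,032 (~50 million)
```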

Before we move on, it’s worth looking at two techniques that are commonplace in convolutional layers: padding and strides.

Padding:

When a kernel slides over an image, it cannot be centered on the pixels at the edges, so the output ends up smaller than the input. Padding solves this in a clever way: it pads the edges with extra, “fake” pixels (usually of value 0, hence the oft-used term “zero padding”). This way, the kernel can place the original edge pixels at its center while extending into the fake pixels beyond the edge, producing an output the same size as the input.
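
As a minimal sketch (assuming a 32×32 RGB input and a 3×3 kernel), here is how padding affects the output size of nn.Conv2d:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a single 32x32 RGB image (batch, channels, height, width)

no_pad = nn.Conv2d(3, 16, kernel_size=3, padding=0)
padded = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 1 pixel of zero padding on each side

print(no_pad(x).shape)   # torch.Size([1, 16, 30, 30]) -- output shrinks
print(padded(x).shape)   # torch.Size([1, 16, 32, 32]) -- same size as the input
```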

Strides:

Often when running a convolution layer, you want an output that is smaller than the input. This is commonplace in convolutional neural networks, where the spatial dimensions are reduced as the number of channels increases. One way of accomplishing this is by using a pooling layer (e.g. taking the average/max of every 2×2 grid to halve each spatial dimension). Yet another way to do it is to use a stride:

The idea of the stride is to skip some of the slide positions of the kernel. A stride of 1 means picking positions one pixel apart, i.e. every single position, which is a standard convolution. A stride of 2 means picking positions 2 pixels apart, skipping every other position and downsizing by roughly a factor of 2. A stride of 3 means picking positions 3 pixels apart, skipping two positions at a time and downsizing by roughly a factor of 3, and so on.
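
Here is a small sketch, under the same assumptions as above, showing how a stride of 2 roughly halves the spatial dimensions:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

stride1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
stride2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

print(stride1(x).shape)  # torch.Size([1, 16, 32, 32]) -- standard convolution
print(stride2(x).shape)  # torch.Size([1, 16, 16, 16]) -- spatial dims roughly halved
```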

After applying the convolution layer, we use the ReLU activation function.

The ReLU function has two main benefits:

1. It introduces non-linearity.

2. It is fast to compute, which speeds up training.
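
A minimal sketch of a convolution followed by ReLU; the channel sizes are illustrative assumptions:

```python
import torch.nn as nn

# A typical convolution block: convolution followed by ReLU
conv_relu = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),  # sets negative activations to 0, introducing non-linearity
)
```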

Then we apply pooling to progressively decrease the height and width of the output tensors from each convolutional layer.

Pooling has several advantages (a short sketch follows the list below):

  1. Reduces dimensions and computation
  2. Reduces overfitting, as there are fewer parameters
  3. Makes the model tolerant towards variations and distortions
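
For example, a 2×2 max pooling layer halves the height and width of a feature map (the tensor sizes here are assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)      # e.g. the output of a convolutional layer
pool = nn.MaxPool2d(kernel_size=2)  # take the max of every 2x2 grid

print(pool(x).shape)  # torch.Size([1, 16, 16, 16]) -- height and width halved
```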

So convolution and pooling together give us location-invariant feature detection.

We then repeat this process, stacking convolution, ReLU, and pooling layers to build the full network.

This model gives approximately 75% accuracy in classifying images from the CIFAR10 dataset, so let’s do something to improve it.

We can do several things to improve the accuracy of our convolutional model:

1. ResNets (Residual Networks)

2. Regularization

3. Data Augmentation

Before explaining these topics, let me be clear: after applying all of these techniques to our image classification model, we can achieve over 90% accuracy on the CIFAR10 dataset in less than 5 minutes of training on a single GPU. Under the three topics above, we are going to do the following:

  • Data normalization
  • Data augmentation
  • Residual connections
  • Batch normalization
  • Learning rate scheduling
  • Weight Decay
  • Gradient clipping
  • Adam optimizer

Now let’s go through them one by one:

There are a few important changes we should make while creating PyTorch datasets for training and validation:

Instead of setting aside a fraction (e.g. 10%) of the data from the training set for validation, we’ll simply use the test set as our validation set. This just gives a little more data to train with. In general, once you have picked the best model architecture & hyperparameters using a fixed validation set, it is a good idea to retrain the same model on the entire dataset just to give it a small final boost in performance.
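
A minimal sketch of this setup using torchvision (the root directory path is an assumption):

```python
from torchvision.datasets import CIFAR10
import torchvision.transforms as tt

# Train on the full training set and use the test set as the validation set
train_ds = CIFAR10(root='data/', train=True, download=True, transform=tt.ToTensor())
valid_ds = CIFAR10(root='data/', train=False, download=True, transform=tt.ToTensor())
```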

1. Channel-wise data normalization: We will normalize the image tensors by subtracting the mean and dividing by the standard deviation across each channel. As a result, the mean of the data across each channel is 0 and the standard deviation is 1. Normalizing the data prevents the values from any one channel from disproportionately affecting the losses and gradients during training simply because they have a higher or wider range of values than the others.
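
A sketch of channel-wise normalization with torchvision transforms; the channel statistics below are commonly quoted approximate values for CIFAR10, so treat them as assumptions and recompute them for your own data:

```python
import torchvision.transforms as tt

# Approximate CIFAR-10 channel means and standard deviations (assumed values)
stats = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))

normalize_tfms = tt.Compose([
    tt.ToTensor(),         # convert the PIL image to a tensor with values in [0, 1]
    tt.Normalize(*stats),  # subtract channel means, divide by channel std devs
])
```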

2. Randomized data augmentations: We will apply randomly chosen transformations while loading images from the training dataset. Specifically, we will pad each image by 4 pixels, and then take a random crop of size 32 x 32 pixels, and then flip the image horizontally with a 50% probability. Since the transformation will be applied randomly and dynamically each time a particular image is loaded, the model sees slightly different images in each epoch of training, which allows it to generalize better.
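
A sketch of the training-time transforms described above (the normalization statistics are the same assumed CIFAR10 values as before, and zero padding is used for the crop by default):

```python
import torchvision.transforms as tt

stats = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))  # assumed CIFAR-10 stats

train_tfms = tt.Compose([
    tt.RandomCrop(32, padding=4),    # pad by 4 pixels, then take a random 32x32 crop
    tt.RandomHorizontalFlip(p=0.5),  # flip horizontally with 50% probability
    tt.ToTensor(),
    tt.Normalize(*stats),
])
```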

After that, we make some changes to the model itself:

3. Model with Residual Blocks: One of the key changes to our CNN model this time is the addition of the residual block, which adds the original input back to the output feature map obtained by passing the input through one or more convolutional layers.

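A minimal sketch of a residual block; the exact layer arrangement in a ResNet9-style model may differ, so this is only meant to show the skip connection:

```python
import torch.nn as nn

class SimpleResidualBlock(nn.Module):
    """A minimal residual block: the input is added back to the conv output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()

    def forward(self, x):
        out = self.relu1(self.conv1(x))
        out = self.conv2(out)
        return self.relu2(out + x)  # residual connection: add the original input back
```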

4. Batch Normalization: After each convolutional layer, we’ll add a batch normalization layer, which normalizes the outputs of the previous layer.
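
A sketch of a convolution block with batch normalization inserted between the convolution and the ReLU (the channel arguments are placeholders):

```python
import torch.nn as nn

def conv_block(in_channels, out_channels):
    # Convolution -> batch normalization -> ReLU, a common ordering
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),  # normalize the conv outputs across the batch
        nn.ReLU(inplace=True),
    )
```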

5. Learning rate scheduling: Instead of using a fixed learning rate, we can use a learning rate scheduler, which changes the learning rate after every batch of training; this can boost the performance of our model. There are many strategies for varying the learning rate during training. The one we use is called the “One Cycle Learning Rate Policy”, which involves starting with a low learning rate, gradually increasing it batch-by-batch to a high learning rate for about 30% of the epochs, and then gradually decreasing it to a very low value for the remaining epochs.
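
A minimal sketch using PyTorch’s built-in OneCycleLR scheduler; the placeholder model, learning rate, and epoch counts are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)           # placeholder model for illustration
epochs, steps_per_epoch = 8, 100   # assumed values; use len(train_loader) in practice

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=epochs, steps_per_epoch=steps_per_epoch
)

# In the training loop, update the learning rate after every batch:
#   optimizer.step()
#   scheduler.step()
```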

6. Weight decay: We also use weight decay, yet another regularization technique that prevents the weights from becoming too large by adding an additional term to the loss function.

7. Gradient clipping: Apart from constraining the layer weights and outputs, it is also helpful to limit the values of the gradients to a small range to prevent undesirable changes in the parameters due to large gradient values. This simple yet effective technique is called gradient clipping.

8. Adam optimizer: We’ll use the Adam optimizer, which uses techniques like momentum and adaptive learning rates for faster training.
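
Here is a minimal sketch of how items 6–8 might fit together in a single training step; the placeholder model, hyperparameter values, and helper function are assumptions for illustration only:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration
criterion = nn.CrossEntropyLoss()

# Adam optimizer with weight decay (an L2 penalty on the weights); values are assumptions to tune
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)

def training_step(images, labels, grad_clip=0.1):
    loss = criterion(model(images), labels)
    loss.backward()
    # Gradient clipping: limit each gradient value to the range [-grad_clip, grad_clip]
    nn.utils.clip_grad_value_(model.parameters(), grad_clip)
    optimizer.step()
    optimizer.zero_grad()
    return loss

# Example call with random data
loss = training_step(torch.randn(4, 10), torch.tensor([0, 1, 0, 1]))
```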

When we apply all of these techniques to our model, we end up with a machine learning model with much better accuracy.

Summary:

Here’s a summary of the different techniques used to improve our model performance and reduce the training time:

  • Data normalization: We normalized the image tensors by subtracting the mean and dividing by the standard deviation of pixels across each channel. Normalizing the data prevents the pixel values from any one channel from disproportionately affecting the losses and gradients.
  • Data augmentation: We applied random transformations while loading images from the training dataset. Specifically, we padded each image by 4 pixels, took a random crop of size 32 x 32 pixels, and then flipped the image horizontally with a 50% probability.
  • Residual connections: One of the key changes to our CNN model was the addition of residual blocks, which add the original input back to the output feature map obtained by passing the input through one or more convolutional layers. We used the ResNet9 architecture.
  • Batch normalization: After each convolutional layer, we added a batch normalization layer, which normalizes the outputs of the previous layer. This is somewhat similar to data normalization, except it’s applied to the outputs of a layer, and the mean and standard deviation are learned parameters.
  • Learning rate scheduling: Instead of using a fixed learning rate, we used a learning rate scheduler, which changes the learning rate after every batch of training. There are many strategies for varying the learning rate during training; we used the “One Cycle Learning Rate Policy”.
  • Weight decay: We added weight decay to the optimizer, yet another regularization technique that prevents the weights from becoming too large by adding an additional term to the loss function.
  • Gradient clipping: We also added gradient clipping, which helps limit the values of gradients to a small range to prevent undesirable changes in model parameters due to large gradient values during training.
  • Adam optimizer: Instead of SGD (stochastic gradient descent), we used the Adam optimizer, which uses techniques like momentum and adaptive learning rates for faster training. There are many other optimizers to choose from and experiment with.

Future Work:

As an exercise, you should try applying each technique independently and see how much each one affects the performance and training time. As you try different experiments, you will start to cultivate the intuition for picking the right architectures, data augmentation & regularization techniques.

You should implement the convolutional model in your Jupyter notebook and experiment with hyperparameters like num_epochs, the learning rate, and the optimizer function.

The above-discussed techniques will improve the accuracy of our convolutional model from around 75% to over 90%.

To help you get started, I am sharing a reference notebook in the references below.

References:

1. Reference notebook to start a project on a convolutional network
2. Jovian.ai Tutorials
3. A video tutorial on PyTorch

If you loved this and found something interesting, you can give it a clap and share it with your friends.

Thanks for reading, and best of luck!
