[Ignore] Blabber: Doubly bored by the tediously long, never-ending reports that I ‘have’ to write for my assignments, I decided to give myself an overdue short break and learnt Keras.
Keras is a fairly simple library to start with for deep learning. I found it quite annoying, though, because it abstracts away so many essential mathematical steps in the computation, something I did not want after having to compute gradients for Batch Normalization just last week for a coursework.
In this post I will elaborate on how effective Batch Normalization can be in developing more robust models that train at higher learning rates without compromising much accuracy. I am using the data set from the Kaggle leaf classification competition.
Once we have loaded the data set into our favourite variables, we can proceed to build our machine learning model. Here we will make a sequential, fully connected neural network. A sequential network is one in which each block’s output depends on the output of the previous block: the output of the previous block serves as the input to the current block, which performs some further computation to produce its own output, which may or may not be input to another block. Figure 1 illustrates this — the output of Block 1 is used as input to Block 2, and so on.
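The chaining in Figure 1 can be sketched in a few lines of plain Python (the weights and the ReLU blocks here are hypothetical stand-ins, just to show how one block’s output feeds the next):

```python
import numpy as np

# A "block" here is just a function: a linear step followed by ReLU
def block(x, w):
    return np.maximum(0, x @ w)

x = np.ones((1, 4))        # dummy input
w1 = np.full((4, 3), 0.5)  # Block 1 weights (made up)
w2 = np.full((3, 2), 0.5)  # Block 2 weights (made up)

h = block(x, w1)  # output of Block 1...
y = block(h, w2)  # ...serves as the input to Block 2
```

A real network just repeats this pattern, with learned weights and more elaborate blocks.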
Similarly, we will make a neural model whose first layer has as many units as there are features, followed by 2 hidden layers with 100 units each, and lastly our output layer. The weights are initialised from a uniform distribution. Let’s look at some code to make things clearer.
```python
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(192, input_dim=192, init='uniform'))
model.add(Dense(100, init='uniform', activation='relu'))
model.add(BatchNormalization())
model.add(Dense(100, init='uniform', activation='relu'))
model.add(BatchNormalization())
model.add(Dense(99, init='uniform', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['categorical_crossentropy'])
```
Here we have defined a sequential model using Keras’s Sequential class, then added the layers to it. First comes the input layer, with as many units as there are features. Then come the hidden layers with 100 units each, with the activation function set to the rectified linear unit. We are using Stochastic Gradient Descent as our optimiser, which I have passed in as a parameter for reasons that will become clear later. Since this is a multi-class classification problem, I am using categorical cross-entropy both as the loss function and as the evaluation metric. James has explained beautifully here why this is a better idea than using mean squared error as an accuracy measure.
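As a quick numerical illustration (my own toy numbers, not from James’s post): cross-entropy penalises an unconfident prediction of the true class far more heavily than a nearly confident one, because the penalty grows logarithmically as the predicted probability shrinks.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # categorical cross-entropy for one sample: -sum(true * log(predicted))
    return -np.sum(y_true * np.log(y_pred))

target = np.array([1.0, 0.0, 0.0])       # true class is the first one
confident = np.array([0.9, 0.05, 0.05])  # high probability on the true class
unsure = np.array([0.4, 0.3, 0.3])       # correct, but only just

print(cross_entropy(target, confident))  # ~0.105
print(cross_entropy(target, unsure))     # ~0.916
```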
Since this is a multi-class classification problem in which we have to classify our data as one of 99 possible leaves, I have one-hot encoded our labels (the type of leaf). One-hot encoding creates a sparse vector whose length equals the number of classes and sets exactly one element of it to one; that vector represents the corresponding class, and we build one for every class. This way we can numerically represent all our string classes, and also compute the probability of each while training/testing our model, details of which are beyond this post.
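A minimal sketch of one-hot encoding in plain NumPy (the species names are a made-up sample of the 99 classes; in practice Keras’s np_utils.to_categorical does the same job):

```python
import numpy as np

labels = ['Acer_Opalus', 'Quercus_Rubra', 'Acer_Opalus']  # hypothetical sample

# Map each class name to an integer index
classes = sorted(set(labels))
int_labels = np.array([classes.index(l) for l in labels])

# Each row is all zeros except a single 1 at the class index
one_hot = np.eye(len(classes))[int_labels]
print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```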
We see that we have added a layer called Batch Normalization to our model. What does it do?
A major problem in a sequential model is that the output of each layer depends on the layer before it, which creates correlation between the layers: if the parameters of one layer change, the distribution of inputs to all the following layers shifts with them. This is known as Internal Covariate Shift. Batch normalization addresses this problem by normalising the output of each layer over each mini-batch, which reduces the effect that changes in one layer have on the next. This in turn allows us to use a higher learning rate to train our model. We have tested this hypothesis here by training two similar models, one with batch normalization and one without, at varying learning rates. The following code illustrates the idea. It took me nearly 12 minutes to run on an i5 machine with 8 GB of RAM:
```python
for i in range(7):
    alpha = 0.001 * (10 ** i)  # learning rates from 1e-3 up to 1e3
    sgd = SGD(lr=alpha, decay=1e-6, momentum=0.9, nesterov=True)
    model_bn = modelBN(sgd)    # model with batch normalisation
    model_nbn = modelNBN(sgd)  # model without batch normalisation
```
Finally we plot our results, and observe in Figure 2 that this is indeed true. Increasing alpha does not harm the batch-normalised model, whereas the model without batch normalisation has its cross-entropy oscillating wildly as the learning rate varies. A higher learning rate that comes at no cost in accuracy helps the model train faster.
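For intuition, the core transform a batch-normalisation layer applies at training time can be sketched in plain NumPy (my own sketch, not Keras’s actual implementation, which additionally keeps running statistics for test time):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: one mini-batch of layer outputs, shape (batch_size, units)
    mu = x.mean(axis=0)                    # per-unit mean over the batch
    var = x.var(axis=0)                    # per-unit variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

batch = np.array([[1.0, 50.0],
                  [3.0, 70.0]])
out = batch_norm(batch)
# Each column now has (near) zero mean and unit variance,
# regardless of the scale of the raw activations
```

Because every layer’s output is rescaled this way batch by batch, a large weight update upstream no longer blows up the activations downstream, which is what makes the higher learning rates tolerable.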
Some other things which could have been done include:
- varying parameters of batch normalisation
- trying with different number of hidden layers and hidden units
I have added the code on GitHub.
I have spent more than a day on this now; I will update this post with more details.