The Subtle Art of Fixing and Modifying Learning Rate

An introduction to the learning rate hyper-parameter, with principles and procedures for finding a good starting value and adapting it over the course of training.

Abinash Mohanty

Learning rate is one of the most critical hyper-parameters and has the potential to decide the fate of your deep learning algorithm. If you mess it up, then the optimizer might not be able to converge at all! It acts as a gate that controls how much the optimizer updates the parameters w.r.t. the gradient of the loss. Speaking in terms of equations, the gradient estimated from a mini-batch of m examples is:

g = (1/m) ∇_{θ} Σ_{i} L(f(x^{(i)}; θ), y^{(i)})

Using this gradient from the mini-batch, stochastic gradient descent follows the estimated downhill:

θ ← θ − ϵ g

where ϵ is the learning rate. The following figure explains the effects of learning rate on gradient descent. A very small learning rate will make gradient descent take small steps even if the gradient is big, thus slowing the process of learning. If the learning rate is high, then it becomes impossible to learn the very small changes in the parameters needed to fine-tune the model towards the end of the training process, so the error flattens out very early. If the learning rate is very high, then gradient descent takes big steps and jumps around. This can lead to divergence and thus increase the error.
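These three regimes are easy to reproduce on a toy problem. A minimal sketch (all values here are illustrative) minimizing f(θ) = θ² with plain gradient descent at three learning rates:

```python
def gradient_descent(lr, steps=50, theta=5.0):
    """Minimize f(theta) = theta^2 (gradient is 2*theta) with a fixed learning rate."""
    for _ in range(steps):
        theta -= lr * 2 * theta  # the SGD update: theta <- theta - lr * gradient
    return theta

small = abs(gradient_descent(lr=0.001))  # tiny steps: still far from the minimum
good = abs(gradient_descent(lr=0.1))     # converges close to theta = 0
huge = abs(gradient_descent(lr=1.5))     # overshoots every step and diverges

print(small, good, huge)
```

With lr=1.5 each update flips the sign of θ and doubles its magnitude, which is exactly the "jumping around and diverging" behaviour described above.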

Selecting a good starting value for learning rate:

Now the question arises: what is the best value of the learning rate, and how do we find it? A systematic way to estimate a good learning rate is to train the model initially with a very low learning rate and increase it (either linearly or exponentially) at each iteration (illustrated below). We keep doing this up to the point where the loss stops decreasing and starts to increase. That means the learning rate is too high for the application and gradient descent is diverging. For practical applications our learning rate should ideally be one or two steps smaller than this value.
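This search loop can be sketched in a few lines. A hedged sketch, where `train_step` is a stand-in for one training iteration of your own model returning the loss, and the simulated loss function exists only to make the example self-contained:

```python
def lr_range_test(train_step, lr_min=1e-6, lr_max=10.0, factor=1.2):
    """Increase the learning rate exponentially each iteration and record the loss,
    stopping once the loss blows up (i.e. gradient descent starts diverging)."""
    lr, best_loss, history = lr_min, float("inf"), []
    while lr <= lr_max:
        loss = train_step(lr)
        history.append((lr, loss))
        if loss > 4 * best_loss:  # loss started increasing sharply: stop the sweep
            break
        best_loss = min(best_loss, loss)
        lr *= factor              # exponential increase per iteration
    return history

def simulated_train_step(lr):
    # Toy stand-in: moderate learning rates improve the loss, large ones diverge.
    return 1.0 / (lr + 1e-3) if lr < 1.0 else 100.0 * lr

history = lr_range_test(simulated_train_step)
```

Plotting `history` on a log axis for the learning rate gives the curve discussed next, and a value one or two steps left of the blow-up point is the candidate starting rate.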

If we keep track of the learning rate and plot the log of the learning rate against the error, we will see a plot as shown below. A good learning rate lies somewhere to the left of the lowest point of the graph. In this case, it is between 0.001 and 0.01.

In general, no fixed learning rate works best for the entire training process. Typically we start with a learning rate found using the method described above, then change it during training to best facilitate learning. There are many different ways to accomplish this. In this blog, we will go through a few popular learning rate schedulers.

Step Decay

Step decay schedule drops the learning rate by a factor every few epochs. The mathematical form of step decay is:

ϵ_{k} = ϵ_{0} · α^{⌊k/N⌋}

where ϵ_{k} is the learning rate for the k_{th} epoch, ϵ_{0} is the initial learning rate, α is the fraction by which the learning rate is reduced, ⌊ . ⌋ is the floor operation, and N is the number of epochs after which the learning rate is dropped.

In TensorFlow, this can be done easily. To modify the learning rate we need a variable to store the learning rate and a variable to store the number of iterations.
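The formula itself is a one-liner. A minimal pure-Python sketch, with ϵ_{0}, α, and N set to illustrative values (in TensorFlow 1.x, the same staircase behaviour is what `tf.train.exponential_decay` with `staircase=True` produces):

```python
def step_decay(epoch, eps0=0.1, alpha=0.5, drop_every=10):
    """eps_k = eps0 * alpha ** floor(k / N): drop the rate by factor alpha every N epochs."""
    return eps0 * alpha ** (epoch // drop_every)

# The learning rate halves every 10 epochs:
print([step_decay(e) for e in (0, 9, 10, 25)])  # [0.1, 0.1, 0.05, 0.025]
```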

Linear or Exponential Time-Based Decay

This technique is also known as learning rate annealing. We start with a relatively high learning rate and then gradually lower it during training. The intuition behind this approach is that we’d like to traverse quickly from the initial parameters to a range of “good” parameter values but then we’d like a learning rate small enough that we can explore the “deeper, but narrower parts of the loss function” (fine tuning the parameters to get best results).

In practice, it is common to decay the learning rate until iteration τ. In the case of linear decay, the learning rate is modified in the following manner:

ϵ_{k} = (1 − α) ϵ_{0} + α ϵ_{τ}

with α = k/τ. After iteration τ, it is common to leave ϵ constant.

In the case of exponential decay:

ϵ_{k} = ϵ_{0} · α^{k/N}

In TensorFlow, this can be implemented just like step decay. In this case, we set staircase=False, which uses floating-point division and thus leads to a gradual decrease in the learning rate.
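Both schedules are easy to sketch directly. A minimal pure-Python version of the two decay rules, with τ, ϵ_{0}, ϵ_{τ}, α, and N as illustrative values:

```python
def linear_decay(k, eps0=0.1, eps_tau=0.001, tau=100):
    """eps_k = (1 - alpha) * eps0 + alpha * eps_tau with alpha = k / tau;
    after iteration tau the learning rate is held constant at eps_tau."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

def exponential_decay(k, eps0=0.1, alpha=0.5, decay_steps=10):
    """eps_k = eps0 * alpha ** (k / N): the staircase=False variant, where the
    floating-point division k / N makes the decrease smooth instead of stepped."""
    return eps0 * alpha ** (k / decay_steps)
```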

Decrease learning rate when you hit a plateau

This technique is also very popular, and it is intuitive as well. Keep using a big learning rate to quickly approach a local minimum, and reduce it once we hit a plateau (i.e. this learning rate is too big for now; we need a smaller value to be able to fine-tune the parameters further). The term plateau refers to the point when the change in loss w.r.t. training iterations is less than a threshold θ. What it essentially means is that the loss vs. iterations curve becomes flat. This is illustrated in the figure below.

This sort of custom learning rate decay scheduler can be easily implemented by making the learning rate a placeholder. We then calculate the learning rate based on some set of rules and pass it to TensorFlow in the feed_dict along with the other data (inputs, outputs, dropout ratio, etc.).
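The rule itself fits in a small class. A hedged sketch, where the threshold, patience, and reduction factor are illustrative knobs; in the feed_dict setup described above, the value returned by `step` would simply be fed in at each training iteration:

```python
class ReduceOnPlateau:
    """Cut the learning rate by `factor` when the loss improves by less than
    `threshold` for `patience` consecutive checks, i.e. the curve has flattened."""

    def __init__(self, lr=0.1, factor=0.5, threshold=1e-3, patience=3):
        self.lr, self.factor = lr, factor
        self.threshold, self.patience = threshold, patience
        self.best, self.bad_steps = float("inf"), 0

    def step(self, loss):
        if self.best - loss > self.threshold:  # meaningful improvement: reset
            self.best, self.bad_steps = loss, 0
        else:                                  # on a plateau
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr *= self.factor         # this rate is too big for now
                self.bad_steps = 0
        return self.lr
```

Keras ships a similar ready-made callback (`ReduceLROnPlateau`) if you would rather not roll your own.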

Cyclic learning rates

All the schemes we discussed so far were targeted at starting with a large learning rate and making it smaller as training progressed. Some works, like Cyclical Learning Rates for Training Neural Networks and Stochastic Gradient Descent with Warm Restarts, suggest otherwise. The underlying assumption behind cyclic learning rates is that "increasing the learning rate might have a short-term negative effect and yet achieve a longer-term beneficial effect". The authors of these works have demonstrated that a cyclical learning rate schedule which varies between two bound values can deliver results better than the traditional learning rate schedules. The intuitions behind why cyclic learning rates work are:

  1. A minimum that generalizes well should not be a sharp minimum, i.e. a slight change in the parameters should not degrade the performance. By allowing our learning rate to increase at times, we can "jump out" of sharp minima, which temporarily increases our loss but may ultimately lead to convergence on a more desirable minimum. Look at this article for a good counter-argument.
  2. Increasing the learning rate can also allow for "more rapid traversal of saddle point plateaus". As you can see in the image below, the gradients can be very small at a saddle point. Because the parameter updates are a function of the gradient, this results in our optimization taking very small steps; it can be useful to increase the learning rate here to avoid getting stuck for too long. So a periodic increase in learning rate can help in quickly escaping saddle points.

The codes for cyclic learning rate can be found here.
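The simplest "triangular" policy from the cyclical learning rates paper can be sketched in a few lines; the bounds and step size below are illustrative:

```python
def triangular_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclic schedule: the learning rate ramps linearly from base_lr
    up to max_lr and back down, completing one full cycle every 2 * step_size
    iterations (after Smith, 'Cyclical Learning Rates for Training Neural Networks')."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)  # in [0, 1]: distance from the peak
    return base_lr + (max_lr - base_lr) * (1 - x)
```

The two bound values would themselves be chosen with the range test described earlier: base_lr a bit below the good region, max_lr near where the loss starts to rise.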

As an end note, there is no single learning rate schedule that works best for all applications and architectures. I almost always start my experiments with the linear decay scheme (decaying the learning rate until some iteration and then keeping it constant). You can always start with one scheme and, if the training error doesn't go down as expected, try another. Happy coding.