A
gentle introduction to Gradient Descent
One way to think
about machine learning is as a series of “Optimization” problems
i.e. finding the minimum or maximum of a mathematical function (often
called a loss function). Gradient descent is a workhorse for
optimization and is often employed in neural networks.
Imagine that you are
hiking through a landscape that looks like the Chocolate Hills of the
Philippines and that your objective is to get to the lowest point
amongst these hills.
But wait a second!
What does this have to do with machine learning? In fact, it is so
fundamental, that it is easy to miss the intuition! Let’s say you
are trying to build a predictive model. The way you would build the
perfect model is by comparing your prediction against the actual data
and constructing what is known as a loss function. Simplistically,
that could be :
Loss
function = Your prediction – Actual data
Optimization in this
scenario would be to find the model that reduces the error between
your prediction and the actual data. Said in other words,
optimization aims to find the model that provides the lowest possible
value of the loss function. This hilly landscape can be imagined as
the error for various models and finding the lowest point in the
landscape corresponds to the perfect model!
The ‘objective’
function you are trying to optimize during the hike is your height
above sea level. The parameters of this function, known as the
objective, are your latitude and longitude. Gradient descent offers a
set of techniques to iteratively locate the lowest (or highest)
point. Using gradient descent, you can figure out which direction to
take your steps in and how long each step should be.
Say that you’ve
stopped to take in the beautiful view of the hills and you stretch
one foot out in every direction around you to feel the ground. The
tilt of your feet as it lands on the ground is the gradient or the
slope. This is exactly the ‘gradient’ in gradient descent too!
In logical terms,
all of gradient descent can be encapsulated within the following few
lines:
Set initial random parameters (say latitude,
longitude = 43N,45W)
Calculate height above sea level at these
initial parameters (objective function)
Repeat until change in objective function is
very small or you’re tired of walking (number of steps has reached
a large value):
Compute
gradient with given parameter values (lat, long)
New parameters = Old parameters – Learning
rate * Gradient
Calculate height above sea level (objective
function) at these new parameters (new objective function)
Is the change in
your height above sea level very small? If so, then stop
The learning rate
used above determines how fast you arrive at a solution (whether
optimal or not). A large learning rate would imply that the length of
each step you take during your hike is large. You would ‘learn’
quickly with a large learning rate, but you also risk missing the
lowest point in the hill, because you may have just stepped over it.
A smaller learning rate could be exhaustive, but might make for a
very slow process. Imagine a 6 ft human being hiking through these
hills versus an 18ft troll. The human being might cover the entire
hilly landscape thoroughly but take months, while a troll might miss
the lowest point, but take only a few hours.
There are three
variants in gradient descent and they all differ in what information
you use to determine your direction of travel and length of step.

Batch gradient descentThe gradient in batch gradient descent is calculated using the entire available training data set. Although this ensures that for convex landscapes a global minima will always be found, you can imagine it being very slow. Before every step you (the hiker) takes, you compute the gradient based on all available data.

Stochastic gradient descentIn this case, before every step that you, as the hiker, take, you stochastically (or randomly) select a single training data point and use it to compute the gradient. This can lead to a lot of variation and less evenly taken steps compared to batch gradient descent since now, the new step depends on a randomly chosen data point. The nice thing though is that it is much faster than the batch method.

Minibatch gradient descentThis is the chicago blend popcorn of the gradient descent world! It uses small batches of training examples to compute the gradient (not all like in batch gradient descent and not one like in stochastic gradient descent). A combination of stability of parameter updates as well as manageable computation time makes this technique an attractive option when choosing gradient descent for optimization problems.