A gentle introduction to Gradient Descent
One way to think about machine learning is as a series of “Optimization” problems i.e. finding the minimum or maximum of a mathematical function (often called a loss function). Gradient descent is a workhorse for optimization and is often employed in neural networks.
Imagine that you are hiking through a landscape that looks like the Chocolate Hills of the Philippines and that your objective is to get to the lowest point amongst these hills.
But wait a second! What does this have to do with machine learning? In fact, it is so fundamental, that it is easy to miss the intuition! Let’s say you are trying to build a predictive model. The way you would build the perfect model is by comparing your prediction against the actual data and constructing what is known as a loss function. Simplistically, that could be :-
Loss function = Your prediction – Actual data
Optimization in this scenario would be to find the model that reduces the error between your prediction and the actual data. Said in other words, optimization aims to find the model that provides the lowest possible value of the loss function. This hilly landscape can be imagined as the error for various models and finding the lowest point in the landscape corresponds to the perfect model!
The ‘objective’ function you are trying to optimize during the hike is your height above sea level. The parameters of this function, known as the objective, are your latitude and longitude. Gradient descent offers a set of techniques to iteratively locate the lowest (or highest) point. Using gradient descent, you can figure out which direction to take your steps in and how long each step should be.
Say that you’ve stopped to take in the beautiful view of the hills and you stretch one foot out in every direction around you to feel the ground. The tilt of your feet as it lands on the ground is the gradient or the slope. This is exactly the ‘gradient’ in gradient descent too!
In logical terms, all of gradient descent can be encapsulated within the following few lines:-
Set initial random parameters (say latitude, longitude = 43N,45W)
Calculate height above sea level at these initial parameters (objective function)
Repeat until change in objective function is very small or you’re tired of walking (number of steps has reached a large value):
Compute gradient with given parameter values (lat, long)
New parameters = Old parameters – Learning rate * Gradient
Calculate height above sea level (objective function) at these new parameters (new objective function)
Is the change in your height above sea level very small? If so, then stop
The learning rate used above determines how fast you arrive at a solution (whether optimal or not). A large learning rate would imply that the length of each step you take during your hike is large. You would ‘learn’ quickly with a large learning rate, but you also risk missing the lowest point in the hill, because you may have just stepped over it. A smaller learning rate could be exhaustive, but might make for a very slow process. Imagine a 6 ft human being hiking through these hills versus an 18ft troll. The human being might cover the entire hilly landscape thoroughly but take months, while a troll might miss the lowest point, but take only a few hours.
There are three variants in gradient descent and they all differ in what information you use to determine your direction of travel and length of step.
Batch gradient descentThe gradient in batch gradient descent is calculated using the entire available training data set. Although this ensures that for convex landscapes a global minima will always be found, you can imagine it being very slow. Before every step you (the hiker) takes, you compute the gradient based on all available data.
Stochastic gradient descentIn this case, before every step that you, as the hiker, take, you stochastically (or randomly) select a single training data point and use it to compute the gradient. This can lead to a lot of variation and less evenly taken steps compared to batch gradient descent since now, the new step depends on a randomly chosen data point. The nice thing though is that it is much faster than the batch method.
Mini-batch gradient descentThis is the chicago blend popcorn of the gradient descent world! It uses small batches of training examples to compute the gradient (not all like in batch gradient descent and not one like in stochastic gradient descent). A combination of stability of parameter updates as well as manageable computation time makes this technique an attractive option when choosing gradient descent for optimization problems.