Optimization — Descent Algorithms
In this post, we will see several basic optimization algorithms that you can use in various data science problems.
Many algorithms used in Machine Learning are based on basic mathematical optimization methods. Discovering these algorithms directly in the context of Machine Learning can be confusing because of all the prerequisites. Thus, I think it might be a good idea to look at these algorithms free of any Machine Learning context, to get a better understanding of how they work.
Descent Algorithms
Descent algorithms are meant to minimise a given function, that's it. Really. These algorithms proceed iteratively, meaning that they successively improve their current solution. You might think:
What if I want to find the maximum of a function?
Simply add a minus sign in front of your function, and it becomes a “min” problem!
Let’s dive in. This is our problem definition:
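$$\min_{x \in \mathbb{R}^n} f(x)$$

Where $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is the function we want to minimise.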
One prerequisite you must know is that if a point is a minimum, a maximum, or a saddle point (a point that behaves like a minimum along some directions and a maximum along others), then the gradient of the function is zero at that point.
Descent algorithms consist of building a sequence {x} that will converge towards x* (arg min f(x)). The sequence is built the following way:
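$$x_{k+1} = x_k + d_k$$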
Where k is the iteration, and d is a vector of the same size as x, called the descent vector. Then, this is what the algorithm looks like:
```
x = x_init
while ||gradient(f(x))|| > epsilon:
    x = x + d
```
That’s it! We keep doing the update until the norm of the gradient is small enough (as it should reach a zero value at some extremum).
We will see three different descent directions: Newton's direction, the gradient direction, and the gradient direction with an optimal step size. First, we need to define a function that we will try to minimise during our experiments.
Rosenbrock Function
I chose the Rosenbrock function, but you may find many others, here for instance. Another good one would be Himmelblau’s function.
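The Rosenbrock function is defined as:

$$f(x, y) = (a - x)^2 + b\,(y - x^2)^2$$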
It has a global minimum at (x, y)=(a, a²) where f(a, a²) = 0. I will use a=1, b=100 which are commonly used values.
We will also need two other pieces of information: the gradient of that function, as well as its hessian matrix.
Let’s open up a file and start a Python script. I will do this in a Google Colab, and all the code used in this post will be available here:
Here is our first piece of code.
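It looks roughly like this (a minimal sketch; the function names are my own shorthand, the exact code lives in the Colab, and the gradient and hessian are written out by hand):

```python
import numpy as np

a, b = 1, 100  # the commonly used Rosenbrock parameters

def rosenbrock(x):
    # f(x, y) = (a - x)^2 + b * (y - x^2)^2
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    # Partial derivatives with respect to x and y
    return np.array([
        -2 * (a - x[0]) - 4 * b * x[0] * (x[1] - x[0] ** 2),
        2 * b * (x[1] - x[0] ** 2)
    ])

def rosenbrock_hessian(x):
    # Matrix of second-order partial derivatives
    return np.array([
        [2 - 4 * b * (x[1] - 3 * x[0] ** 2), -4 * b * x[0]],
        [-4 * b * x[0], 2 * b]
    ])
```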
From now on, I will refer to the function input vector as x, akin to the problem definition earlier. Now that we are ready, let’s see the first descent vector!
Newton’s direction
Newton’s direction is the following:
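$$d_k = -\left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$

Where $\nabla f(x_k)$ is the gradient and $\nabla^2 f(x_k)$ the hessian matrix of f at $x_k$.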
So the update is:
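$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$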
Quick note n°1
You can find this update formula by doing the 2nd order Taylor expansion of f(x + d), since the update we are performing is x_new = x + d.
We want to find d such that f(x + d) is as low as possible. Supposing f''(x) is positive, this expansion is a parabola in d that has a minimum. That minimum is reached when the derivative of the expansion with respect to d is zero.
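In one dimension:

$$f(x + d) \approx f(x) + f'(x)\,d + \frac{1}{2} f''(x)\,d^2$$

Setting the derivative with respect to d to zero gives:

$$f'(x) + f''(x)\,d = 0 \quad \Longrightarrow \quad d = -\frac{f'(x)}{f''(x)}$$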
In n-dimensions, f’’(x) becomes the hessian matrix, and 1/f’’(x) shows up as the inverse hessian matrix. Finally, f’(x) will be the gradient.
Quick note n°2
We need to compute the inverse of the hessian matrix. For big matrices, this is a very computationally intensive task. Therefore, in practice, we solve this a bit differently, but in a totally equivalent manner.
Instead of computing the inverse of the hessian matrix, we solve the linear system below for g and make the update rule the following:
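$$\nabla^2 f(x_k)\, g = \nabla f(x_k), \qquad x_{k+1} = x_k - g$$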
Let’s code the algorithm now:
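A minimal sketch, reusing the Rosenbrock helpers above; the tolerance and iteration cap defaults are placeholders, and np.linalg.solve takes care of the linear system from the note above:

```python
def newton(grad, hessian, x_init, epsilon=1e-10, max_iterations=100):
    x = x_init
    iterations = 0
    while np.linalg.norm(grad(x)) > epsilon and iterations < max_iterations:
        # Solve hessian(x) @ g = grad(x) instead of inverting the hessian
        g = np.linalg.solve(hessian(x), grad(x))
        x = x - g
        iterations += 1
    return x, iterations
```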
You will notice a small difference from the algorithm I presented at the beginning: I added a max_iterations parameter, so that the algorithm doesn't run indefinitely if it doesn't converge. Let's try it.
We get this result:
x* = [1. 1.]
Rosenbrock(x*) = 0.0
Grad Rosenbrock(x*) = [0. 0.]
Iterations = 2
The algorithm converged in only 2 iterations! That’s really fast. You might think:
Hey, the initial x is very close to the target x*, that makes the task easy!
You're right. Try some other values; for instance, with x_init = [50, -30] the algorithm terminates in 5 iterations.
This algorithm is called Newton's method, and all descent algorithms are modifications of it! It's kind of the mother formula. The reason it's really fast is that it uses second order information (the hessian matrix).
Using the hessian matrix, however, comes at a cost: efficiency. Computing an inverse matrix is a computationally intensive task, so mathematicians came up with solutions to overcome this problem. Mainly: Quasi-Newton methods, and Gradient methods. Quasi-Newton methods try to approximate the inverse of the hessian matrix with various techniques, whereas Gradient methods simply stick to first order information.
Gradient’s direction
If you've done some Machine Learning, you've probably seen this already. The gradient direction:
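$$d_k = -\alpha\, \nabla f(x_k)$$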
Where α is called the step size (or learning rate in ML), and is a real number.
If you have been doing some Machine Learning, now you know this formula is actually part of a bigger one: Newton’s direction, except we replaced the inverse hessian with a constant! The update rule now is:
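$$x_{k+1} = x_k - \alpha\, \nabla f(x_k)$$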
The algorithm becomes:
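A sketch along the same lines, with alpha, epsilon, and max_iterations left as arguments to tune:

```python
def gradient_descent(grad, x_init, alpha, epsilon, max_iterations):
    x = x_init
    iterations = 0
    while np.linalg.norm(grad(x)) > epsilon and iterations < max_iterations:
        # The inverse hessian is replaced by a constant step size alpha
        x = x - alpha * grad(x)
        iterations += 1
    return x, iterations
```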
Let’s try it out:
You can tweak the values of alpha, epsilon, and max_iterations; I came up with values that give a result comparable to Newton's method. This is the result:
x* = [0.99440769 0.98882419]
Rosenbrock(x*) = 3.132439308613923e-05
Grad Rosenbrock(x*) = [-0.00225342 -0.00449072]
Iterations = 5000
Wow! Gradient descent took 5000 iterations, whereas Newton's method took only 2! Moreover, the algorithm didn't completely reach the minimum point (1, 1).
The main reason this algorithm converged so slowly compared to Newton's method is that we no longer have the information given by the second derivative of f; we replaced the inverse hessian with a plain constant.
Think about it. The derivative of a function is the rate of change of that function, so the hessian gives information about the rate of change of the gradient. Since the minimum necessarily has a zero gradient, the hessian becomes super useful, as it tells you whether the gradient is going up or down.
Many papers in ML are just about finding a better approach for this specific step. Momentum, Adagrad, or Adadelta are some examples.
Gradient’s direction + Optimal step size
One improvement to classical gradient descent is to use a variable step size at each iteration, not a constant. Not only is it going to be a variable step size, it's also going to be the best possible step size.
The update is:
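$$x_{k+1} = x_k - \alpha_k\, \nabla f(x_k)$$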
How do we find α? Since we want this update to be as efficient as possible, i.e. to minimise f as much as possible, we are looking for α such that:
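$$\alpha_k = \underset{\alpha \geq 0}{\arg\min}\; f\big(x_k - \alpha\, \nabla f(x_k)\big)$$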
Notice that at this step, x and grad(x) are constants. Therefore, we can define a new function q:
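$$q(\alpha) = f\big(x_k - \alpha\, \nabla f(x_k)\big)$$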
Where q is actually a function of one variable. And we want to find the α that minimises this function. Umm… Gradient descent? We could, but while we’re at it, let’s learn a new method: Golden Section Search.
Golden Section Search aims at finding the extremum (minimum or maximum) of a unimodal function inside a specified interval. Since we search for α in the range [0, 1], this is the perfect opportunity to use this algorithm.
As this post is starting to get pretty long, I'm not going to go into the details. Hopefully, with the help of that magnificent GIF I took ages to make, and the code below, you'll be able to understand what's happening here.
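Here is a compact sketch of it, searching for a minimum of q on an interval [a, b] (the tolerance default is a placeholder, and this simple version re-evaluates q at every step instead of reusing previous evaluations):

```python
def golden_section_search(q, a, b, tolerance=1e-8):
    inv_phi = (np.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618
    while abs(b - a) > tolerance:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        # Keep the sub-interval that must contain the minimum
        if q(c) < q(d):
            b = d
        else:
            a = c
    return (a + b) / 2
```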
Now that we are able to find the best α, let’s code gradient descent with optimal step size!
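A sketch combining the pieces above; again, epsilon and max_iterations are left as arguments to tune:

```python
def gradient_descent_optimal_step(f, grad, x_init, epsilon, max_iterations):
    x = x_init
    iterations = 0
    while np.linalg.norm(grad(x)) > epsilon and iterations < max_iterations:
        g = grad(x)
        # q(alpha) = f(x - alpha * grad(x)): a one-variable function of alpha
        q = lambda alpha: f(x - alpha * g)
        alpha = golden_section_search(q, 0.0, 1.0)
        x = x - alpha * g
        iterations += 1
    return x, iterations
```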
Running it with a suitable starting point, epsilon, and max_iterations, we get the following result:
x* = [0.99438271 0.98879563]
Rosenbrock(x*) = 3.155407544747055e-05
Grad Rosenbrock(x*) = [-0.01069628 -0.00027067]
Iterations = 3000
Even though in this case the results are not significantly better than pure gradient descent, generally the optimal step size performs better. For instance, I tried the same comparison with Himmelblau’s function, and gradient descent with optimal step size was more than twice as fast as pure gradient descent.
Conclusion
This is the end of this post. I hope you learned some new things that triggered your curiosity for mathematical optimization! There are tons of other interesting methods. Go find them! Don’t forget to check out the Google Colab file, you will find all the code used and the same tests we did here with Himmelblau’s function. Don’t hesitate to leave a comment, and until next time, peace! :)