Omar Aflak
1 min read · Apr 4, 2019


Hey Subhan, I guess you’re talking about the convergence of the gradient descent algorithm. Under the right conditions, the algorithm converges almost surely, at least to a local minimum: https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Iterative_method.

I’m probably not informed enough on the subject to give you mathematical proofs, so I’ll let you search for yourself.

That said, I can tell you a few ways to improve the network:

  1. In the current version of the code, we set a fixed number of epochs for the network to train on. We could instead set a stop-condition on the error: when the error falls below some threshold, stop the training.
  2. We can improve convergence by summing the gradients over a batch of inputs and then updating (this is called mini-batch gradient descent). Currently, we update the parameters after every single input (stochastic gradient descent, i.e. mini-batch size 1).
  3. We currently use a constant learning rate, which is not ideal: as the parameters converge to specific values, we want the updates to become smaller and smaller. Basically, we need an adaptive learning rate. You can look up Momentum, Adagrad, Adadelta, and Adam. This paragraph gives good insight into what Momentum is about: https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum
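To make points 1 and 2 concrete, here is a minimal NumPy sketch on a toy problem, a single linear neuron fitting y = 2x + 1. All names here (`batch_size`, `tolerance`, etc.) are my own illustrative choices, not taken from the article’s code:

```python
# Toy sketch of an error-based stop-condition (point 1) and
# mini-batch updates (point 2) on a single linear neuron.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 2 * x + 1

w, b = 0.0, 0.0
lr = 0.5
batch_size = 10       # point 2: update once per batch, not per sample
tolerance = 1e-6      # point 1: stop when the error is small enough

for epoch in range(10_000):
    perm = rng.permutation(len(x))
    for i in range(0, len(x), batch_size):
        idx = perm[i:i + batch_size]
        xb, yb = x[idx], y[idx]
        err = w * xb + b - yb
        w -= lr * 2 * np.mean(err * xb)  # MSE gradient averaged over the batch
        b -= lr * 2 * np.mean(err)
    mse = np.mean((w * x + b - y) ** 2)
    if mse < tolerance:                  # stop-condition on the error
        break

print(round(float(w), 2), round(float(b), 2))  # ≈ 2.0 and 1.0
```

The same structure drops straight into the article’s network: accumulate the layer gradients over a batch before calling the update, and check the epoch’s average error against a threshold instead of counting epochs.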
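For point 3, a minimal sketch of the Momentum update, where the `velocity` name and the hyper-parameter values are my own hypothetical choices (plain SGD is the special case momentum = 0):

```python
# Momentum keeps a decaying running sum of past gradients ("velocity")
# and moves the parameter along it, which damps oscillations and
# speeds up progress along consistent directions.
def momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Minimizing f(w) = w**2 (gradient 2*w), starting from w = 5.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, 2 * w, v)
print(w)  # close to the minimum at 0
```

Adagrad, Adadelta, and Adam build on the same idea but also rescale the learning rate per parameter from running gradient statistics.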

I hope you’ll find answers to your questions!

Thank you for the feedback!
