This post was inspired by an assignment that Yiftach Neumann and I submitted in the course Optimization.
In the first part, we described common optimization concerns in machine learning. Then, we introduced three widely used optimization methods - Momentum, RMSProp and ADAM.
In this second (and last) part, we implement the aforementioned algorithms, show simulations and compare them.
Algorithms, Implementations and Evaluation
We provide an implementation of the aforementioned algorithms in the Python notebook:
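As a quick reference alongside the notebook, the core update rules can be sketched as follows. The function names and default hyper-parameters here are our own illustration, not the notebook's exact code:

```python
import numpy as np

def gradient_descent_step(x, grad, lr=0.01):
    # Plain Gradient Descent: step directly against the gradient.
    return x - lr * grad

def momentum_step(x, grad, v, lr=0.01, beta=0.9):
    # Momentum: accumulate an exponentially decaying velocity and step along it.
    v = beta * v + grad
    return x - lr * v, v

def rmsprop_step(x, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    # RMSProp: divide each dimension by a running average of its squared gradient.
    s = beta * s + (1 - beta) * grad ** 2
    return x - lr * grad / (np.sqrt(s) + eps), s

def adam_step(x, grad, m, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # ADAM: Momentum-style first moment plus RMSProp-style second moment,
    # both bias-corrected using the step counter t (starting at 1).
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```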
Evaluation
We can evaluate each optimizer by observing how well it performs the optimization on various functions. Optimizers usually have hyper-parameters, such as the learning rate, and these hyper-parameters have a dramatic effect on the optimizer's behaviour, as clearly seen in the figure to the right.
For this reason, we compare optimizers only after tuning their hyper-parameters: we ran an exhaustive search over \(\sim50\) different values for each parameter and chose the combination that yielded the best value in the smallest number of iterations. The search is computationally expensive and might not be feasible on large data-sets, so the results shown here may be better than those seen in practice. It is also possible that some algorithms perform similarly under their best hyper-parameters even though, for most parameter choices, one is clearly superior to the other. Still, this remains the fairest comparison method available, giving each optimizer its best shot.
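A sketch of that tuning procedure is given below. It is an illustration of the idea rather than the notebook's exact code; `run_optimizer` is a hypothetical helper that runs a single optimizer to convergence and returns the final value and the number of iterations used:

```python
import itertools
import numpy as np

def tune(run_optimizer, param_grid):
    # Exhaustive search: try every combination in the grid and keep the one
    # that reaches the lowest value, breaking ties by iteration count.
    best_key, best_params = None, None
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        value, iterations = run_optimizer(**params)  # hypothetical helper
        key = (value, iterations)
        if best_key is None or key < best_key:
            best_key, best_params = key, params
    return best_params

# Example grid: ~50 learning rates on a log scale plus a few momentum values.
grid = {"lr": np.logspace(-4, 0, 50), "beta": [0.8, 0.9, 0.99]}
```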
Low Dimensional Functions
We implemented three test functions, ranging from simple to more challenging, over \(\mathbb{R}\) or \(\mathbb{R}^{2}\) (a code sketch of their definitions follows the list):
- “Cubic Function” \(\left(x^{2}\right)\) (despite the name, a simple quadratic),
- “Mean Squared Error” \(\left(x-y\right)^{2}\), and
- “Hard Polynomial” \(\left(x^{2}-3\sin\left(x\right)x+1\right)\) , which is drawn in the figure to the right.
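In code, these test functions and their gradients could be written roughly as follows (our own illustrative definitions of the formulas above, not the notebook's exact code):

```python
import numpy as np

def cubic(x):
    # "Cubic Function": despite the name, a simple quadratic over R.
    return x ** 2

def cubic_grad(x):
    return 2 * x

def mse(p):
    # "Mean Squared Error" over R^2, with p = (x, y).
    x, y = p
    return (x - y) ** 2

def mse_grad(p):
    x, y = p
    return np.array([2 * (x - y), -2 * (x - y)])

def hard_poly(x):
    # "Hard Polynomial": the oscillating term adds dips along the parabola.
    return x ** 2 - 3 * np.sin(x) * x + 1

def hard_poly_grad(x):
    return 2 * x - 3 * (np.cos(x) * x + np.sin(x))
```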
We then ran every optimizer on each of these functions, starting at different initial points. As mentioned before, all optimizers’ hyper-parameters were tuned. The results are detailed in Appendix A.
The results were quite surprising. On the simpler functions (Cubic and MSE), all of the optimizers converged to the global minimum. RMSProp and ADAM found a marginally lower point, but at the price of a much larger number of iterations. Thus, on simple convex functions, Gradient Descent and Momentum converge faster without any real loss in performance.
On the more challenging function, “Hard Polynomial”, Momentum had a clear advantage. From easy initial points, for example \(x=2\), all of the optimizers reached the global minimum. But from points farther up the slope of the function, such as \(x=401\), Momentum was the only algorithm that consistently converged to the global minimum within a reasonable number of iterations. Gradient Descent, with a proper choice of learning rate, also performed well from most of the initial points.
It is also clear that near the minimum, ADAM and RMSProp converge slowly, a behaviour that was consistent across all of the functions and initial points, highlighting the importance of learning rate decay methods.
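One simple decay schedule, shown purely as an illustration (we did not use decay in these experiments), shrinks the step size as iterations accumulate:

```python
def decayed_lr(lr0, t, decay=0.01):
    # Inverse-time decay: the learning rate shrinks over time, which helps an
    # optimizer settle once it is already close to a minimum.
    return lr0 / (1 + decay * t)
```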
High Dimensional Cubic Functions
Next, we compared the optimizers on high dimensional cubic functions, for example \(x_{1}^{2}+x_{2}^{4}+...+x_{n}^{2}\). These functions are interesting because one of the main motivations behind Momentum, RMSProp and ADAM is to allow a different step size in each dimension, suppressing oscillations along dimensions with steeper slopes and thereby permitting a larger learning rate overall.
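To make the per-dimension intuition concrete, here is a small numerical illustration (our own sketch) of how RMSProp's scaling evens out a gradient whose coordinates differ wildly in magnitude:

```python
import numpy as np

# f(x) = x1^2 + x2^4: away from the origin, the second coordinate is far steeper.
def grad(x):
    return np.array([2 * x[0], 4 * x[1] ** 3])

x = np.array([3.0, 3.0])
g = grad(x)                        # [6, 108]: the raw gradient is badly unbalanced
s = 0.1 * g ** 2                   # RMSProp's running average after one step (beta = 0.9)
scaled = g / (np.sqrt(s) + 1e-8)   # [~3.16, ~3.16]: both coordinates end up the same size
print(g, scaled)
```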
The results do not reflect this intuition: Gradient Descent needs the lowest number of iterations by far, and ADAM consistently converges the slowest. In terms of the value reached, ADAM still converges to a lower point than the rest of the optimizers; Momentum achieves better results than Gradient Descent, albeit with a small increase in the number of iterations, and RMSProp performs the worst while requiring a very large number of iterations. For the full table of results, please refer to the appendix.
One of the main theoretical advantages of RMSProp and ADAM is the ability to choose a larger learning rate while reducing oscillations in dimensions with steeper slopes, and we chose these functions specifically to highlight that behaviour. However, the plot above clearly shows that Momentum and Gradient Descent drop much faster towards the local minima, which raises questions about how much this theoretical advantage matters in practice. It is possible that ADAM's popularity in Deep Learning stems from another advantage, one that comes into play on the loss surfaces arising from the data-sets typical of that domain.
High Dimensional Hard Polynomial
We implemented a function that is highly challenging for optimizers, as seen from the plot to the left:
The function is defined as \(x_{1}^{2}+x_{2}^{2}x_{1}\cos\left(x_{2}\right)+\ldots+x_{n}^{2}x_{1}\cos\left(x_{n}\right)+2n\). In many cases, the optimizers got lost in a “bottom-less pit”, or missed a local minimum and yielded very poor results.
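For concreteness, this function and its gradient could be implemented along the following lines (an illustrative sketch of the formula above, not the notebook's exact code):

```python
import numpy as np

def hard_nd(x):
    # f(x) = x1^2 + sum_{i>=2} x_i^2 * x1 * cos(x_i) + 2n
    n = len(x)
    return x[0] ** 2 + np.sum(x[1:] ** 2 * x[0] * np.cos(x[1:])) + 2 * n

def hard_nd_grad(x):
    g = np.empty_like(x)
    g[0] = 2 * x[0] + np.sum(x[1:] ** 2 * np.cos(x[1:]))
    g[1:] = x[0] * (2 * x[1:] * np.cos(x[1:]) - x[1:] ** 2 * np.sin(x[1:]))
    return g

# Example: a 5-dimensional instance evaluated at an arbitrary point.
x0 = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
print(hard_nd(x0), hard_nd_grad(x0))
```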
The function's shape over \(\mathbb{R}^{2}\) already hints at how complicated it is over higher dimensional domains. Nevertheless, the hyper-parameter search proved its worth, and the optimizers coped with the complexity even in higher dimensions such as 5, 6, and 12.
For all the initial points we checked, ADAM performed the best by a large margin, and with a reasonable number of iterations.
Conclusion
We implemented four optimizers that are highly popular within the Machine Learning community and tested them on both simple and complex functions, in order to study their performance in a sterile environment, one that is arguably simpler than the usual real data-set loss-function setting, and thus to gain insights about their behaviour.
It was clear that on simple convex functions, the simpler Gradient Descent and Momentum performed best. The theoretical ability of ADAM and RMSProp to use a high learning rate while avoiding oscillations did not come into effect, even in a setting that was supposed to give it a clear advantage. Yet on a truly complicated function, ADAM consistently performed best, converging to a lower minimum while keeping the number of iterations low. This suggests that ADAM excels on complicated functions, which may explain why it is so successful in real-world Deep Learning practice. Our work also highlighted the importance of hyper-parameter tuning when using optimizers, and of learning rate decay methods.
Bibliography
Adam Optimization Algorithm (C2W2L08)
How to implement an Adam Optimizer from Scratch
An Overview of Machine Learning Optimization Techniques
A Survey of Optimization Methods from a Machine Learning Perspective
Python notebook of gradient descent
Adam: A Method for Stochastic Optimization
Barzilai, Jonathan; Borwein, Jonathan M. (1988). Two-Point Step Size Gradient Methods. IMA Journal of Numerical Analysis. 8 (1): 141. doi:10.1093/imanum/8.1.141.
Fletcher, R. (2005). On the Barzilai-Borwein Method. In Qi, L.; Teo, K.; Yang, X. (eds.). Optimization and Control with Applications. Applied Optimization. Vol. 96. Boston: Springer. pp. 235. ISBN 0-387-24254-6.