Adapting Mono-Forward with Zeroth-Order Gradient Estimation for Automatic Differentiation-Free Training
A. Görpelioğlu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Stephanie Tan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Yaqi Guo – Mentor (TU Delft - Mechanical Engineering)
R.L. Lagendijk – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
This paper explores adapting the Mono-Forward (MF) algorithm with zeroth-order optimization for image classification that is free of backpropagation (BP) and automatic differentiation (AD), assessing its feasibility in scenarios where exact gradients are unavailable. Mono-Forward trains neural networks without backpropagation and without the multiple forward passes typically required by forward-forward algorithms; however, it still relies on AD for local training of model layers when implemented with modern deep learning frameworks. This work proposes MF+DD, which replaces AD in Mono-Forward with zeroth-order gradient estimation via directional derivatives, resulting in a training algorithm that is free of both AD and global BP. The paper also introduces a random-projection-based modification that addresses a limitation of Mono-Forward in architectures with large intermediate activation tensors and improves computational efficiency. Experiments on MNIST, FashionMNIST, CIFAR-10, and CIFAR-100 with both MLP and CNN architectures show that MF+DD achieves accuracy comparable to MF with AD on simpler datasets, while the accuracy gap widens on more complex benchmarks, suggesting that the noise introduced by the directional derivative estimator becomes more impactful as task difficulty increases. Results further show that increasing the number of perturbation directions P improves both accuracy and training stability, at the cost of increased computation.
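As a rough illustration of the zeroth-order idea mentioned in the abstract, the sketch below estimates a gradient from loss evaluations alone by averaging forward-difference directional derivatives over P random directions. It is not the paper's implementation: the function name `dd_gradient_estimate`, the Gaussian sampling of directions, the step size `eps`, and the toy quadratic loss are all assumptions chosen for illustration, and the abstract does not specify the exact estimator used in MF+DD.

```python
import random


def dd_gradient_estimate(loss_fn, w, num_directions=8, eps=1e-3, rng=None):
    """Estimate the gradient of loss_fn at w using only loss evaluations.

    Hypothetical helper: averages forward-difference directional
    derivatives along `num_directions` random Gaussian directions (P).
    """
    rng = rng or random.Random()
    base = loss_fn(w)  # unperturbed loss, reused for every direction
    grad = [0.0] * len(w)
    for _ in range(num_directions):
        # Sample a random perturbation direction v ~ N(0, I) (assumed choice).
        v = [rng.gauss(0.0, 1.0) for _ in w]
        w_pert = [wi + eps * vi for wi, vi in zip(w, v)]
        # Finite-difference approximation of the derivative along v.
        slope = (loss_fn(w_pert) - base) / eps
        # Accumulate slope * v; the average over directions estimates the gradient.
        for i, vi in enumerate(v):
            grad[i] += slope * vi / num_directions
    return grad


def quadratic(w):
    """Toy loss whose exact gradient is w itself, used only as a check."""
    return 0.5 * sum(wi * wi for wi in w)


w = [1.0, -2.0, 0.5]
few = dd_gradient_estimate(quadratic, w, num_directions=4, rng=random.Random(0))
many = dd_gradient_estimate(quadratic, w, num_directions=5000, rng=random.Random(0))
print("P=4:   ", few)
print("P=5000:", many)
```

Each direction costs one extra loss evaluation, so the estimate with more directions is expected to be less noisy but proportionally more expensive, which mirrors the accuracy and cost trade-off in P that the abstract reports.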