Adapting Mono-Forward with Zeroth-Order Gradient Estimation for Automatic Differentiation-Free Training

Bachelor Thesis (2026)
Author(s)

A. Görpelioğlu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Stephanie Tan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Yaqi Guo – Mentor (TU Delft - Mechanical Engineering)

R.L. Lagendijk – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
24-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper explores adapting the Mono-Forward (MF) algorithm with zeroth-order optimization for image classification that is free of backpropagation (BP) and automatic differentiation (AD), and assesses its feasibility in scenarios where exact gradients are unavailable. Mono-Forward trains neural networks without backpropagation and without the multiple forward passes typically required by forward-forward algorithms; however, when implemented in modern deep learning frameworks, it still relies on AD for the local training of each layer. This work proposes MF+DD, which replaces AD in Mono-Forward with zeroth-order gradient estimation via directional derivatives, resulting in a training algorithm that is free of both AD and global BP. To improve computational efficiency, the paper also introduces a random-projection-based modification that addresses a limitation of Mono-Forward in architectures with large intermediate activation tensors. Experiments on MNIST, FashionMNIST, CIFAR-10, and CIFAR-100 with both MLP and CNN architectures show that MF+DD achieves accuracy comparable to MF with AD on the simpler datasets, while the accuracy gap widens on the more complex benchmarks, suggesting that the noise introduced by the directional derivative estimator becomes more impactful as task difficulty increases. The results further show that increasing the number of perturbation directions P improves both accuracy and training stability, at the cost of increased computation.
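
For readers unfamiliar with zeroth-order gradient estimation via directional derivatives, the general idea can be illustrated with a minimal sketch. It is not taken from the thesis: it assumes a forward finite-difference approximation of each directional derivative and Gaussian perturbation directions, and the names `dd_gradient_estimate`, `toy_loss`, `num_directions` (playing the role of P), and `eps` are hypothetical. The estimator actually used in MF+DD may differ in these details.

```python
import numpy as np


def dd_gradient_estimate(loss_fn, w, num_directions=8, eps=1e-3, rng=None):
    """Estimate the gradient of loss_fn at w using only loss evaluations.

    Assumption (not from the thesis): each directional derivative is
    approximated by a forward finite difference along a Gaussian direction.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(w)  # one unperturbed evaluation, reused for all directions
    grad = np.zeros_like(w)
    for _ in range(num_directions):
        v = rng.standard_normal(w.shape)             # random perturbation direction
        dd = (loss_fn(w + eps * v) - base) / eps     # approximate directional derivative
        grad += dd * v                               # project back onto the direction
    return grad / num_directions                     # average over the directions


# Toy quadratic loss whose exact gradient (w - target) is known.
target = np.arange(10, dtype=float)


def toy_loss(w):
    return 0.5 * np.sum((w - target) ** 2)


w0 = np.zeros(10)
true_grad = w0 - target
est_few = dd_gradient_estimate(toy_loss, w0, num_directions=4,
                               rng=np.random.default_rng(0))
est_many = dd_gradient_estimate(toy_loss, w0, num_directions=2000,
                                rng=np.random.default_rng(0))
```

Each direction costs one extra loss evaluation, so the cost grows linearly with the number of directions while the variance of the averaged estimate shrinks, which is consistent with the accuracy and cost trade-off in P described above.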

Files

Ates-monodd_2.pdf
(pdf | 0.515 MB)
License info not available