Vanishing empirical variance in randomly initialized networks

Master Thesis (2023)
Author(s)

M.A. Grzejdziak (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

M. Loog – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Marcel J.T. Reinders – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

J.W. Böhmer – Graduation committee member (TU Delft - Algorithmics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Michał Grzejdziak
Publication Year
2023
Language
English
Graduation Date
12-06-2023
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Neural networks are commonly initialized so that the theoretical variance of the hidden pre-activations stays constant across layers, in order to avoid the vanishing and exploding gradient problem. Although this condition is necessary for training very deep networks, numerous analyses have shown that it is not sufficient. We explain this fact by analyzing the behavior of the empirical variance, which is the more meaningful quantity in practice, where data sets are of finite size. We demonstrate that its discrepancy with the theoretical variance grows with depth. We study the output distribution of neural networks at initialization in terms of its kurtosis, which we find grows to infinity with increasing depth even when the theoretical variance stays constant. As a consequence, the empirical variance vanishes: asymptotically, it converges in probability to zero. Our analysis, which studies the increasing dependence between outputs, focuses on fully-connected ReLU networks with He initialization, but we hypothesize that many more random weight initialization methods suffer from either vanishing or exploding empirical variance. We support this hypothesis experimentally and demonstrate the failure of state-of-the-art random initialization methods in very deep regimes.
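
The phenomenon described in the abstract can be illustrated with a minimal simulation. The sketch below is not taken from the thesis; the layer width, depth, and batch size are arbitrary assumptions chosen for the demonstration. It propagates a small, finite batch through a deep fully-connected ReLU network with He initialization and tracks the empirical variance of the pre-activations across the batch at selected depths.

    import numpy as np

    rng = np.random.default_rng(0)

    width = 128        # hidden-layer width (arbitrary assumption)
    depth = 500        # number of hidden layers (arbitrary assumption)
    batch_size = 64    # finite "data set" size (arbitrary assumption)

    # A small batch of standard-normal inputs; each row is one sample.
    x = rng.standard_normal((batch_size, width))

    for layer in range(1, depth + 1):
        # He initialization: W_ij ~ N(0, 2 / fan_in). For ReLU networks this
        # keeps the theoretical pre-activation variance from decaying with depth.
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        pre = x @ w                # pre-activations of this layer
        x = np.maximum(pre, 0.0)   # ReLU

        if layer % 100 == 0:
            # Empirical variance across the finite batch, averaged over units.
            # In a typical run this estimate decays toward zero with depth as the
            # representations of different inputs become increasingly dependent,
            # even though the theoretical variance is preserved.
            emp_var = pre.var(axis=0).mean()
            print(f"layer {layer:4d}: mean empirical batch variance = {emp_var:.3e}")

In this setup the collapse of the printed estimate is typically more pronounced for deeper and narrower networks; the theoretical variance, being an expectation over initializations, does not reflect it.
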

Files

MG_thesis_report_1_.pdf
(pdf | 0.726 MB)
License info not available