Vanishing empirical variance in randomly initialized networks

Master Thesis (2023)
Author(s)

M.A. Grzejdziak (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

M. Loog – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Marcel J.T. Reinders – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

J.W. Böhmer – Graduation committee member (TU Delft - Algorithmics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Michał Grzejdziak
Publication Year
2023
Language
English
Graduation Date
12-06-2023
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Neural networks are commonly initialized so that the theoretical variance of the hidden pre-activations stays constant across layers, in order to avoid the vanishing and exploding gradient problem. Although this condition is necessary for training very deep networks, numerous analyses have shown that it is not sufficient. We explain this fact by analyzing the behavior of the empirical variance, which is the more meaningful quantity in practice, where data sets are of finite size. We demonstrate that its discrepancy with the theoretical variance grows with depth. We study the output distribution of neural networks at initialization in terms of its kurtosis, which we find grows to infinity with increasing depth even when the theoretical variance stays constant. As a consequence, the empirical variance vanishes: asymptotically, it converges in probability to zero. Our analysis, which studies the increasing dependence between outputs, focuses on fully-connected ReLU networks with He initialization, but we hypothesize that many more random weight initialization methods suffer from either vanishing or exploding empirical variance. We support this hypothesis experimentally and demonstrate the failure of state-of-the-art random initialization methods in very deep regimes.
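
The phenomenon described in the abstract can be illustrated with a minimal simulation. The sketch below is not taken from the thesis; the layer width, depth, and batch size are arbitrary assumptions chosen for the demonstration. It propagates a small, finite batch through a deep fully-connected ReLU network with He initialization and tracks the empirical variance of the pre-activations across the batch at selected depths.

    import numpy as np

    rng = np.random.default_rng(0)

    width = 128        # hidden-layer width (arbitrary assumption)
    depth = 500        # number of hidden layers (arbitrary assumption)
    batch_size = 64    # finite "data set" size (arbitrary assumption)

    # A small batch of standard-normal inputs; each row is one sample.
    x = rng.standard_normal((batch_size, width))

    for layer in range(1, depth + 1):
        # He initialization: W_ij ~ N(0, 2 / fan_in). For ReLU networks this
        # keeps the theoretical pre-activation variance from decaying with depth.
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        pre = x @ w                # pre-activations of this layer
        x = np.maximum(pre, 0.0)   # ReLU

        if layer % 100 == 0:
            # Empirical variance across the finite batch, averaged over units.
            # In a typical run this estimate decays toward zero with depth as the
            # representations of different inputs become increasingly dependent,
            # even though the theoretical variance is preserved.
            emp_var = pre.var(axis=0).mean()
            print(f"layer {layer:4d}: mean empirical batch variance = {emp_var:.3e}")

In this setup the collapse of the printed estimate is typically more pronounced for deeper and narrower networks; the theoretical variance, being an expectation over initializations, does not reflect it.
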

Files

MG_thesis_report_1_.pdf
(pdf | 0.726 MB)
License info not available