Neural networks are typically initialized such that the theoretical variance of the hidden pre-activations remains constant, in order to avoid the vanishing and exploding gradient problem. This condition is necessary for training very deep networks, but numerous analyses show that it is not sufficient. We explain this behavior by analyzing the empirical variance, which is the more meaningful quantity in the practical setting of data sets of finite size. We show that the discrepancy between the empirical and the theoretical variance grows with depth. Studying the output distribution of neural networks at initialization, we find that its kurtosis grows to infinity with increasing depth, even when the theoretical variance stays constant. As a result, the empirical variance vanishes: it converges in probability to zero as the depth grows. Our analysis focuses on fully connected ReLU networks with He initialization, but we hypothesize that many other random weight initialization schemes suffer from vanishing or exploding empirical variance. We support this hypothesis experimentally and demonstrate the failure of state-of-the-art random initialization methods in the very deep regime.
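The phenomenon can be observed directly in simulation. The following minimal sketch (not the paper's experimental protocol; width, depth, batch size, and the pooled variance estimate are illustrative choices) propagates a finite batch through a deep fully connected ReLU network with He initialization and records the empirical pre-activation variance per layer. While the theoretical variance is preserved exactly, the measured variance drifts away from it and tends to collapse at large depth.

```python
import numpy as np

rng = np.random.default_rng(0)

width, depth, batch = 512, 200, 256  # illustrative settings, not from the paper

# Finite batch of standard-normal inputs (the empirical, finite-data setting).
x = rng.standard_normal((batch, width))

for layer in range(1, depth + 1):
    # He initialization: W_ij ~ N(0, 2 / fan_in), which keeps the theoretical
    # pre-activation variance constant across ReLU layers.
    W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
    z = x @ W                   # pre-activations
    x = np.maximum(z, 0.0)      # ReLU

    if layer % 50 == 0:
        # Empirical variance over the finite batch (all units pooled);
        # it fluctuates and typically decays with depth, unlike the
        # constant theoretical value.
        print(f"layer {layer:4d}: empirical pre-activation variance = {z.var():.4f}")
```

Rerunning with different seeds shows the empirical variance varying over orders of magnitude between runs, consistent with the heavy-tailed (high-kurtosis) output distribution described above.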