Efficient Neural Architecture Search for Language Modeling
M. Li (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Frans A Oliehoek – Mentor (TU Delft - Interactive Intelligence)
Wei Pan – Graduation committee member (TU Delft - Robot Dynamics)
Jan van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
H. Zhou – Graduation committee member (TU Delft - Robot Dynamics)
Abstract
Neural networks have achieved great success in many difficult learning tasks such as image classification, speech recognition and natural language processing. However, neural architectures are hard to design, a process that demands substantial expertise and time from human experts. Therefore, there has been growing interest in automating the design of neural architectures. Although architectures found by neural architecture search (NAS) have achieved competitive performance on various tasks, the efficiency of NAS still needs to be improved. Moreover, current NAS approaches disregard the dependency between a node and its predecessors and successors.
This thesis builds upon BayesNAS, which employs classic Bayesian learning to search for CNN architectures, and extends it to the search for recurrent architectures. Hierarchical sparse priors are used to model the architecture parameters and alleviate the dependency issue. Since the update of the posterior variance is based on the Laplace approximation, an efficient method to compute the Hessian of a recurrent layer is proposed. Candidate architectures can be found after training the over-parameterized network for only one epoch. Our experiments on Penn Treebank and WikiText-2 show that competitive architectures for language modeling can be found in 0.3 GPU days on a single GPU, making our algorithm more efficient than state-of-the-art approaches.
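To make the setting concrete, the sketch below illustrates the kind of over-parameterized recurrent cell such a search operates on: every candidate operation on an edge is scaled by an architecture weight, and each weight carries a per-parameter precision standing in for the hierarchical sparse prior whose Laplace-based update drives pruning. This is an illustrative assumption of the setup, not the thesis implementation; all class and variable names (OverParamRNNCell, CANDIDATE_OPS, arch_w, alpha) are hypothetical.

import torch
import torch.nn as nn

# Candidate activations that could compete on a single edge of the cell.
CANDIDATE_OPS = {
    "tanh": torch.tanh,
    "sigmoid": torch.sigmoid,
    "relu": torch.relu,
    "identity": lambda x: x,
}

class OverParamRNNCell(nn.Module):
    """Toy over-parameterized recurrent cell with one mixed edge."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear = nn.Linear(input_size + hidden_size, hidden_size)
        # One architecture weight per candidate operation on this edge.
        self.arch_w = nn.Parameter(torch.full((len(CANDIDATE_OPS),),
                                              1.0 / len(CANDIDATE_OPS)))
        # Per-weight precision of a zero-mean Gaussian prior; a Laplace step
        # would re-estimate these from the Hessian of the loss w.r.t. arch_w.
        self.register_buffer("alpha", torch.ones(len(CANDIDATE_OPS)))

    def forward(self, x, h):
        pre = self.linear(torch.cat([x, h], dim=-1))
        outs = [op(pre) for op in CANDIDATE_OPS.values()]
        # Weighted sum of all candidate operations (the "mixed" edge).
        return sum(w * o for w, o in zip(self.arch_w, outs))

    def sparse_penalty(self):
        # 0.5 * sum_i alpha_i * w_i^2 : log-prior term of the sparse prior.
        return 0.5 * (self.alpha * self.arch_w ** 2).sum()

After training, architecture weights whose magnitude is negligible relative to their estimated posterior variance would be pruned, leaving a single discrete recurrent architecture.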