Efficient Neural Architecture Search for Language Modeling

Master Thesis (2019)
Author(s)

M. Li (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Frans A. Oliehoek – Mentor (TU Delft - Interactive Intelligence)

Wei Pan – Graduation committee member (TU Delft - Robot Dynamics)

Jan van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

H. Zhou – Graduation committee member (TU Delft - Robot Dynamics)

Publication Year
2019
Language
English
Copyright
© 2019 Mingxi Li
Graduation Date
21-08-2019
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering | Embedded Systems
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Neural networks have achieved great success in many difficult learning tasks such as image classification, speech recognition and natural language processing. However, neural architectures are hard to design, requiring extensive expert knowledge and time. There has therefore been growing interest in automating the design of neural architectures. Although such searched architectures have achieved competitive performance on various tasks, the efficiency of neural architecture search (NAS) still needs to be improved. Moreover, current NAS approaches disregard the dependency between a node and its predecessors and successors.
This thesis builds upon BayesNAS, which employs classic Bayesian learning to search for CNN architectures, and extends it to neural architecture search for recurrent architectures. Hierarchical sparse priors are used to model the architecture parameters and alleviate the dependency issue. Since the update of the posterior variance is based on the Laplace approximation, an efficient method to compute the Hessian of the recurrent layer is proposed. Candidate architectures can be found after training the over-parameterized network for only one epoch. Our experiments on Penn Treebank and WikiText-2 show that competitive architectures for language modeling can be found in 0.3 GPU days on a single GPU. We find that our algorithm is more efficient than the state of the art.
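To make the abstract's description more concrete, the following is a minimal NumPy sketch of the general machinery it refers to: a Laplace-approximation posterior update combined with a sparse, ARD-style hyperparameter update that drives unneeded architecture weights toward pruning. It is an illustration under assumed names (laplace_sparse_update, grad, hess_diag, alpha) and a hypothetical pruning threshold, not the thesis's exact algorithm.

import numpy as np

def laplace_sparse_update(w, grad, hess_diag, alpha, n_iters=1):
    """Illustrative posterior / hyperparameter updates for architecture weights.

    w         : current architecture weights (MAP estimate), shape (d,)
    grad      : gradient of the negative log-likelihood at w, shape (d,)
    hess_diag : diagonal Hessian approximation of the negative log-likelihood
    alpha     : per-weight prior precisions (large alpha => weight pruned)
    """
    for _ in range(n_iters):
        # Laplace approximation: posterior precision = data Hessian + prior precision.
        post_precision = hess_diag + alpha
        post_var = 1.0 / post_precision

        # Newton-style MAP refinement of the weights.
        w = w - post_var * (grad + alpha * w)

        # Sparse (ARD-like) hyperparameter update: weights whose posterior mass
        # concentrates near zero receive ever larger precision and can be pruned.
        alpha = 1.0 / (w ** 2 + post_var)

    return w, alpha, post_var

# Toy usage: keep only connections whose prior precision stays below a threshold.
w = np.random.randn(8) * 0.1
grad = np.random.randn(8)
hess_diag = np.abs(np.random.randn(8)) + 1.0
alpha = np.ones(8)
w, alpha, post_var = laplace_sparse_update(w, grad, hess_diag, alpha, n_iters=5)
keep = alpha < 1e3  # surviving candidate connections

In the thesis setting, grad and hess_diag would come from one epoch of training the over-parameterized recurrent network; the surviving connections define the candidate architecture.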

Files

Thesis_Final.pdf
(pdf | 1.11 Mb)
License info not available