Although transformers are state-of-the-art models for natural language tasks, obtaining reasonable performance still often requires large transformers, which are expensive to train and deploy. Fortunately, there are techniques for increasing the size of transformers without extra compute cost; one such technique is sparsity. However, it remains unclear whether sparse architectures are intrinsically more efficient than their dense counterparts. In this paper, we investigate whether replacing the feedforward networks in small transformers with sparse alternatives yields better predictions and faster inference. We find that although inference speed does not improve, owing to software and hardware limitations, certain sparse alternatives do yield better language understanding. Our research contributes to smarter architectural decision-making when designing small language models.
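The specific sparse alternatives studied are not detailed here, but a common instance of the idea is a mixture-of-experts layer: several expert feedforward networks replace the single dense one, and a router sends each token to only one expert, so per-token compute stays roughly constant while total parameter count grows. The following pure-Python sketch illustrates this under assumed (illustrative) dimensions and top-1 routing; all class and variable names are hypothetical, not the paper's implementation.

```python
import random

random.seed(0)

def rand_matrix(rows, cols):
    # Small random weight matrix stored as plain lists (illustration only).
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    # Matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

class FeedForward:
    """A standard transformer FFN block: expand, nonlinearity, project back."""
    def __init__(self, d_model, d_hidden):
        self.W1 = rand_matrix(d_hidden, d_model)
        self.W2 = rand_matrix(d_model, d_hidden)

    def __call__(self, x):
        return matvec(self.W2, relu(matvec(self.W1, x)))

class SparseFFN:
    """Mixture-of-experts replacement for the dense FFN: each token is
    routed to a single expert, so per-token compute matches one dense FFN
    while parameter count scales with the number of experts."""
    def __init__(self, d_model, d_hidden, n_experts):
        self.experts = [FeedForward(d_model, d_hidden) for _ in range(n_experts)]
        self.router = rand_matrix(n_experts, d_model)

    def __call__(self, x):
        scores = matvec(self.router, x)                          # one logit per expert
        best = max(range(len(scores)), key=scores.__getitem__)   # top-1 routing
        return self.experts[best](x)

# Both layers map a d_model-dimensional token back to d_model dimensions.
d_model, d_hidden = 8, 32
token = [random.uniform(-1.0, 1.0) for _ in range(d_model)]
dense = FeedForward(d_model, d_hidden)
sparse = SparseFFN(d_model, d_hidden, n_experts=4)
```

Because only one expert runs per token, the sparse layer's theoretical FLOPs match the dense layer's, which is why any speedup (or lack of one) hinges on how well software and hardware exploit the sparsity.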