Text-to-speech (TTS) systems have achieved substantial improvements in quality with the use of neural architectures. Despite these quality improvements, they remain computationally intensive, which limits their deployment on low-power devices. This thesis explores whether a custom hardware accelerator can improve the efficiency of end-to-end TTS inference while maintaining or improving synthesis quality.
A phoneme-to-mel model based on EfficientSpeech is adapted with hardware-aware modifications, such as replacing expensive normalisation layers and simplifying activation functions. These modifications reduce computational complexity while improving synthesis quality over the baseline model. The HiFi-GAN vocoder is optimised by reducing its size from 920k to 296k parameters while maintaining acceptable quality. Combined, the complete TTS pipeline shrinks from 1.2M to 562k parameters, demonstrating that substantial reductions in both model size and complexity are achievable without sacrificing synthesis quality.
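The normalisation replacement can be illustrated with a short sketch. The abstract does not name the substitute layer; the RMS-style variant below is one common hardware-friendly option (a hypothetical stand-in, not necessarily the thesis's actual choice), which drops the mean subtraction and needs only a single reduction per feature vector:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm: two reduction passes (mean, variance),
    # a subtraction, a square root, and a division per feature vector.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMS-style normalisation (illustrative replacement): no mean
    # subtraction and only one reduction pass, which is cheaper to
    # realise as a fixed-function hardware block.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.randn(2, 8).astype(np.float32)
y = rms_norm(x)  # each row now has (approximately) unit RMS
```

Learned affine parameters are omitted for brevity; either variant can be followed by a per-channel scale and shift.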
A custom instruction set and hardware accelerator are designed to support the essential operations of TTS inference: matrix operations, one-dimensional convolutions, and activation functions. The accelerator was synthesised and implemented on an FPGA in out-of-context mode, achieving a post-implementation clock period of 4.182 ns with a reported power consumption of 1.569 W, approximately 9.9 times lower than that of the baseline GPU system.
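One way a single accelerator datapath can serve both matrix operations and one-dimensional convolutions is to lower the convolution to a matrix multiply (im2col). The sketch below is illustrative only; the data layout and the mapping onto instructions are assumptions, not the thesis's actual design:

```python
import numpy as np

def conv1d_as_matmul(x, w):
    """1-D convolution (valid padding, stride 1) expressed as one matrix
    multiply, the form a matmul-centric accelerator would execute.
    x: (in_ch, T) input, w: (out_ch, in_ch, K) weights. Hypothetical layout."""
    in_ch, T = x.shape
    out_ch, _, K = w.shape
    # im2col: unfold each length-K input window into one column
    cols = np.stack([x[:, t:t + K].ravel() for t in range(T - K + 1)],
                    axis=1)                       # (in_ch*K, T-K+1)
    return w.reshape(out_ch, in_ch * K) @ cols    # (out_ch, T-K+1)

x = np.arange(6, dtype=np.float32).reshape(1, 6)
w = np.ones((1, 1, 3), dtype=np.float32)
y = conv1d_as_matmul(x, w)  # moving sum over windows of 3: [[3, 6, 9, 12]]
```

The trade-off is the duplicated input data in `cols`; a hardware implementation would typically stream windows from a line buffer instead of materialising the full matrix.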
An end-to-end TTS inference accelerator is implemented and verified, achieving a consistent real-time factor (RTF) of approximately 0.156 across varying input lengths, confirming real-time inference capability. For the longest test input of 31 phonemes, the accelerator uses 11.3 times less energy than the baseline system, despite its RTF being only slightly below the baseline's 0.172. For shorter inputs the energy efficiency improves further: for a 4-phoneme test case, the accelerator consumes over 50 times less energy.
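The energy comparison follows directly from the reported power and RTF figures. As a rough consistency check (a sketch, assuming both systems synthesise the same audio duration at constant average power; the baseline power is inferred from the stated 9.9x ratio, not taken from a separate measurement):

```python
# Figures from the abstract; baseline power derived from the 9.9x ratio.
P_ACC = 1.569          # accelerator power (W)
P_BASE = 9.9 * P_ACC   # implied baseline GPU power (W)
RTF_ACC, RTF_BASE = 0.156, 0.172

def energy_per_second_of_audio(power_w, rtf):
    # Energy = power x inference time; inference time = RTF x audio duration,
    # so per second of synthesised audio: E = power x RTF (joules).
    return power_w * rtf

ratio = (energy_per_second_of_audio(P_BASE, RTF_BASE)
         / energy_per_second_of_audio(P_ACC, RTF_ACC))
print(round(ratio, 1))  # ~10.9
```

The ~10.9x figure from these rounded inputs is close to the reported 11.3x; the small gap plausibly stems from rounding in the published numbers.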
This work demonstrates that dedicated hardware accelerators combined with hardware-aware model optimisations can significantly improve the efficiency of real-time neural TTS, enabling deployment on low-power devices such as embedded systems and battery-powered platforms that typically have strict power budgets and cannot rely on continuous mains power.
Audio samples of the optimised TTS models are available on a demo page, and the hardware accelerator implementation is released as open source in the specified repository.