The advent of transformer architectures revolutionized natural language processing, particularly with the popularity of decoder-only transformers for text generation tasks like GPT models. However, the autoregressive nature of these models limits their inference speed, which is crucial for real-time applications and resource-constrained environments. Memory bandwidth is a significant bottleneck in autoregressive decoding, where repeatedly loading large key and value tensors dominates the cost. The Multi-Query Attention (MQA) architecture was proposed to reduce memory access by shrinking the key-value cache, improving inference speed at the cost of generation quality. Grouped-Query Attention (GQA) was introduced to mitigate this quality decline, serving as an interpolation between Multi-Head Attention (MHA) and MQA. We explore the trade-offs between inference speed and quality in decoder-only models by experimenting with various proportions of query groups relative to attention heads during pre-training. Additionally, we investigate the impact of reducing the size of key and value vectors compared to GQA and explore a hybrid method combining GQA with shortened key-value vectors. This study aims to expand the list of possible trade-offs and help select an optimal architecture based on specific needs.
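To make the MHA/GQA/MQA interpolation concrete, the following is a minimal sketch of grouped-query attention, assuming PyTorch; the function name, tensor shapes, and head counts are illustrative and not the implementation used in this work. Setting num_kv_heads equal to num_heads recovers MHA, while num_kv_heads = 1 recovers MQA; intermediate values give GQA, with the key-value cache shrinking in proportion to num_kv_heads.

    # Minimal grouped-query attention sketch (assumes PyTorch >= 2.0).
    import torch
    import torch.nn.functional as F

    def grouped_query_attention(q, k, v, num_heads, num_kv_heads):
        # q: (batch, seq, num_heads * head_dim)
        # k, v: (batch, seq, num_kv_heads * head_dim) -- the smaller KV cache
        b, s, _ = q.shape
        head_dim = q.shape[-1] // num_heads
        q = q.view(b, s, num_heads, head_dim).transpose(1, 2)      # (b, h, s, d)
        k = k.view(b, s, num_kv_heads, head_dim).transpose(1, 2)   # (b, g, s, d)
        v = v.view(b, s, num_kv_heads, head_dim).transpose(1, 2)
        # Each group of query heads shares one key/value head.
        group_size = num_heads // num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)                  # (b, h, s, d)
        v = v.repeat_interleave(group_size, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return out.transpose(1, 2).reshape(b, s, num_heads * head_dim)

    # Hypothetical example: 8 query heads sharing 2 key/value heads.
    q = torch.randn(1, 16, 8 * 64)
    k = torch.randn(1, 16, 2 * 64)
    v = torch.randn(1, 16, 2 * 64)
    print(grouped_query_attention(q, k, v, num_heads=8, num_kv_heads=2).shape)
    # torch.Size([1, 16, 512])

In this sketch, only k and v would be stored in the decoding cache, so moving from 8 key-value heads to 2 cuts the cached tensor size by a factor of four, which is the memory-bandwidth saving that motivates MQA and GQA.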