The rapid growth of deep learning models, particularly Transformers, has far outpaced hardware scaling, increasing pressure on memory and compute efficiency. While INT8 quantization reduces memory requirements, it often sacrifices accuracy. Microscaling (MX) formats, such as MXINT8, address this trade-off by grouping INT8 values with a shared exponent, achieving FP32-level accuracy with up to 4 times memory savings. However, efficient execution of mixed integer–floating-point operations requires specialized hardware. Prior MX accelerators based on systolic arrays are limited by underutilized processing elements or the overhead of FP32 peripheries.
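For intuition, the following is a minimal Python sketch of the shared-exponent idea behind MXINT8. The block size of 32 matches the common MX choice, but the exact element encoding and rounding rules of the OCP Microscaling specification are simplified, and the helper names are illustrative only.

```python
import numpy as np

def mxint8_quantize(block):
    """Quantize one block of FP32 values into a shared power-of-two
    exponent plus INT8 elements (simplified MXINT8-style encoding)."""
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        return 0, np.zeros(block.shape, dtype=np.int8)
    # Shared exponent chosen so the largest magnitude lands in the INT8 range.
    shared_exp = int(np.floor(np.log2(amax))) - 6
    ints = np.clip(np.round(block / 2.0 ** shared_exp), -128, 127).astype(np.int8)
    return shared_exp, ints

def mxint8_dequantize(shared_exp, ints):
    """Recover approximate FP32 values from the shared exponent + INT8 block."""
    return ints.astype(np.float32) * np.float32(2.0 ** shared_exp)

# Example: a block of 32 FP32 values is stored as 32 INT8 elements plus one
# shared exponent, i.e. roughly a quarter of the FP32 footprint.
x = np.random.randn(32).astype(np.float32)
exp, q = mxint8_quantize(x)
x_hat = mxint8_dequantize(exp, q)
```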
This work presents MXITA, a multi-dimensional systolic array accelerator for MX matrix multiplications in neural network workloads. The architecture introduces a parameterization over (M, N, P, Q) that enables trade-offs between supported MX block sizes and FP32 peripheral reuse while sustaining high throughput. MXITA was designed, implemented, and integrated into the Snitch cluster, and verified for functional correctness at both the module and system level.
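As context for where the integer and FP32 datapaths sit, the sketch below shows a reference-level MX matrix multiplication: INT8 dot products within each shared-exponent block, with the per-block power-of-two scales applied and accumulated in FP32. The function and argument names are illustrative; the sketch does not describe the MXITA dataflow or the meaning of the (M, N, P, Q) parameters.

```python
import numpy as np

def mx_matmul_reference(A_ints, A_exps, B_ints, B_exps, block=32):
    """Reference MX GEMM: A_ints is (M, K) INT8 with per-block exponents
    A_exps of shape (M, K // block); B_ints is (K, N) INT8 with B_exps of
    shape (K // block, N). Integer MACs inside a block, FP32 outside."""
    M, K = A_ints.shape
    _, N = B_ints.shape
    assert K % block == 0
    C = np.zeros((M, N), dtype=np.float32)
    for b in range(K // block):
        lo, hi = b * block, (b + 1) * block
        # Integer datapath: INT8 x INT8 products accumulated in INT32.
        partial = A_ints[:, lo:hi].astype(np.int32) @ B_ints[lo:hi, :].astype(np.int32)
        # FP32 periphery: combine the two shared exponents and accumulate.
        scale = 2.0 ** (A_exps[:, b][:, None] + B_exps[b, :][None, :])
        C += partial.astype(np.float32) * scale.astype(np.float32)
    return C
```

In hardware terms, the per-block FP32 scaling and accumulation step is the FP32 periphery whose cost MXITA amortizes across the integer compute array.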
Synthesis in GF22 technology demonstrates that MXITA achieves higher area efficiency than prior state-of-the-art MX accelerators by amortizing FP32 hardware across compute tiles and reducing periphery overhead. These results highlight the potential of multi-dimensional systolic arrays as scalable and efficient hardware for MX-quantized deep learning workloads.