Continual learning (CL) enables models to learn sequentially from non-stationary data without catastrophic forgetting, which is critical for real-world applications such as robotics and embedded systems. However, implementing CL on edge devices remains challenging due to limited memory and computational resources. Replay-based methods, while effective for CL, impose large memory overheads for storing replay samples.
This thesis proposes a compressed latent replay (CLR) framework to enable memory-efficient on-chip CL for both deep neural networks (DNNs) and spiking neural networks (SNNs). The algorithmic design integrates spatial compression via autoencoders to reduce latent dimensionality, temporal compression via spike counting and binning to exploit SNN sparsity, and their two-dimensional combination for further memory reduction. Experiments on MNIST and the Spiking Heidelberg Digits (SHD) dataset validate the software framework. On MNIST, two-dimensional compression (TW = 20, dim_latent = 20) preserves accuracy (89.0% vs. 88.0% uncompressed) while reducing replay storage from 245 KB to 62.81 KB, i.e., 25.6% of the uncompressed baseline. The advantage grows with the number of replay samples: with 256 samples per class the footprint is 64.37 KB versus 490 KB, or 13.1% of the uncompressed baseline. On SHD, temporal spike counting maintains strong accuracy under compression and outperforms binning. A minimal sketch of the temporal compression step is given below.
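To make the temporal compression step concrete, the following NumPy sketch shows spike counting over non-overlapping windows of length TW and a clipped (binned) variant, together with a rough storage comparison. The function names, shapes, and dtypes are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (NumPy) of temporal compression for SNN latent replay:
# spike counting collapses each window of TW time steps into one integer count;
# binning additionally clips counts to a small number of discrete levels.
# Names, shapes, and dtypes are illustrative assumptions, not the thesis code.
import numpy as np

def spike_count_compress(spikes: np.ndarray, tw: int) -> np.ndarray:
    """Sum binary spikes over non-overlapping windows of length tw.

    spikes: (T, N) binary spike train with T time steps and N latent units.
    returns: (T // tw, N) integer spike counts per window.
    """
    t, n = spikes.shape
    t_trim = (t // tw) * tw                      # drop any incomplete trailing window
    return spikes[:t_trim].reshape(t // tw, tw, n).sum(axis=1)

def bin_compress(spikes: np.ndarray, tw: int, levels: int) -> np.ndarray:
    """Like spike counting, but clip counts to `levels` discrete values."""
    counts = spike_count_compress(spikes, tw)
    return np.clip(counts, 0, levels - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t_steps, n_latent, tw = 100, 20, 20          # e.g. TW = 20, dim_latent = 20
    spikes = (rng.random((t_steps, n_latent)) < 0.1).astype(np.uint8)

    counts = spike_count_compress(spikes, tw)    # shape (5, 20)
    raw_bytes = spikes.size * spikes.itemsize
    compressed_bytes = counts.astype(np.uint8).size  # 1 byte per count (TW <= 255)
    print(f"raw: {raw_bytes} B, compressed: {compressed_bytes} B "
          f"({100 * compressed_bytes / raw_bytes:.1f}% of raw)")
```

Spatial compression would additionally pass the latent activations through an autoencoder's encoder before storage; only the temporal path is sketched here.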
To validate system-level feasibility, two FPGA accelerators were implemented on the AMD Kria KV260 with DMA over AXI and AXI-Stream: a Time Decoder and a sparse fully connected (FC) accelerator that supports hardware forward propagation with software backpropagation. The Time Decoder shortens the decoding of 128 × 5 temporally compressed spike sequences from 17.4 s on the Cortex-A53 to 1.2 s, a speedup of approximately 14.5×. The FC accelerator reduces per-layer computation time for a 784 × 64 FC layer with batch size 128 from 2.240 ms (CPU) to 1.108 ms (P = 16) and 0.323 ms (P = 64), yielding up to 6.94× speedup. End-to-end latency remains higher due to CPU-side preprocessing and data movement. Resource usage meets KV260 limits, with logic and DSP utilization remaining moderate, while BRAM is the dominant constraint at higher parallelism.
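As a functional reference for what the Time Decoder computes, the sketch below expands per-window spike counts back into a binary spike train before replay. The reconstruction rule shown (placing each window's spikes at evenly spaced offsets) is an illustrative assumption; the thesis performs the actual decoding in programmable logic on the KV260, fed via DMA over AXI-Stream.

```python
# Minimal functional reference (NumPy) for the Time Decoder's job: turning
# per-window spike counts back into a binary spike train for replay.
# The even-spacing reconstruction rule is an assumption for illustration only.
import numpy as np

def time_decode(counts: np.ndarray, tw: int) -> np.ndarray:
    """Expand (W, N) spike counts into a (W * tw, N) binary spike train."""
    w, n = counts.shape
    spikes = np.zeros((w * tw, n), dtype=np.uint8)
    for wi in range(w):
        for ni in range(n):
            c = int(counts[wi, ni])
            if c > 0:
                # spread c spikes evenly across the tw time steps of this window
                offsets = np.linspace(0, tw - 1, num=min(c, tw), dtype=int)
                spikes[wi * tw + offsets, ni] = 1
    return spikes

if __name__ == "__main__":
    counts = np.array([[3, 0], [1, 2]], dtype=np.uint8)   # 2 windows, 2 units
    print(time_decode(counts, tw=5))                       # shape (10, 2)
```

The nested per-window, per-unit loop is exactly the regular, independent work that maps well onto parallel hardware, which is why offloading it to the PL yields the reported speedup over the Cortex-A53.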
Overall, this work shows that CLR combined with FPGA acceleration and a hardware-forward, software-backpropagation training split is a practical and scalable route to on-device CL in edge AI systems. Potential extensions include quantized autoencoders, sparsity-aware BRAM allocation for edge-realistic temporal datasets, and pipelined or PL-side preprocessing to reduce end-to-end latency and strengthen deployment feasibility.