STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for Neural Network Deployment on STM32N6 NPU, Politecnico di Milano, 2024–2025
A cross-model comparative analysis connecting the architecture decisions of the Neural-ART NPU to the performance numbers we measured. This chapter answers the central question of the project: how well does a CNN-centric NPU handle different neural network architectures?
The most important finding of this project is not a single number — it is a pattern. The NPU offload rate is not a function of model size, weight count, or number of parameters. It is a function of which operations the model uses. The three models we deployed demonstrate this with unusual clarity.
The 30-point gap between MoveNet (94.7%) and TinyBERT (64.4%) is not due to model size — TinyBERT actually has fewer weights (2.06 MB vs 2.92 MB). It is entirely due to the operations each architecture requires.
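The dependence on operator support rather than model size can be sketched as a toy calculation. The supported-operator set and the per-model operator lists below are illustrative assumptions, not the actual Neural-ART compiler tables, and real offload rates are weighted by work per epoch rather than counted per operator:

```python
# Simplified sketch: offload depends on WHICH operators a model uses, not on
# its parameter count. NPU_SUPPORTED and the op lists are assumed for
# illustration; they are not the real Neural-ART operator tables.

NPU_SUPPORTED = {"Conv2D", "DepthwiseConv2D", "Add", "Relu", "MaxPool", "Concat"}

def offload_rate(ops):
    """Fraction of operator instances that can run on the NPU."""
    on_npu = sum(1 for op in ops if op in NPU_SUPPORTED)
    return on_npu / len(ops)

cnn_like = ["Conv2D"] * 40 + ["Relu"] * 40 + ["MaxPool"] * 5 + ["Softmax"]
transformer_like = (["Conv2D"] * 10       # embedding/projection layers
                    + ["MatMul"] * 24     # attention score and context matmuls
                    + ["Softmax"] * 12    # one per attention block (assumed)
                    + ["LayerNorm"] * 12)

print(f"CNN-like offload:         {offload_rate(cnn_like):.1%}")
print(f"Transformer-like offload: {offload_rate(transformer_like):.1%}")
```

Even though the transformer-like list is shorter, its offload rate collapses because Softmax, LayerNorm, and MatMul fall outside the supported set — the same pattern the three measured models show.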
The latency numbers reveal something important: a small number of SW epochs can dominate total inference time, far out of proportion to their share of the epoch count.
Both CNN models achieve real-time performance (below 33 ms for 30 fps). TinyBERT exceeds 100 ms despite having fewer total weights than MoveNet. The reason is the cost structure of the two execution modes:
An EC epoch runs entirely in the Neural-ART hardware. The STRENG stream engines prefetch weights from OctoFlash while the CONVACC units compute — weight loading and computation overlap. For a typical Conv layer: ~0.15–0.3 ms.
A SW epoch runs on the Cortex-M55 with Helium SIMD. It requires CPU cycles, memory bandwidth, and cache invalidation when crossing NPU/CPU boundaries. Softmax over a 512-token sequence: ~5–15 ms per occurrence.
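The two cost structures above can be combined into a back-of-the-envelope latency model. The per-epoch costs come from the ranges just quoted (~0.15–0.3 ms per EC Conv epoch, several ms per heavy SW epoch); the EC epoch counts and the uniform per-mode costs are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope epoch cost model. SW epoch counts (4 for MoveNet,
# 96 for TinyBERT) are from the measured deployments; EC counts and the
# uniform per-epoch costs are assumptions for illustration.

def estimate_latency_ms(n_ec, n_sw, t_ec=0.2, t_sw=2.0):
    """Total inference time if every epoch cost were uniform per mode."""
    return n_ec * t_ec + n_sw * t_sw

movenet_est  = estimate_latency_ms(n_ec=72, n_sw=4,  t_sw=2.0)
tinybert_est = estimate_latency_ms(n_ec=80, n_sw=96, t_sw=1.0)

print(f"MoveNet-like:  {movenet_est:.1f} ms")    # ~22 ms scale, SW is minor
print(f"TinyBERT-like: {tinybert_est:.1f} ms")   # SW term alone exceeds 90 ms
```

Even at an optimistic 1 ms per SW epoch, 96 fallbacks contribute ~96 ms on their own, which is why TinyBERT lands above 100 ms regardless of how fast its EC epochs run.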
Memory usage reveals a counterintuitive result: TinyBERT uses significantly less SRAM than MoveNet, despite having more total epochs.
| Memory region | MoveNet | YOLOv8n | TinyBERT | Notes |
|---|---|---|---|---|
| cpuRAM2 | 864 KB | ~900 KB | 0 KB | SW epoch buffers for resize ops |
| npuRAM4 | 378 KB | ~400 KB | 0 KB | nn_in buffer (camera frame) |
| npuRAM5 | 432 KB | ~448 KB | 251 KB | nn_out + intermediate activations |
| octoFlash | 2.924 MB | ~3.2 MB | 2.062 MB | Model weights (read-only) |
| Total | 4.559 MB | ~5.0 MB | 2.308 MB | |
TinyBERT uses only 251 KB of SRAM — less than MoveNet's npuRAM5 alone. This is because BERT processes sequential text tokens, not large 2D image tensors. A 192×192 input image requires 110 KB of buffer just for the input. A 512-token text sequence requires far less spatial memory, even though it has more complex attention patterns.
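The arithmetic behind this can be checked directly, assuming INT8 quantization (one byte per tensor element). The 96×96×64 feature-map shape is an illustrative early-backbone shape, not taken from the actual MoveNet graph; the hidden size of 312 is TinyBERT's published 4-layer configuration:

```python
# INT8 tensor footprints: one byte per element. The CNN feature-map shape
# is an assumed early-backbone shape for illustration; 312 is TinyBERT's
# hidden size in its published 4-layer configuration.

def tensor_bytes(*dims):
    n = 1
    for d in dims:
        n *= d
    return n

print(tensor_bytes(192, 192, 3))   # camera input: 110_592 B ≈ 110 KB
print(tensor_bytes(96, 96, 64))    # early CNN feature map: ~576 KB
print(tensor_bytes(512, 312))      # one TinyBERT activation: ~156 KB
```

The input tensors are comparable in size; the gap opens at the intermediate activations, where CNN feature maps blow up with spatial resolution and channel depth while Transformer activations stay bounded by sequence length × hidden size.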
The three deployments answer the central question of this project with quantitative precision. Here are the four conclusions we draw from the data.
MoveNet achieves 94.7% NPU offload and 22 ms latency — well within real-time requirements for a 30 fps application. YOLOv8n achieves 87.9% and 32 ms, also real-time and with the added capability of detecting multiple people simultaneously. For CNN-based vision tasks, the Neural-ART NPU delivers on its 600 GOPS promise.
YOLOv8n's 18 SW epochs (vs MoveNet's 4) come entirely from the detection head — NMS, reshape, transpose. The backbone is still 100% on the NPU. The 7-point gap in offload rate (94.7% → 87.9%) translates to only a 10 ms latency increase. Models with complex post-processing heads can still achieve real-time performance on this hardware.
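Why the 7-point gap costs only ~10 ms becomes clearer when the fallbacks are priced individually: the detection head's SW epochs are cheap tensor shuffles, not heavy math. The per-operation costs below are assumptions chosen to match the latency scales quoted earlier, not measurements:

```python
# The same offload gap costs very different latency depending on WHAT falls
# back to the CPU. Per-op costs (ms) are assumed for illustration only.

def sw_penalty_ms(sw_epochs):
    cost = {"reshape": 0.3, "transpose": 0.4, "nms": 1.5,
            "softmax": 8.0, "layernorm": 5.0}
    return sum(cost[op] for op in sw_epochs)

detection_head = ["reshape"] * 8 + ["transpose"] * 8 + ["nms"] * 2   # 18 epochs
print(f"detection-head fallback: {sw_penalty_ms(detection_head):.1f} ms")
print(f"8 softmax fallbacks:     {sw_penalty_ms(['softmax'] * 8):.1f} ms")
```

Eighteen cheap fallbacks add single-digit milliseconds; a handful of Softmax fallbacks cost more than YOLOv8n's entire detection head.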
TinyBERT's 96 SW epochs are not a deployment issue — they reflect a fundamental mismatch between Transformer operations and the Neural-ART instruction set. Softmax, LayerNorm, and attention MatMul are the core of every Transformer block. You cannot remove them without changing the architecture. The 64.4% NPU offload and >100 ms latency make real-time Transformer inference impossible on this hardware in its current form.
TinyBERT uses only 2.3 MB total (vs 4.6 MB for MoveNet) — it fits comfortably in the available memory. The problem is not space but compute: the 96 CPU fallback operations create a pipeline stall at every Transformer block boundary. Future NPU designs that include Softmax and LayerNorm acceleration units would dramatically close this gap.
This project opens several directions for future investigation:
Quantization-aware training (QAT) could reduce the accuracy loss that post-training quantization (PTQ) causes on TinyBERT, potentially enabling lower-precision representations that better exploit the NPU's INT8 arithmetic.
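The error PTQ introduces can be made concrete with a minimal fake-quantization round trip, the same operation QAT inserts into the training loop so the model learns to absorb it. This is a pure-stdlib sketch of symmetric per-tensor INT8 quantization, not the ST toolchain's actual scheme:

```python
# Fake-quantization round trip (symmetric per-tensor INT8). Sketch only:
# the real deployment toolchain's quantization scheme may differ.

def quantize_int8(x, scale):
    q = max(-128, min(127, round(x / scale)))
    return q * scale                     # dequantized value

weights = [0.013, -0.402, 0.877, -1.105]       # illustrative weight values
scale = max(abs(w) for w in weights) / 127     # symmetric scale
errors = [abs(w - quantize_int8(w, scale)) for w in weights]
print(f"max round-trip error: {max(errors):.5f}")
```

PTQ applies this rounding after training and the model simply suffers the error; QAT exposes the same rounding during training, so the weights settle at values that survive the round trip.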
MoveNet at 224×224 achieves 62.3% OKS vs 57.6% at 192×192. The latency increase (27.6 ms vs 22 ms) is still within real-time bounds. A systematic study of the resolution-accuracy-latency trade-off would guide model selection for different applications.
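As a first-order model for such a study, CNN latency scales roughly with input pixel count when the backbone is fully offloaded. A quick check against the two measured MoveNet points (the quadratic-scaling assumption is ours):

```python
# First-order resolution scaling: latency ∝ pixel count (assumed model).
# Checked against the two measured MoveNet points (22 ms @ 192, 27.6 ms @ 224).

def scaled_latency_ms(base_ms, base_res, new_res):
    return base_ms * (new_res / base_res) ** 2

est = scaled_latency_ms(22.0, 192, 224)
print(f"predicted 224x224 latency: {est:.1f} ms (measured: 27.6 ms)")
```

The pixel-count model predicts ~29.9 ms; the measured 27.6 ms sits below it, suggesting a fixed per-inference overhead that does not scale with resolution — exactly the kind of effect a systematic study would isolate.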
TinyBERT's low SRAM footprint (251 KB) leaves room to run it alongside a vision model. A cascaded pipeline — YOLOv8n detects people, TinyBERT classifies actions — could be explored as an embedded multimodal system.
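A rough SRAM budget check makes the feasibility argument concrete. The 4.2 MB figure is our assumption for the usable on-chip SRAM pool on the STM32N6; the real budget depends on linker placement and which regions the two models can share:

```python
# Rough SRAM budget for a cascaded YOLOv8n + TinyBERT pipeline.
# SRAM_BUDGET_KB is an assumed usable on-chip total; the YOLOv8n figures
# are the approximate per-region values from the memory table above.

SRAM_BUDGET_KB = 4200                    # assumed usable on-chip SRAM
yolo_sram_kb   = 900 + 400 + 448         # cpuRAM2 + npuRAM4 + npuRAM5 (~)
tinybert_kb    = 251

used = yolo_sram_kb + tinybert_kb
print(f"combined SRAM: {used} KB of {SRAM_BUDGET_KB} KB budget")
```

Even with no buffer sharing between the two models, the combined footprint stays under half the assumed budget, so the cascade is memory-feasible; the open question is scheduling, not space.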
MobileViT and EfficientViT replace some Transformer blocks with CNN-equivalent operations while maintaining global context modelling. These could achieve higher NPU offload than pure Transformers while retaining some of the representational power of attention.
The Neural-ART NPU is a highly effective accelerator for CNN-based inference on the STM32N6570-DK. It achieves real-time performance on both single-person (MoveNet, 22 ms) and multi-person (YOLOv8n, 32 ms) pose estimation. However, it is fundamentally limited for Transformer architectures: the absence of Softmax, LayerNorm, and general MatMul acceleration in the NPU instruction set forces 35% of TinyBERT's computation onto the CPU, making real-time Transformer inference impossible. The architecture of the NPU — built around the CONVACC unit and STRENG stream engines — reflects a deliberate design choice for the vision workloads that dominate embedded AI today. As Transformer models move to the edge, this trade-off will become the central engineering challenge for next-generation NPU design.