
Chapter 6 — Results & Analysis

What the Numbers Tell Us

This chapter presents a cross-model comparative analysis that connects the architectural decisions of the Neural-ART NPU to the performance numbers we measured. It answers the central question of the project: how well does a CNN-centric NPU handle different neural network architectures?

6.1 — NPU Offload Rate: Architecture Determines Everything

The most important finding of this project is not a single number — it is a pattern. The NPU offload rate is not a function of model size or parameter count; it is a function of which operations the model uses. The three models we deployed demonstrate this with unusual clarity.

[Figure: NPU offload rate vs architecture type. MoveNet (pure CNN) 94.7%, YOLOv8n (CNN + detection head) 87.9%, TinyBERT (Transformer) 64.4%; a 30.3-point spread between best and worst.]

The 30-point gap between MoveNet (94.7%) and TinyBERT (64.4%) is not due to model size — TinyBERT actually has fewer weights (2.06 MB vs 2.92 MB). It is entirely due to the operations required by each architecture:

CNN operations — NPU-friendly
  • Conv2D / DepthwiseConv2D — the CONVACC unit was built exactly for this: slide a filter over a spatial tensor and accumulate dot products
  • ReLU / ReLU6 — simple threshold, trivially fused with the preceding Conv in hardware
  • BatchNorm — scale and shift, executed as a multiply-add in the ACTIV unit
  • Residual Add — element-wise add between two buffers, supported natively
  • MaxPool / AvgPool — spatial reduction, supported by the POOL unit
Transformer operations — NPU-hostile
  • Softmax — requires computing exp(x) over a full sequence, then normalising. No exp unit in CONVACC (see the CPU-fallback sketch after this list).
  • LayerNorm — mean + variance over a sequence, then scale/shift. Requires Dequantize → Reciprocal → Mul → Sub chain.
  • MatMul (Q·Kᵀ) — general matrix multiplication with transposed operand. CONVACC is optimised for Conv kernels, not square matrix products.
  • Split / Transpose — data layout changes required by multi-head attention. The stream engine cannot prefetch non-sequential patterns.
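
To make the fallback cost concrete, the following is a minimal C sketch of what a Softmax SW epoch amounts to once the tensor has been dequantised to float on the Cortex-M55. It illustrates the structure of the operation only; it is not the code generated by the ST Edge AI tools:

```c
/* Illustrative sketch only, not the generated ST Edge AI runtime code.
 * Softmax needs one expf() per element plus full passes over the sequence
 * to normalise; none of this maps onto the CONVACC datapath. */
#include <math.h>
#include <stddef.h>

static void sw_softmax_f32(const float *in, float *out, size_t len)
{
    /* Pass 1: find the maximum for numerical stability. */
    float max = in[0];
    for (size_t i = 1; i < len; i++) {
        if (in[i] > max) max = in[i];
    }

    /* Pass 2: exponentiate and accumulate the normaliser. */
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        out[i] = expf(in[i] - max);
        sum += out[i];
    }

    /* Pass 3: scale. Three sweeps over the sequence, repeated for every
       attention row, every head, and every layer. */
    float inv = 1.0f / sum;
    for (size_t i = 0; i < len; i++) {
        out[i] *= inv;
    }
}
```

Run over every head and every row of a 512-token attention map, this is where the millisecond-scale Softmax cost discussed in Section 6.2 comes from.
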

6.2 — Latency: Why SW Epochs Dominate the Runtime

The latency numbers reveal something important: a small number of SW epochs can dominate the total inference time far beyond their numerical proportion.

[Figure: inference latency in ms, lower is better. MoveNet 22 ms (real-time), YOLOv8n 32 ms (real-time), TinyBERT >100 ms (not real-time); the 33 ms line marks the 30 fps threshold.]

Both CNN models achieve real-time performance (below 33 ms for 30 fps). TinyBERT exceeds 100 ms despite having fewer total weights than MoveNet. The reason is the cost structure of the two execution modes:

EC epoch cost (NPU)

An EC epoch runs entirely in the Neural-ART hardware. The STRENG stream engines prefetch weights from OctoFlash while the CONVACC units compute — weight loading and computation overlap. For a typical Conv layer: ~0.15–0.3 ms.

71 EC epochs × ~0.3 ms avg ≈ 21 ms

SW epoch cost (CPU)

A SW epoch runs on the Cortex-M55 with Helium SIMD. It requires CPU cycles, memory bandwidth, and cache invalidation when crossing NPU/CPU boundaries. Softmax over a 512-token sequence: ~5–15 ms per occurrence.

96 SW epochs × ~1 ms avg ≈ 80–100 ms
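
As a sanity check, a back-of-the-envelope additive model reproduces the two estimates above. The per-epoch averages are the illustrative figures quoted in this section, not per-layer timings exported from the profiler:

```c
/* Crude additive epoch-cost model using the averages quoted above.
 * Real epoch times vary widely (a Softmax SW epoch costs far more than a
 * transpose), so these are order-of-magnitude estimates only. */
#include <stdio.h>

static double epochs_ms(int n_epochs, double avg_ms)
{
    return n_epochs * avg_ms;
}

int main(void)
{
    /* MoveNet: 71 EC epochs, weight prefetch overlapped with compute. */
    printf("MoveNet  EC epochs: ~%.0f ms\n", epochs_ms(71, 0.3));  /* ~21 ms */

    /* TinyBERT: 96 SW epochs on the Cortex-M55 at roughly 1 ms on average. */
    printf("TinyBERT SW epochs: ~%.0f ms\n", epochs_ms(96, 1.0));  /* ~96 ms */
    return 0;
}
```

Even this crude model, which ignores synchronisation entirely, already puts TinyBERT's SW epochs alone near the 100 ms mark.
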
Key insight — the boundary crossing cost
Every time execution switches from an EC epoch to a SW epoch (or vice versa), the firmware must synchronise the NPU and CPU memory views. The NPU writes activations to npuRAM5 with no cache coherency guarantee — the CPU must invalidate its cache before reading those values. TinyBERT crosses this boundary 96 times, each crossing adding latency beyond the raw computation cost. MoveNet crosses it only 4 times — all at the very end of the network, after all the heavy computation is done.
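
The sketch below shows what that synchronisation involves at the CMSIS level on the Cortex-M55. The buffer name, size, and placement are illustrative placeholders; the generated ST Edge AI runtime performs the equivalent cache maintenance internally at each epoch boundary:

```c
/* Conceptual sketch of an NPU->CPU boundary crossing, assuming the standard
 * CMSIS cache-maintenance calls available on the Cortex-M55. Buffer name and
 * size are placeholders, not symbols from the generated code. */
#include <stdint.h>
#include "stm32n6xx.h"                     /* CMSIS device header (SCB_* ops) */

#define ACT_BUF_BYTES  (251u * 1024u)      /* placeholder activation size     */
extern int8_t act_buf[ACT_BUF_BYTES];      /* written by the NPU into npuRAM5,
                                              with no coherency guarantee for
                                              the CPU's data cache            */

void sw_epoch_boundary(void)
{
    /* NPU -> CPU: drop stale cache lines so the CPU reads the values the
       NPU actually wrote to SRAM. */
    SCB_InvalidateDCache_by_Addr(act_buf, ACT_BUF_BYTES);

    /* ... SW epoch runs here on the Cortex-M55 (Softmax, LayerNorm, ...) ... */

    /* CPU -> NPU: flush dirty lines so the NPU sees the CPU's results before
       the next EC epoch starts. */
    SCB_CleanDCache_by_Addr(act_buf, ACT_BUF_BYTES);
}
```
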

6.3 — Memory: An Unexpected Finding

Memory usage reveals a counterintuitive result: TinyBERT uses significantly less SRAM than MoveNet, despite having more total epochs.

Memory region    MoveNet     YOLOv8n    TinyBERT    Notes
cpuRAM2          864 KB      ~900 KB    0 KB        SW epoch buffers for resize ops
npuRAM4          378 KB      ~400 KB    0 KB        nn_in buffer (camera frame)
npuRAM5          432 KB      ~448 KB    251 KB      nn_out + intermediate activations
octoFlash        2.924 MB    ~3.2 MB    2.062 MB    Model weights (read-only)
Total            4.559 MB    ~5.0 MB    2.308 MB

TinyBERT uses only 251 KB of SRAM — less than MoveNet's npuRAM5 alone. This is because BERT processes sequential text tokens, not large 2D image tensors. A 192×192 input image requires 110 KB of buffer just for the input. A 512-token text sequence requires far less spatial memory, even though it has more complex attention patterns.
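
A quick calculation makes the difference concrete. The image figure follows directly from the input shape; the hidden size used for the token sequence (128) is an illustrative assumption, not a value measured on TinyBERT:

```c
/* Activation-buffer arithmetic behind the SRAM figures above. INT8 tensors,
 * so one byte per element. The hidden size of 128 is assumed for
 * illustration only. */
#include <stdio.h>

int main(void)
{
    unsigned img_bytes = 192u * 192u * 3u;   /* 110 592 B: the ~110 KB input
                                                buffer quoted above           */
    unsigned seq_bytes = 512u * 128u;        /*  65 536 B: a full 512-token
                                                activation at hidden size 128 */

    printf("image input buffer : %u bytes\n", img_bytes);
    printf("token activations  : %u bytes\n", seq_bytes);
    return 0;
}
```
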

Practical implication: TinyBERT's low SRAM usage means it could theoretically be deployed alongside a vision model on the same board — the memory footprints do not conflict. The bottleneck is latency, not memory. This suggests that hybrid architectures (vision + language on the same MCU) could be viable if the Transformer operations were better supported by the NPU hardware.

6.4 — Conclusions

The three deployments answer the central question of this project with quantitative precision. Here are the four conclusions we draw from the data.

1. The Neural-ART NPU is excellent for CNN inference

MoveNet achieves 94.7% NPU offload and 22 ms latency — well within real-time requirements for a 30 fps application. YOLOv8n achieves 87.9% and 32 ms, also real-time and with the added capability of detecting multiple people simultaneously. For CNN-based vision tasks, the Neural-ART NPU delivers on its 600 GOPS promise.

2. Detection heads add overhead but remain manageable

YOLOv8n's 18 SW epochs (vs MoveNet's 4) come entirely from the detection head — NMS, reshape, transpose. The backbone is still 100% on the NPU. The 7-point gap in offload rate (94.7% → 87.9%) translates to only a 10 ms latency increase. Models with complex post-processing heads can still achieve real-time performance on this hardware.

3. Transformers hit a hard architectural wall

TinyBERT's 96 SW epochs are not a deployment issue — they reflect a fundamental mismatch between Transformer operations and the Neural-ART instruction set. Softmax, LayerNorm, and attention MatMul are the core of every Transformer block. You cannot remove them without changing the architecture. The 64.4% NPU offload and >100 ms latency make real-time Transformer inference impossible on this hardware in its current form.

4. Memory is not the bottleneck — operations are

TinyBERT uses only 2.3 MB total (vs 4.6 MB for MoveNet) — it fits comfortably in the available memory. The problem is not space but compute: the 96 CPU fallback operations create a pipeline stall at every Transformer block boundary. Future NPU designs that include Softmax and LayerNorm acceleration units would dramatically close this gap.

6.5 — Future Work

This project opens several directions for future investigation:

⚡ Quantization-Aware Training for Transformers

QAT could reduce the accuracy loss from PTQ on TinyBERT, potentially enabling lower-precision representations that better exploit the NPU's INT8 arithmetic.

📈 Higher resolution models

MoveNet at 224×224 achieves 62.3% OKS vs 57.6% at 192×192. The latency increase (27.6 ms vs 22 ms) is still within real-time bounds. A systematic study of the resolution-accuracy-latency trade-off would guide model selection for different applications.

👥 Multi-model pipeline

TinyBERT's low SRAM footprint (251 KB) leaves room to run it alongside a vision model. A cascaded pipeline — YOLOv8n detects people, TinyBERT classifies actions — could be explored as an embedded multimodal system.

🛠 Hybrid CNN-Transformer architectures

MobileViT and EfficientViT combine convolutional blocks with lightweight attention, retaining global context modelling while reducing the number of NPU-hostile operations. These hybrids could achieve a higher NPU offload rate than pure Transformers while keeping some of the representational power of attention.

Final answer to the research question

The Neural-ART NPU is a highly effective accelerator for CNN-based inference on the STM32N6570-DK. It achieves real-time performance on both single-person (MoveNet, 22 ms) and multi-person (YOLOv8n, 32 ms) pose estimation. However, it is fundamentally limited for Transformer architectures: the absence of Softmax, LayerNorm, and general MatMul acceleration in the NPU instruction set forces 35% of TinyBERT's computation onto the CPU, making real-time Transformer inference impossible. The architecture of the NPU — built around the CONVACC unit and STRENG stream engines — reflects a deliberate design choice for the vision workloads that dominate embedded AI today. As Transformer models move to the edge, this trade-off will become the central engineering challenge for next-generation NPU design.
