
Chapter 5 — Case Studies

Three Models, One Board

This chapter presents detailed profiling data for each deployment — real numbers extracted directly from the ST Edge AI Core reports. Each model tells a different story about what the Neural-ART NPU can and cannot accelerate.

MoveNet • CNN • 94.7% NPU
YOLOv8n • CNN+Head • 87.9% NPU
TinyBERT • Transformer • 64.4% NPU
Case Study 1
MoveNet Lightning — The CNN Benchmark

Architecture & why we chose it

MoveNet Lightning is a single-person pose estimation model developed by Google, based on a MobileNetV2 backbone with a Feature Pyramid Network decoder. It was designed specifically for real-time inference on edge devices — making it the natural baseline for our CNN benchmark. The ST variant (st_movenet_lightning_heatmaps_192_int8_pc) was retrained by STMicroelectronics on a custom COCO subset and pre-quantized to INT8 per-channel — meaning it required no quantization step from us, only deployment.

The model outputs heatmaps rather than direct coordinates: a (48×48×13) tensor where each of the 13 channels is a probability map for one keypoint. The C firmware postprocessor in display_spe.c decodes these by finding the argmax of each channel and converting to screen coordinates.
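For reference, the decoding logic reduces to an argmax per channel followed by a grid-to-pixel scaling. The NumPy sketch below illustrates it (the firmware does the same in C inside display_spe.c; the function name and the cell-centre scaling convention here are illustrative, not taken from the project sources):

import numpy as np

def decode_heatmaps(heatmaps, img_w=192, img_h=192):
    """Decode a (48, 48, 13) heatmap tensor into 13 (x, y, score) keypoints."""
    grid_h, grid_w, num_kpts = heatmaps.shape
    keypoints = []
    for k in range(num_kpts):
        channel = heatmaps[:, :, k]
        row, col = np.unravel_index(np.argmax(channel), (grid_h, grid_w))
        x = (col + 0.5) * img_w / grid_w   # grid cell centre -> input-image pixel
        y = (row + 0.5) * img_h / grid_h
        keypoints.append((x, y, float(channel[row, col])))
    return keypoints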

Profiling data — real numbers from ST Edge AI Core v2.1.0

Epoch breakdown
71 EC (94.7%)
4 SW (5.3%)
Total: 75 epochs
SW epochs 59, 63, 67: Resize bilinear
SW epoch 74: DequantizeLinear
All 4 SW in decoder — backbone 100% NPU
Memory allocation
cpuRAM2: 864 KB (84% of pool)
npuRAM4: 378 KB (84% of pool)
npuRAM5: 432 KB (96% of pool)
octoFlash: 2.924 MB (5% of pool)
Total: 4.559 MB
Inference time: 22 ms
NPU offload: 94.7%
OKS (192×192): 57.6%
MACs total: 242M
Why 94.7% and not 100%: The 4 SW epochs are all in the decoder — not the backbone. The entire MobileNetV2 feature extractor (epochs 1–58) runs 100% on the NPU. Bilinear resize and dequantization fall back to the Cortex-M55 + Helium SIMD. This is the best possible result for a CNN on this hardware.
Case Study 2
YOLOv8n-pose — CNN with Detection Head

Architecture & why it is harder than MoveNet

YOLOv8n-pose is a multi-person pose estimation model from Ultralytics — it detects all people in the frame simultaneously and outputs bounding boxes with 17 COCO keypoints per person. Unlike MoveNet, it has two distinct parts: a CNN backbone (CSPDarknet + PAN-FPN) that runs almost entirely on the NPU, and a detection head that includes reshape, transpose, softmax, and NMS operations — many of which are not supported by the NPU.

This model was not in the ST Model Zoo — it required manual quantization via chain_qd. The float PyTorch model was exported to .tflite and quantized with Post-Training Quantization using a COCO calibration subset.
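The sketch below shows a standard TensorFlow Lite post-training quantization pass of the kind described above, assuming the float model is available as a TensorFlow SavedModel; the file names, the calibration array, and the sample count are hypothetical, and the project's actual run went through chain_qd rather than this script:

import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Hypothetical calibration set: pre-resized COCO images saved as a NumPy array.
    for img in np.load("coco_calib_images.npy")[:200]:
        yield [img[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov8n_pose_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full-integer kernels so every layer is a candidate for NPU mapping.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("yolov8n_pose_int8.tflite", "wb") as f:
    f.write(converter.convert())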

Epoch breakdown
131 EC (87.9%)
18 SW (12.1%)
Total: 149 epochs
SW epochs (head): Reshape, Transpose, Softmax, Slice, Dequantize
Backbone 100% NPU — head partially on CPU
Results
Inference time: 32 ms
NPU offload: 87.9%
Keypoints: 17 (COCO)
Multi-person: Yes
NMS: on CPU (SW epoch)
Why 87.9% and not 94.7%: The gap of roughly seven percentage points comes from the detection head. YOLOv8's head applies Non-Maximum Suppression (NMS) — a data-dependent operation where the number of detections varies per frame. The NPU epoch controller requires a fixed, predictable execution schedule, so NMS and the associated reshape/transpose operations must fall back to the CPU. The backbone (all convolutions) is still 100% on the NPU.
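To make the data dependence concrete, here is a minimal NumPy sketch of greedy NMS (illustrative only, not the firmware implementation). The number of loop iterations and the size of the result change with every frame, which is exactly what a fixed epoch schedule cannot express:

import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, score_thr=0.25, iou_thr=0.5):
    """Keep the highest-scoring boxes, suppressing overlapping detections."""
    order = np.argsort(scores)[::-1]
    order = order[scores[order] >= score_thr]   # how many survive depends on the frame
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        order = order[1:][iou(boxes[best], boxes[order[1:]]) < iou_thr]
    return keep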
Case Study 3
TinyBERT — The Transformer Challenge

Architecture & why it is fundamentally different

TinyBERT is a compressed version of BERT — a Transformer-based language model. We deployed it as a proof-of-concept to test the limits of the Neural-ART NPU: what happens when you run a Transformer, not a CNN, on an NPU designed for CNNs?

The answer is visible in the epoch breakdown: 96 out of 270 epochs fall back to the CPU. The reason is architectural — Transformer blocks contain operations that are fundamentally incompatible with the CONVACC-centric design of the Neural-ART NPU.

SW epoch analysis — what falls back and why (real data)

The following is extracted directly from the network_generate_report.txt of the TinyBERT deployment (run 2025_07_16_22_43_20). Each Transformer block generates the same pattern of SW epochs — repeated 4 times (one per BERT layer):

# Pattern repeated per Transformer block (4 blocks total):
epoch N    EC   # Q/K/V linear projection (Conv) — NPU
epoch N+1  -SW- Transpose         # Q·Kᵀ requires reshape before MatMul
epoch N+3  EC   # MatMul partial — NPU
epoch N+4  -SW- Conv ×4           # attention heads: non-standard conv pattern
epoch N+9  -SW- Softmax           # attention weights — no NPU Softmax unit
epoch N+10 -SW- Split             # multi-head split — data-dependent
epoch N+11 EC   # value projection — NPU
epoch N+12 -SW- Conv ×4           # output projection: non-standard
...
epoch N+K  -SW- DequantizeLinear   # LayerNorm requires float32
epoch N+K+1 -SW- Reciprocal         # 1/std for LayerNorm normalisation
epoch N+K+2 -SW- QuantizeLinear      # re-quantize after LayerNorm
epoch N+K+3 -SW- Mul                # scale step of LayerNorm
epoch N+K+4 -SW- QuantizeLinear      # second re-quantize
epoch N+K+5 -SW- Sub                # mean subtraction of LayerNorm

Epochs are listed in compiler-emission order, which does not match the mathematical LayerNorm formula y = γ·(x−μ)/σ + β: kernel fusion and quantization boundaries reorder the primitive ops (Sub, Mul, Reciprocal) so the sequence emitted to the NPU scheduler no longer mirrors the formula.
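As an illustration of why the whole normalisation block lands on the CPU, the NumPy sketch below emulates the fallback chain for one LayerNorm: dequantize to float32, compute the statistics, apply scale and shift, requantize. Parameter names are illustrative, and the op order follows the textbook formula rather than the compiler-emission order discussed above:

import numpy as np

def layernorm_int8_fallback(x_q, in_scale, in_zp, gamma, beta,
                            out_scale, out_zp, eps=1e-5):
    """x_q: INT8 activations of shape (seq_len, hidden); gamma/beta: float32."""
    x = (x_q.astype(np.float32) - in_zp) * in_scale      # DequantizeLinear
    mu = x.mean(axis=-1, keepdims=True)                  # mean per token
    var = x.var(axis=-1, keepdims=True)                  # variance per token
    inv_std = 1.0 / np.sqrt(var + eps)                   # Reciprocal
    y = gamma * (x - mu) * inv_std + beta                # Sub, Mul
    y_q = np.round(y / out_scale) + out_zp               # QuantizeLinear
    return np.clip(y_q, -128, 127).astype(np.int8)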

Epoch breakdown
174 EC (64.4%)
96 SW (35.6%)
Total: 270 epochs
SW types: Softmax, Transpose, Split, Conv (attention), DequantizeLinear, Reciprocal, QuantizeLinear, Mul, Sub
These are the core Transformer ops — not post-processing
Memory allocation
cpuRAM2: 0 KB (0% of pool)
npuRAM5: 251 KB (56% of pool)
octoFlash: 2.062 MB (3% of pool)
Total: 2.308 MB
Smaller than MoveNet despite more epochs — BERT is parameter-efficient per layer. cpuRAM2 = 0 because SW epochs use npuRAM5 for all intermediate buffers.
Inference time: >100 ms (real-time impossible)
NPU offload: 64.4% (35.6% on CPU)
Total epochs: 270 (3.6× more than MoveNet)
Why 64.4% and not 87.9%: This is a structural incompatibility, not a tuning gap. LayerNormalization computes the mean and variance of each token's hidden vector, then applies a learned scale and shift. In the quantized graph this expands to DequantizeLinear → Reciprocal → QuantizeLinear → Mul → QuantizeLinear → Sub — a chain of six SW epochs per layer, repeated across all four BERT layers. Softmax requires an exponential followed by a normalisation — also not in the NPU instruction set. These are not peripheral operations like bilinear resize in MoveNet — they are the core of every Transformer block. No amount of compiler optimisation can map them to CONVACC.
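The attention softmax follows the same dequantize, compute, requantize pattern; a minimal NumPy sketch with illustrative parameter names:

import numpy as np

def softmax_int8_fallback(logits_q, in_scale, in_zp, out_scale, out_zp):
    """Attention softmax on the CPU: no exponential unit exists in CONVACC."""
    x = (logits_q.astype(np.float32) - in_zp) * in_scale    # DequantizeLinear
    x = x - x.max(axis=-1, keepdims=True)                   # numerical stability
    e = np.exp(x)
    p = e / e.sum(axis=-1, keepdims=True)
    p_q = np.round(p / out_scale) + out_zp                  # QuantizeLinear
    return np.clip(p_q, -128, 127).astype(np.int8)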

Cross-model comparison

[Chart: NPU offload rate per model — MoveNet Lightning 71 EC / 4 SW epochs (94.7% NPU), YOLOv8n-pose 131 EC / 18 SW (87.9% NPU), TinyBERT 174 EC / 96 SW (64.4% NPU); SW epochs run on the CPU]
Metric | MoveNet | YOLOv8n | TinyBERT
Architecture | CNN (MobileNetV2) | CNN + Det. Head | Transformer (BERT)
Total epochs | 75 | 149 | 270
EC epochs (NPU) | 71 | 131 | 174
SW epochs (CPU) | 4 | 18 | 96
NPU offload | 94.7% | 87.9% | 64.4%
Weights (octoFlash) | 2.924 MB | ~3.2 MB | 2.062 MB
Activations (SRAM) | 1.635 MB | ~1.8 MB | 251 KB
Inference time | 22 ms | 32 ms | >100 ms
SW epoch cause | Resize, Dequantize | NMS, Reshape | Softmax, LayerNorm, Attention