STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for Neural Network Deployment on STM32N6 NPU - Politecnico di Milano 2024-2025
This page collects detailed profiling data for each deployment: real numbers extracted directly from the ST Edge AI Core reports. Each model tells a different story about what the Neural-ART NPU can and cannot accelerate.
MoveNet Lightning is a single-person pose estimation model developed by Google,
based on a MobileNetV2 backbone with a Feature Pyramid Network
decoder. It was designed specifically for real-time inference on edge devices —
making it the natural baseline for our CNN benchmark.
The ST variant (st_movenet_lightning_heatmaps_192_int8_pc) was
retrained by STMicroelectronics on a custom COCO subset and pre-quantized
to INT8 per-channel — meaning it required no quantization step from us,
only deployment.
The model outputs heatmaps rather than direct coordinates:
a (48×48×13) tensor where each of the 13 channels is a probability
map for one keypoint. The C firmware postprocessor in
display_spe.c decodes these
by finding the argmax of each channel and converting to screen coordinates.
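The decode itself is a handful of lines. The NumPy sketch below reproduces the same argmax logic for readability; it illustrates what display_spe.c does in C, it is not the firmware code itself, and the half-cell offset used when rescaling is an assumption.

```python
import numpy as np

def decode_heatmaps(heatmaps, img_w, img_h):
    """Decode a (48, 48, 13) heatmap tensor into 13 (x, y, score) keypoints.

    Mirrors the argmax-based logic of the C postprocessor: for each
    channel, take the location of the peak probability and rescale it
    from heatmap-grid coordinates to screen coordinates.
    """
    grid_h, grid_w, n_kpts = heatmaps.shape  # 48, 48, 13
    keypoints = []
    for k in range(n_kpts):
        channel = heatmaps[:, :, k]
        idx = np.argmax(channel)                       # flat index of the peak
        row, col = np.unravel_index(idx, channel.shape)
        score = channel[row, col]
        # Map grid cell to pixel coordinates (cell-center offset assumed)
        x = (col + 0.5) / grid_w * img_w
        y = (row + 0.5) / grid_h * img_h
        keypoints.append((x, y, score))
    return keypoints
```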
Memory footprint reported for the MoveNet deployment:

| Memory pool | Used | Utilization |
|---|---|---|
| cpuRAM2 | 864 KB | 84% |
| npuRAM4 | 378 KB | 84% |
| npuRAM5 | 432 KB | 96% |
| octoFlash | 2.924 MB | 5% |
| Total | 4.559 MB | |
YOLOv8n-pose is a multi-person pose estimation model from Ultralytics — it detects all people in the frame simultaneously and outputs bounding boxes with 17 COCO keypoints per person. Unlike MoveNet, it has two distinct parts: a CNN backbone (CSPDarknet + PAN-FPN) that runs almost entirely on the NPU, and a detection head that includes reshape, transpose, softmax, and NMS operations — many of which are not supported by the NPU.
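To see why NMS in particular resists acceleration, here is a minimal greedy IoU-based NMS in NumPy. It is a generic sketch (the 0.45 IoU threshold is illustrative, not taken from our deployment); the point is the data-dependent loop, which has no mapping onto a fixed convolution pipeline.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """Greedy non-maximum suppression over (N, 4) boxes in (x1, y1, x2, y2).

    Which boxes survive depends on the scores of this particular frame,
    so the loop cannot be expressed as a static convolution graph and
    runs on the CPU as a SW epoch.
    """
    order = np.argsort(scores)[::-1]      # candidates, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box against all remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thr]  # drop overlapping boxes
    return keep
```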
This model was not in the ST Model Zoo — it required manual quantization
via chain_qd. The float PyTorch model was exported to
.tflite and quantized with Post-Training Quantization
using a COCO calibration subset.
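For reference, the sketch below shows a generic TFLite post-training quantization flow of the kind chain_qd automates; the file paths, the calibration array, and the sample count are placeholders, not our actual pipeline, and the PyTorch-to-SavedModel export step is out of scope here.

```python
import numpy as np
import tensorflow as tf

# Representative dataset: a few hundred preprocessed COCO frames are
# enough for the converter to calibrate activation ranges.
def representative_data_gen():
    calib = np.load("coco_calibration_subset.npy")  # placeholder path
    for img in calib[:200]:
        # Same preprocessing as inference: float32, NHWC, batch of 1
        yield [img[np.newaxis].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov8n_pose_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full INT8: the Neural-ART compiler expects integer ops end to end
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("yolov8n_pose_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Note that this converter quantizes convolution weights per-channel by default, which matches the per-channel INT8 scheme of the pre-quantized ST models described above.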
Key figures for the YOLOv8n-pose deployment:

| Metric | Value |
|---|---|
| Inference time | 32 ms |
| NPU offload | 87.9% |
| Keypoints | 17 (COCO) |
| Multi-person | Yes |
| NMS execution | CPU (SW epoch) |
TinyBERT is a compressed version of BERT — a Transformer-based language model. We deployed it as a proof-of-concept to test the limits of the Neural-ART NPU: what happens when you run a Transformer, not a CNN, on an NPU designed for CNNs?
The answer is visible in the epoch breakdown: 96 out of 270 epochs fall back to the CPU. The reason is architectural — Transformer blocks contain operations that are fundamentally incompatible with the CONVACC-centric design of the Neural-ART NPU.
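A schematic single-head attention step makes the split concrete. The EC/SW annotations in the sketch below are our reading of the epoch breakdown (static weight matmuls quantize to accelerator-friendly operations, softmax-style reductions do not); they are not labels taken from the report itself.

```python
import numpy as np

def attention_head(x, wq, wk, wv):
    """One self-attention head, annotated with where each op can run.

    Matrix multiplies against static weights resemble 1x1 convolutions,
    so the compiler can map them onto CONVACC units (EC epochs). Softmax
    requires exp plus row-wise reductions and a divide, which have no
    CONVACC equivalent and fall back to the CPU (SW epochs).
    """
    q, k, v = x @ wq, x @ wk, x @ wv            # EC: static weight matmuls
    scores = (q @ k.T) / np.sqrt(q.shape[-1])   # activation-by-activation matmul
    # SW: softmax = exp + reduce-max + reduce-sum + divide
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ v                          # attention-weighted values
```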
The following is extracted directly from the
network_generate_report.txt of the TinyBERT deployment
(run 2025_07_16_22_43_20). Each Transformer block generates the same
pattern of SW epochs — repeated 4 times (one per BERT layer):
Epochs are listed in compiler-emission order, which does not match the
mathematical LayerNorm formula y = γ·(x−μ)/σ + β:
kernel fusion and quantization boundaries reorder the primitive ops
(Sub, Mul, Reciprocal) so the sequence emitted to
the NPU scheduler no longer mirrors the formula.
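Spelled out as primitives, a single LayerNorm already contains the whole zoo of SW-epoch ops. The NumPy sketch below is a reference decomposition, not the compiler's actual emission order:

```python
import numpy as np

def layernorm_as_primitives(x, gamma, beta, eps=1e-5):
    """LayerNorm decomposed into the primitive ops that appear as epochs.

    Mathematically this is y = gamma * (x - mu) / sigma + beta; each line
    maps to one or two graph primitives (ReduceMean, Sub, Mul, Sqrt,
    Reciprocal, Add), which the compiler may fuse or reorder around
    quantization boundaries.
    """
    mu = x.mean(axis=-1, keepdims=True)                       # ReduceMean
    centered = x - mu                                         # Sub
    var = (centered * centered).mean(axis=-1, keepdims=True)  # Mul, ReduceMean
    inv_sigma = 1.0 / np.sqrt(var + eps)                      # Sqrt, Reciprocal
    return gamma * centered * inv_sigma + beta                # Mul, Mul, Add
```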
Memory footprint reported for the TinyBERT deployment:

| Memory pool | Used | Utilization |
|---|---|---|
| cpuRAM2 | 0 KB | 0% |
| npuRAM5 | 251 KB | 56% |
| octoFlash | 2.062 MB | 3% |
| Total | 2.308 MB | |
| Metric | MoveNet | YOLOv8n | TinyBERT |
|---|---|---|---|
| Architecture | CNN (MobileNetV2) | CNN + Det. Head | Transformer (BERT) |
| Total epochs | 75 | 149 | 270 |
| EC epochs (NPU) | 71 | 131 | 174 |
| SW epochs (CPU) | 4 | 18 | 96 |
| NPU offload % | 94.7% | 87.9% | 64.4% |
| Weights (OctoFlash) | 2.924 MB | ~3.2 MB | 2.062 MB |
| Activations (SRAM) | 1.635 MB | ~1.8 MB | 251 KB |
| Inference time | 22 ms | 32 ms | >100 ms |
| SW epoch cause | Resize, Dequantize | NMS, Reshape | Softmax, LayerNorm, Attention |