STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for Neural Network Deployment on STM32N6 NPU - Politecnico di Milano 2024-2025
ST Edge AI Core is the converter tool that takes a quantized neural network
model (.tflite or .onnx), produces optimised C source files, assigns each
layer to either the NPU or the CPU, and generates a detailed profiling
report. This is where Python ends and C begins.
A neural network model file — even a quantized INT8 .tflite —
cannot run directly on a microcontroller. It is a portable format that needs
to be compiled into something a microcontroller can actually execute.
ST Edge AI Core is the tool that performs this translation.
It does four things in one pass:
1. Fuses adjacent operations (Conv + BatchNorm + ReLU into a single kernel), removes redundant transposes, and reformats tensor layouts for the target hardware.
2. Analyses every layer and assigns it to either an EC epoch (executed on the Neural-ART NPU hardware) or a SW epoch (executed on the Cortex-M55 CPU), based on NPU hardware support.
3. Places weights in OctoFlash and assigns activation buffers to the correct SRAM banks (npuRAM4, npuRAM5), based on the .mpool memory configuration file for the STM32N6570-DK.
4. Produces network.c (the layer execution schedule), network_data_params.c (weight arrays in flash), and all supporting headers, ready to compile with STM32CubeIDE.
You never call ST Edge AI Core manually; common_deploy.py invokes it
automatically during deployment. But understanding the command it runs helps
make sense of the output. The invocation used for MoveNet is sketched below,
followed by a table explaining each argument.
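A representative reconstruction of that invocation, assuming the standard `stedgeai generate` entry point; the model filename, Neural-ART profile JSON, and .mpool filename are placeholders, and the exact flag syntax can vary between ST Edge AI Core versions:

```
stedgeai generate \
  --model movenet_lightning_int8.tflite \
  --target stm32n6 \
  --st-neural-art default@neural_art.json \
  --load-mpool stm32n6570-dk.mpool \
  --input-data-type uint8 \
  --enable-epoch-controller \
  -O3 --Oalt-sched --Ocache-opt
```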
| Argument | Meaning |
|---|---|
| --target stm32n6 | Target the STM32N6 NPU backend (Neural-ART) |
| --st-neural-art | Loads the NPU configuration JSON for the STM32N6570-DK board |
| --load-mpool | Memory pool config: tells the compiler which SRAM banks to use and at which addresses |
| --input-data-type uint8 | Camera frames arrive as uint8 — no CPU dequantization needed at input |
| --enable-epoch-controller | Generates the epoch schedule that the NPU runtime uses to interleave EC and SW epochs |
| -O3 --Oalt-sched --Ocache-opt | Full optimisation: alternative scheduling, cache-aware layout, minimum code size |
The following is extracted directly from the
network_generate_report.txt produced by ST Edge AI Core v2.1.0
when we deployed MoveNet Lightning (192×192, INT8, 13 keypoints).
75 total epochs: 71 EC (NPU) and 4 SW (CPU).
| Epoch | Operation | Why on CPU? |
|---|---|---|
| epoch 59 | Resize (bilinear) | Bilinear interpolation requires floating-point position arithmetic. The Neural-ART NPU has no bilinear resize unit. |
| epoch 63 | Resize (bilinear) | Same — second decoder upsample stage. |
| epoch 67 | Resize (bilinear) | Same — third decoder upsample stage. |
| epoch 74 | DequantizeLinear | Converts INT8 output heatmaps to float32 for the C postprocessor. A one-time scale+offset operation at the very end of the network. |
Key insight: these 4 SW epochs are not in the backbone — they are all in the decoder (upsampling) and output stage. The entire MobileNetV2 backbone (epochs 1–58) runs 100% on the NPU. This is why MoveNet achieves 94.7% offload despite having CPU fallback operations.
ST Edge AI Core places every buffer at a specific physical address. The following is the exact memory layout generated for MoveNet:
| Region | Address range | Used | Available | Content |
|---|---|---|---|---|
| cpuRAM2 | 0x34100000–0x34200000 | 864 KB | 1 MB | CPU activations (SW epoch intermediate buffers) |
| npuRAM4 | 0x34270000–0x342E0000 | 378 KB | 448 KB | nn_in: camera frame 192×192×3 input to NPU |
| npuRAM5 | 0x342E0000–0x34350000 | 432 KB | 448 KB | nn_out: heatmaps 48×48×13 output from NPU |
| octoFlash | 0x70380000–0x7066C810 | 2.924 MB | 61 MB | Weights: 2,355,908 bytes of INT8 parameters |
| npuRAM3, npuRAM6 | — | 0 B | 448 KB each | Not used by MoveNet — available for other models |
| hyperRAM | 0x90000000 | 0 B | 16 MB | Not used by NPU — used by LCD framebuffers |
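These pools mirror the .mpool file passed via --load-mpool. A simplified sketch of what such a descriptor conveys, using the addresses from the table above; the field names are illustrative, not the exact ST schema:

```json
{
  "mempools": [
    { "name": "cpuRAM2",   "address": "0x34100000", "size": "1MB"   },
    { "name": "npuRAM4",   "address": "0x34270000", "size": "448KB" },
    { "name": "npuRAM5",   "address": "0x342E0000", "size": "448KB" },
    { "name": "octoFlash", "address": "0x70380000", "size": "61MB"  }
  ]
}
```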
The full report contains 87 layer entries. Here are the first 10 to show the format and what it reveals. The model starts with a QUANTIZE conversion layer, then proceeds through the MobileNetV2 backbone with alternating CONV_2D and DEPTHWISE_CONV_2D blocks.
- m_id is the layer index;
- oshape is the output tensor shape [batch, height, width, channels];
- param/size is the weight count / bytes (INT8);
- macc is the number of multiply-accumulate operations, the dominant cost of convolution.
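Illustratively, the rows look like this; layer names follow the description above, and the numeric values are elided placeholders, not figures from the report:

```
 m_id  name                oshape           param/size   macc
    0  QUANTIZE            [1,192,192,3]    ...          ...
    1  CONV_2D             [...]            ...          ...
    2  DEPTHWISE_CONV_2D   [...]            ...          ...
  ...
```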
The total of 242.9 million MACs completes in 22 ms on the NPU:
242.9 M MACs / 0.022 s ≈ 11 × 10⁹ MAC/s, i.e. ≈11 GOPS effective throughput,
well below the 600 GOPS peak due to memory bandwidth limits.
After a successful run, the output directory contains two categories of files: the C source files that will be compiled into the firmware, and the analysis reports that document what was generated.
network.c is 5,882 lines of auto-generated C. You will never edit it —
but understanding its three-part structure explains exactly how the NPU executes
your model at runtime. This is the level of detail that distinguishes a deployment
you truly understand from one you just ran a script for.
The first ~30 lines are comments documenting every memory pool the compiler
considered and its decision. Each pool has a score — the compiler's
fitness metric — and the pool with the highest score for each tensor wins:
npuRAM4/5 score 94 — fast SRAM directly connected to the NPU, zero cache overhead. OctoFlash scores 50 — high latency, but the only option for 2.9 MB of weights. The compiler places activations in npuRAM and weights in flash accordingly.
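For a feel of what that header looks like, a rough illustration; the comment layout and tensor grouping are invented, while the two scores are the ones quoted above:

```c
/* Memory pool decisions (illustrative, not verbatim tool output):
 *   activation buffers -> npuRAM4 / npuRAM5   score 94
 *   weight arrays      -> octoFlash           score 50
 */
```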
This is the most unusual part. The NPU does not execute layers through C function calls. Instead, each EC epoch is encoded as a binary blob — a sequence of 64-bit words that the NPU epoch controller interprets as microcode, loading them directly into hardware registers:
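A sketch of that encoding as it appears in network.c, with the array name and word values invented for illustration:

```c
#include <stdint.h>

/* EC epoch microcode: 64-bit configuration words streamed by the epoch
   controller into Neural-ART registers (all values illustrative). */
static const uint64_t g_epoch12_blob[] = {
    0x0000000034270000ULL,  /* activation buffer base in npuRAM4    */
    0x0000000070380000ULL,  /* weight base in octoFlash             */
    0x0000000000030301ULL,  /* e.g. 3x3 kernel, stride 1, padding 1 */
    /* ... */
};
```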
Each blob encodes the complete configuration for one NPU epoch: CONVACC unit selection, weight buffer address in OctoFlash, activation buffer in npuRAM, kernel size, stride, padding. The epoch controller loads these directly into NPU registers with zero CPU involvement — this is why EC epochs are orders of magnitude faster than SW epochs.
For the 4 SW epochs, the compiler generates standard C calls into the
ll_sw library, executed by the Cortex-M55 with Helium SIMD
(enabled by the --mvei flag in the compiler command):
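A sketch of the call pattern, with a hypothetical function name and signature; the actual ll_sw entry points and the buffer addresses they receive are bound by the generated code at build time:

```c
#include <stdint.h>

/* Hypothetical ll_sw entry point -- not the actual generated API. */
extern void ll_sw_resize_bilinear(const int8_t *in, int8_t *out,
                                  int in_h, int in_w,
                                  int out_h, int out_w, int channels);

static void run_sw_epoch_59(const int8_t *in, int8_t *out)
{
    /* First decoder upsample on the Cortex-M55, Helium SIMD via --mvei;
       the geometry values here are illustrative, not from the report. */
    ll_sw_resize_bilinear(in, out, 24, 24, 48, 48, 13);
}
```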
After epoch 74, the output is a float32 heatmap tensor in npuRAM5 —
ready for the C firmware postprocessor in
display_spe.c to decode
the argmax coordinates and draw the skeleton on the LCD.
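The decode step itself is simple. A self-contained sketch of per-channel argmax over the 48×48×13 float32 heatmap; the HWC buffer layout and the function name are assumptions, not the actual display_spe.c code:

```c
#define HM_W 48   /* heatmap width  (NN_WIDTH / 4)  */
#define HM_H 48   /* heatmap height (NN_HEIGHT / 4) */
#define HM_C 13   /* keypoint channels              */

/* Per-channel argmax over a float32 heatmap in HWC layout:
   hm[(y * HM_W + x) * HM_C + c]. Writes the best cell and its score
   for each keypoint channel. */
static void decode_keypoints(const float *hm,
                             int kp_x[HM_C], int kp_y[HM_C],
                             float kp_conf[HM_C])
{
    for (int c = 0; c < HM_C; c++) {
        float best = -1.0f;   /* scores are confidences in [0, 1] */
        kp_x[c] = 0;
        kp_y[c] = 0;
        for (int y = 0; y < HM_H; y++) {
            for (int x = 0; x < HM_W; x++) {
                float v = hm[(y * HM_W + x) * HM_C + c];
                if (v > best) { best = v; kp_x[c] = x; kp_y[c] = y; }
            }
        }
        kp_conf[c] = best;  /* compared against AI_POSE_PP_CONF_THRESHOLD */
    }
}
```

Since the heatmap is 192/4 on each side, the decoded (x, y) cell scales back to input coordinates by multiplying by 4.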
network.c is auto-generated and not in the stm32_docs source tree.
The handwritten firmware calls LL_ATON_RT_Main() in
main.c, which triggers the epoch controller to
execute all 75 blobs in sequence.
The only generated file your firmware includes directly is
app_config.h — documented below and in Part 2.
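A minimal sketch of that call site, assuming the Default instance naming used by ST's Neural-ART application templates; whether LL_ATON_RT_Main takes the network instance as an argument depends on the runtime version, so verify the symbols against your generated headers:

```c
#include "ll_aton_runtime.h"   /* ST Neural-ART runtime header */

LL_ATON_DECLARE_NAMED_NN_INSTANCE_AND_INTERFACE(Default);

int main(void)
{
    /* ... clock, DCMIPP camera, and LCD initialisation elided ... */
    for (;;) {
        /* One full inference: the epoch controller walks all 75 epoch
           blobs in sequence, EC epochs on the NPU, SW epochs on the M55. */
        LL_ATON_RT_Main(&NN_Instance_Default);
        /* ... postprocess heatmaps, draw the skeleton (display_spe.c) ... */
    }
}
```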
Separately from the network C files, gen_h_file.py generates
app_config.h, the configuration header that tells the C firmware
everything it needs to know about the model. These are the key defines
generated for our MoveNet deployment:
| Define | Value | Used by |
|---|---|---|
| POSTPROCESS_TYPE | POSTPROCESS_SPE_MOVENET_UF | Selects the argmax heatmap decoder in display_spe.c |
| NN_HEIGHT / NN_WIDTH | 192, 192 | DCMIPP crop window and NPU input buffer sizing |
| AI_POSE_PP_CONF_THRESHOLD | 0.4 | Keypoints below 40% confidence are not drawn |
| AI_POSE_PP_POSE_KEYPOINTS_NB | 13 | Selects the 13-keypoint skeleton (no head landmarks) |
| HEATMAP_WIDTH/HEIGHT | 48, 48 (=192/4) | Output heatmap resolution — argmax decoded per channel |
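Reassembled as source, this corresponds to an excerpt like the following; a reconstruction from the table above, so the literal formatting and the full set of defines may differ in the generated file:

```c
/* app_config.h (excerpt, reconstructed from the table above) */
#define POSTPROCESS_TYPE              POSTPROCESS_SPE_MOVENET_UF
#define NN_WIDTH                      192
#define NN_HEIGHT                     192
#define AI_POSE_PP_CONF_THRESHOLD     (0.4f)
#define AI_POSE_PP_POSE_KEYPOINTS_NB  (13)
#define HEATMAP_WIDTH                 (NN_WIDTH / 4)   /* 48 */
#define HEATMAP_HEIGHT                (NN_HEIGHT / 4)  /* 48 */
```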