STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for Neural Network Deployment on the STM32N6 NPU — Politecnico di Milano, 2024-2025
A step-by-step walkthrough of the complete deployment pipeline — what files go in, what comes out, what each tool does, and what can go wrong. This chapter connects Chapter 3 (the tools) to Part 2 (the code).
Before diving into each step, here is the big picture: what files enter each step and what files come out. Every arrow in this diagram corresponds to a file on disk that you can inspect.
Prerequisites (paths as installed on the reference machine):
- ST Edge AI Core: /opt/ST/STEdgeAI/2.1/
- STM32CubeIDE: /home/.../stm32cubeide_1.18.1_2/
- Model Zoo repositories: stm32ai-modelzoo/ and stm32ai-modelzoo-services/
- Python virtual environment: source st_zoo/bin/activate
The first decision is which model to deploy. If it comes from the ST Model Zoo, it is already quantized — set model_path directly to the .tflite file and operation_mode: deployment. If it is an external model (YOLOv8, TinyBERT), set operation_mode: chain_qd to quantize and deploy in one pass. Also set tools.stedgeai.path_to_stedgeai, tools.path_to_cubeIDE, and deployment.hardware_setup.board: STM32N6570-DK. See the annotated user_config.yaml in Section 3.2 for the full example.
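A minimal sketch of these fields (the exact nesting and the example model file name are assumptions; see Section 3.2 for the authoritative annotated file):

```yaml
general:
  model_path: ./models/movenet_lightning_int8.tflite   # hypothetical file name
operation_mode: deployment            # use chain_qd for an external float model
tools:
  stedgeai:
    path_to_stedgeai: /opt/ST/STEdgeAI/2.1/Utilities/linux/stedgeai
  path_to_cubeIDE: /home/.../stm32cubeide_1.18.1_2/stm32cubeide   # the binary itself, not its folder
deployment:
  hardware_setup:
    board: STM32N6570-DK
```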
stm32ai_main.py immediately calls parse_config.py to validate every field in user_config.yaml. If anything is wrong — a missing path, an unsupported board, an invalid quantization type — the pipeline stops here with a clear error message pointing to the exact field. This is the fail-fast design: no computation starts until the configuration is verified.
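The fail-fast idea can be illustrated with a minimal sketch. This is not the real parse_config.py; the field names come from the configuration described above, while the error strings and the sets of supported values are assumptions:

```python
# Minimal fail-fast validation sketch, not the actual Model Zoo validator.
from pathlib import Path

SUPPORTED_OP_MODES = {"deployment", "chain_qd"}   # assumption: only the modes used in this chapter
SUPPORTED_BOARDS = {"STM32N6570-DK"}

def validate(cfg: dict) -> None:
    """Stop at the first invalid field and name it exactly in the error."""
    model_path = Path(cfg["general"]["model_path"])
    if not model_path.exists():
        raise ValueError(f"general.model_path does not exist: {model_path}")
    if cfg["operation_mode"] not in SUPPORTED_OP_MODES:
        raise ValueError(f"operation_mode must be one of {sorted(SUPPORTED_OP_MODES)}")
    board = cfg["deployment"]["hardware_setup"]["board"]
    if board not in SUPPORTED_BOARDS:
        raise ValueError(f"deployment.hardware_setup.board is unsupported: {board}")
```

The point is the ordering: no model is loaded and no tool is launched before every field has been checked, so a typo costs seconds rather than minutes.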
Common validation errors:
- model_path does not exist → check the path relative to the pose_estimation/ folder
- tools.stedgeai.version does not match the installed binary → run stedgeai --version to check
- path_to_cubeIDE points to a directory, not the executable → it must point to the stm32cubeide binary directly
- keypoints: 13 with a 17-keypoint model → the mismatch causes wrong skeleton rendering
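The tools.stedgeai.version check can be sketched as a small helper; the exact format of the stedgeai --version output is an assumption here:

```python
# Hypothetical helper: compare the version pinned in user_config.yaml against
# what `stedgeai --version` prints (the CLI's output format is assumed).
import re

def version_matches(cli_output: str, pinned: str) -> bool:
    """Accept if the pinned version prefixes the version reported by the CLI."""
    m = re.search(r"(\d+\.\d+(?:\.\d+)?)", cli_output)
    return bool(m) and m.group(1).startswith(str(pinned))
```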
ST Edge AI Core is called automatically by common_deploy.py and takes 30–90 seconds depending on model size. The output folder experiments_outputs/YYYY_MM_DD_HH_MM_SS/ contains everything you need to inspect the result:
| Output file | What to check |
|---|---|
| network_generate_report.txt | Total epochs, EC vs SW count, memory usage per SRAM bank |
| C_header/app_config.h | Verify NN_HEIGHT, NN_WIDTH, KEYPOINTS_NB match your model |
| stm32ai_main.log | Check for warnings about unsupported operations or memory overflow |
| generated/network.c | 5,882 lines — only check if the build fails in step 4 |
Common errors at this step:
- a float (unquantized) model is rejected → use chain_qd to quantize first
- permission denied on the output folder → sudo chmod -R 755 experiments_outputs/
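Locating the most recent run folder and checking for the artifacts listed in the table can be scripted; a small sketch (the file names come from the table, everything else is illustrative):

```python
# Quick sanity check of the latest experiments_outputs run folder.
from pathlib import Path

def latest_run(root: str = "experiments_outputs") -> Path:
    """Run folders are timestamped YYYY_MM_DD_HH_MM_SS, so lexical sort is chronological."""
    runs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no runs under {root}/")
    return runs[-1]

EXPECTED = ["network_generate_report.txt", "stm32ai_main.log"]

def missing_outputs(run: Path) -> list:
    """Return the expected artifacts that are absent from this run."""
    return [name for name in EXPECTED if not (run / name).exists()]
```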
common_deploy.py first copies the generated files into the CubeIDE project (via the templates list in the .conf file), then runs three tools in sequence: STM32CubeIDE for the headless build, STM32SigningTool for the signed binary, and STM32CubeProgrammer twice (once for the firmware at 0x70100000, once for the weights at 0x70380000). The first build takes 2–5 minutes; subsequent builds are faster thanks to incremental compilation.
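The two flash passes can be sketched in dry-run form. The two addresses come from the text above; the STM32CubeProgrammer CLI name and flags shown are assumptions about a typical SWD setup, and the build and signing steps are omitted:

```python
# Dry-run sketch of the two STM32CubeProgrammer invocations common_deploy.py issues.
FW_ADDR = 0x70100000       # signed application firmware
WEIGHTS_ADDR = 0x70380000  # network weights blob

def flash_commands(fw_bin: str, weights_bin: str) -> list:
    """One STM32_Programmer_CLI invocation per region, firmware first."""
    return [
        ["STM32_Programmer_CLI", "-c", "port=SWD", "-w", fw_bin, hex(FW_ADDR)],
        ["STM32_Programmer_CLI", "-c", "port=SWD", "-w", weights_bin, hex(WEIGHTS_ADDR)],
    ]
```

Keeping the two regions separate is what makes a weights-only update possible without rebuilding the firmware.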
If a tool is not found during this step, add /opt/ST/STEdgeAI/2.1/Utilities/linux/ to PATH.

After flashing completes, toggle the boot switches to LEFT, power-cycle the board (unplug and replug USB), and wait 3–5 seconds. The LCD should show the camera preview with the skeleton overlay.
The C firmware execution starts in main.c: HAL init → DCMIPP start → inference loop → LL_ATON_RT_Main() (NPU) → display_spe.c (decode + draw). See Part 2 for the full annotated call chain.
We deployed three models of increasing architectural complexity. The table shows the key differences in the deployment configuration and the resulting performance.
| Model | Format | op_mode | model_type | Epochs | NPU % | Latency |
|---|---|---|---|---|---|---|
| MoveNet Lightning | .tflite INT8 | deployment | heatmaps_spe | 75 (71 EC + 4 SW) | 94.7% | 22 ms |
| YOLOv8n-pose | .tflite INT8 | chain_qd | yolo_mpe | 149 (131 EC + 18 SW) | 87.9% | 32 ms |
| TinyBERT | .onnx INT8 | chain_qd | — | 270 (174 EC + 96 SW) | 64.4% | >100 ms |
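The NPU % column is consistent with the epoch counts: it appears to be the share of epochs mapped to the NPU's epoch controller (EC) rather than falling back to software (SW). A one-line check (this interpretation is an inference from the numbers in the table, not a statement from the tools):

```python
def npu_share(ec: int, sw: int) -> float:
    """Percentage of epochs executed on the NPU (EC) out of all epochs."""
    return round(100 * ec / (ec + sw), 1)
```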