STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for neural network deployment on the STM32N6 NPU, academic year 2024-2025.
23 source files — 13 Python and 10 C — documented with
@brief, @param, @return,
inline comments, and auto-generated call graphs.
This page shows the execution flow of each layer so you can navigate
directly to the file you need.
Runs on the host PC. Triggered by
python stm32ai_main.py operation_mode=deployment.
Execution flows top to bottom; each file calls the next.
Decorated with @hydra.main. Reads
user_config.yaml, dispatches to the correct
operation mode (deployment, chain_qd,
training, ...). The only file the user ever calls directly.
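The dispatch pattern can be sketched as a minimal stand-in, without Hydra itself (the real entry point is decorated with @hydra.main and reads user_config.yaml); the handler names here are illustrative, only the mode names come from this page:

```python
# Hypothetical handlers standing in for the real per-mode functions.
def deploy(cfg):   return f"deploying {cfg['model']}"
def train(cfg):    return f"training {cfg['model']}"
def chain_qd(cfg): return f"quantize+deploy {cfg['model']}"

MODES = {"deployment": deploy, "training": train, "chain_qd": chain_qd}

def main(cfg):
    # Dispatch on operation_mode, as stm32ai_main.py does.
    mode = cfg["operation_mode"]
    if mode not in MODES:
        raise ValueError(f"unknown operation_mode: {mode!r}")
    return MODES[mode](cfg)
```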
Validates every section of user_config.yaml using
Hydra + OmegaConf: checks that all fields are legal and all required
ones present, sets defaults for missing optional ones, and raises clear
errors pointing to the exact invalid field. Returns a
DefaultMunch config object used as a global reference
by all downstream functions.
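A minimal sketch of that validation pattern, with illustrative field names (the real parser returns a DefaultMunch; a tiny attribute-dict stands in for it here):

```python
class AttrDict(dict):
    """Stand-in for DefaultMunch: dict with attribute access."""
    __getattr__ = dict.__getitem__

# Hypothetical schema for one config section.
REQUIRED = {"model_path"}
DEFAULTS = {"confidence_threshold": 0.4}

def parse_section(raw: dict) -> AttrDict:
    missing = REQUIRED - raw.keys()
    if missing:
        raise ValueError(f"user_config.yaml: missing required field(s): {sorted(missing)}")
    unknown = raw.keys() - REQUIRED - DEFAULTS.keys()
    if unknown:
        raise ValueError(f"user_config.yaml: unknown field(s): {sorted(unknown)}")
    # Fill defaults for missing optional fields, keep user values.
    return AttrDict({**DEFAULTS, **raw})
```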
Orchestrates the deployment steps: generates app_config.h,
selects the board .conf file, calls
stm32ai_deploy_stm32n6() in common_deploy.py
which invokes ST Edge AI Core and STM32CubeIDE in sequence.
Reads tensor shapes from the quantized TFLite model using the
TFLite interpreter. Writes app_config.h with
NN_HEIGHT, NN_WIDTH, KEYPOINTS_NB, POSTPROCESS_TYPE,
and confidence thresholds.
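The header-generation step can be sketched as plain string templating. In the real pipeline the shapes come from the TFLite interpreter's input/output details; here they are passed in directly, with the values quoted elsewhere on this page:

```python
def render_app_config(height, width, keypoints, pp_type, conf_thr):
    """Render an app_config.h-style header from model metadata.

    Sketch only: the macro list mirrors the constants described in
    this documentation, not the full generated file.
    """
    lines = [
        "/* Auto-generated -- do not edit by hand. */",
        f"#define NN_HEIGHT {height}",
        f"#define NN_WIDTH {width}",
        f"#define AI_POSE_PP_POSE_KEYPOINTS_NB {keypoints}",
        f"#define POSTPROCESS_TYPE {pp_type}",
        f"#define AI_POSE_PP_CONF_THRESHOLD ({conf_thr}f)",
    ]
    return "\n".join(lines) + "\n"
```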
Calls ST Edge AI Core CLI, copies generated files into the CubeIDE
project via the .conf templates list, then invokes
STM32CubeIDE headless build, SigningTool, and CubeProgrammer flash.
Patches the C source files and linker script to place the weight binary
(network_atonbuf.xSPI2.raw) at the correct OctoFlash address
(0x70380000). Without this patch, the firmware would look for
weights at the wrong memory address and crash at boot.
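The patch itself is a text substitution on the linker script. A sketch, assuming a hypothetical WEIGHTS memory region (the region name and regex are illustrative; only the target address 0x70380000 comes from this page):

```python
import re

WEIGHT_ADDR = 0x70380000  # OctoFlash address of network_atonbuf.xSPI2.raw

def patch_linker_script(text: str) -> str:
    """Rewrite the ORIGIN of a hypothetical WEIGHTS memory region."""
    return re.sub(
        r"(WEIGHTS\s*\(r\)\s*:\s*ORIGIN\s*=\s*)0x[0-9A-Fa-f]+",
        lambda m: m.group(1) + hex(WEIGHT_ADDR),
        text,
    )
```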
Post-Training Quantization: converts the float32 model to INT8.
Routes to the TFLite Converter (for .h5 models such as YOLOv8)
or the ONNX quantizer (for .onnx models such as TinyBERT).
The inner _representative_data_gen() feeds calibration
samples to determine the scale and zero-point of each layer.
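The scale/zero-point derivation can be shown with the standard affine INT8 formula, computed from the min/max observed over the calibration samples (a sketch of the math, not the converter's internals):

```python
def quant_params(t_min, t_max, qmin=-128, qmax=127):
    """Per-tensor INT8 affine quantization parameters from observed range."""
    t_min, t_max = min(t_min, 0.0), max(t_max, 0.0)  # range must cover 0
    scale = (t_max - t_min) / (qmax - qmin)
    zero_point = round(qmin - t_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float value to its INT8 code, clamped to [qmin, qmax]."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def representative_data_gen(samples):
    # The TFLite converter consumes a generator yielding one
    # list-wrapped calibration sample at a time.
    for s in samples:
        yield [s]
```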
Runs on the STM32N6570-DK board after flashing. Execution starts at reset and loops continuously — capturing a frame, running inference, rendering the result, repeat.
HAL and clock initialisation, DCMIPP start, LCD setup.
Then enters the main loop: waits for a camera frame snapshot,
calls LL_ATON_RT_Main() to trigger the NPU epoch
controller, then calls the postprocessor to decode results and
render the skeleton on the LCD.
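The control flow of that loop, written as a Python stand-in for the C code (function names are placeholders mirroring the description above; the real loop runs forever):

```python
def main_loop(get_frame, run_npu, postprocess, render, frames):
    """Capture -> infer -> decode -> render, repeated `frames` times."""
    results = []
    for _ in range(frames):
        frame = get_frame()           # DCMIPP snapshot into npuRAM4
        run_npu(frame)                # LL_ATON_RT_Main() epoch controller
        kpts = postprocess()          # decode heatmaps from npuRAM5
        results.append(render(kpts))  # draw the skeleton on the LCD
    return results
```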
The only generated file included directly by the handwritten C firmware.
Contains all model-specific constants: NN_HEIGHT=192,
NN_WIDTH=192, AI_POSE_PP_POSE_KEYPOINTS_NB=13,
POSTPROCESS_TYPE=POSTPROCESS_SPE_MOVENET_UF,
AI_POSE_PP_CONF_THRESHOLD=0.4.
Every other C file reads from this header — change the model,
regenerate this file, recompile.
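Assembled from the constants quoted above, the generated header looks roughly like this (include guard and ordering illustrative):

```c
/* app_config.h -- auto-generated; regenerate after changing the model. */
#ifndef APP_CONFIG_H
#define APP_CONFIG_H

#define NN_HEIGHT 192
#define NN_WIDTH  192
#define AI_POSE_PP_POSE_KEYPOINTS_NB 13
#define POSTPROCESS_TYPE POSTPROCESS_SPE_MOVENET_UF
#define AI_POSE_PP_CONF_THRESHOLD (0.4f)

#endif /* APP_CONFIG_H */
```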
Configures the MIPI CSI-2 interface and the DCMIPP hardware block. Sets up two simultaneous output pipes: the display pipe continuously writes full-resolution frames to the PSRAM LCD framebuffer, while the NN pipe delivers a cropped, resized snapshot to npuRAM4 when triggered. Both pipes run in hardware — zero CPU involvement for memory transfers.
Not a file you wrote — part of the ll_aton runtime library injected
by common_deploy.py. Executes all 75 epochs in sequence:
loads the EC blob for each of the 71 NPU epochs into hardware registers,
calls ll_sw_* functions for the 4 SW epochs.
Input: npuRAM4 (192×192×3 camera frame).
Output: npuRAM5 (48×48×13 heatmaps, float32).
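The scheduling logic reduces to a dispatch over the epoch list. A Python stand-in (field and callback names are illustrative; the 71 NPU + 4 SW split comes from this page):

```python
def run_epochs(epochs, load_blob, run_sw):
    """Run each epoch in sequence: NPU blobs to hardware, SW epochs on CPU."""
    for ep in epochs:
        if ep["kind"] == "npu":
            load_blob(ep["blob"])  # EC blob -> hardware registers
        else:
            run_sw(ep["fn"])       # ll_sw_* software fallback
```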
Reads float32 heatmaps from npuRAM5.
Finds argmax per channel (48×48) to get keypoint coordinates.
Filters by confidence threshold from app_config.h.
Draws skeleton lines on LCD foreground layer using
connectivity table from display_keypoints_13.h.
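The decoding step above can be sketched in Python: a per-channel argmax over the H×W grid, then the confidence filter (the flat [y][x][channel] layout is an assumption for illustration; the threshold 0.4 is the documented default):

```python
def decode_heatmaps(heatmaps, h, w, n_kpts, conf_thr=0.4):
    """Per-channel argmax over an h*w grid of float scores.

    heatmaps: flat list laid out [y][x][channel].
    Returns one (x, y, confidence) per keypoint, or None if below threshold.
    """
    kpts = []
    for c in range(n_kpts):
        best, bx, by = -1.0, 0, 0
        for y in range(h):
            for x in range(w):
                v = heatmaps[(y * w + x) * n_kpts + c]
                if v > best:
                    best, bx, by = v, x, y
        kpts.append((bx, by, best) if best >= conf_thr else None)
    return kpts
```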
Reads YOLOv8 detection output (bounding boxes + 17 keypoints per person). Applies NMS filtering by confidence and IoU thresholds. Draws bounding boxes and 17-keypoint COCO skeletons on LCD for all detected persons simultaneously.
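The NMS filtering can be sketched as greedy suppression by IoU after a confidence cut (box format and threshold values are illustrative; per-person keypoints are omitted for brevity):

```python
def _area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, conf_thr=0.5, iou_thr=0.5):
    """Greedy NMS over (box, score) pairs, highest score first."""
    dets = sorted((d for d in dets if d[1] >= conf_thr),
                  key=lambda d: d[1], reverse=True)
    keep = []
    for box, score in dets:
        if all(iou(box, k[0]) < iou_thr for k in keep):
            keep.append((box, score))
    return keep
```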