STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for Neural Network Deployment on STM32N6 NPU - Politecnico di Milano 2024-2025
Part 3 — Module Groups
Code Organised by Layer
Part 3 categorizes the project files by their architectural domain.
Rather than tracing the execution flow, this view separates the host-side Python orchestration from the target-side C firmware, allowing you to examine the responsibilities of each subsystem independently.
Firmware — 10 C/H files
Python Pipeline — 13 Python files
🕑 Part 2 — Execution Flow
Maps the sequence of operations across the deployment pipeline. Use this to trace the call graph and understand the step-by-step lifecycle of the process.
🏛 Part 3 — Architectural Layers
Groups components by their deployment target (Host PC vs. STM32 target). Use this to study the isolated subsystems and understand the distinct scope of each environment.
Two layers — one project
STM32N6570-DK Firmware Application
Complete firmware for real-time pose estimation on STM32N6
This module contains the C source and header files compiled into the target firmware flashed to the STM32N6570-DK.
Executing entirely on the Cortex-M55 CPU (800 MHz), this stack orchestrates the end-to-end real-time application:
managing MIPI CSI-2 camera acquisition, triggering the Neural-ART NPU via the ll_aton runtime, decoding the neural network outputs, and driving the LCD display.
Generated automatically by the Python pipeline (gen_h_file.py), this is the critical configuration bridge between the host and target.
It defines network dimensions, keypoint counts, post-processing modes, and confidence thresholds. When the model changes, regenerating this file automatically updates the C logic.
The top of the C call graph. It handles Hardware Abstraction Layer (HAL), clock, and peripheral initialization before starting the camera interface.
It hosts the primary infinite loop: triggering a camera snapshot, executing the NPU via LL_ATON_RT_Main(), and dispatching the appropriate post-processing and rendering routines.
Configures the MIPI CSI-2 interface and the DCMIPP peripheral.
It implements a zero-CPU-overhead DMA architecture utilizing two independent pipes:
a display pipe continuously streams background frames directly to PSRAM for the LCD, while the NN pipe routes cropped snapshots to npuRAM4 strictly on demand.
The dedicated MoveNet post-processor.
It reads float32 heatmaps (48×48×13) generated by the NPU in npuRAM5, computes the argmax per channel, applies the confidence threshold, and renders the 13-keypoint skeleton to the LCD foreground.
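The per-channel argmax decode described above can be sketched in a few lines of NumPy. This is a hedged host-side illustration of the same logic (the function name `decode_heatmaps` and the return layout are assumptions, not the firmware's actual API):

```python
import numpy as np

def decode_heatmaps(heatmaps: np.ndarray, threshold: float = 0.3):
    """Per-channel argmax over an (H, W, K) float32 heatmap tensor.

    Returns one entry per keypoint: (x, y, score), or None when the
    peak confidence falls below `threshold` (illustrative layout).
    """
    h, w, k = heatmaps.shape
    keypoints = []
    for c in range(k):
        channel = heatmaps[:, :, c]
        flat_idx = int(np.argmax(channel))   # index of the peak value
        y, x = divmod(flat_idx, w)           # row-major unflatten
        score = float(channel[y, x])
        keypoints.append((x, y, score) if score >= threshold else None)
    return keypoints
```

For the 48×48×13 MoveNet tensor this yields up to 13 candidate keypoints, with sub-threshold channels suppressed before rendering.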
The dedicated YOLOv8n-pose post-processor.
It parses the dense detection outputs (bounding boxes plus 17 keypoints per person), applies Non-Maximum Suppression (NMS) filtering, and simultaneously renders bounding boxes and full COCO-standard skeletons for all detected individuals.
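Greedy NMS of the kind applied here can be sketched as follows; this is an illustrative standalone version (the detection dict layout and threshold default are assumptions, not the firmware's data structures):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones.

    `detections` is a list of dicts with 'box' (x1, y1, x2, y2) and
    'score'; extra payload (e.g. the person's 17 keypoints) rides along.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) < iou_thresh for k in kept):
            kept.append(det)
    return kept
```

Because the keypoints travel inside each detection dict, suppressing a duplicate box also discards its redundant skeleton in one step.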
Image cropping and resizing utilities utilized by the neural network pipeline to extract the precise Region of Interest (ROI) from the full camera frame before loading it into the NPU input buffers.
Static skeleton connectivity mapping tables.
These arrays define which keypoint indices are connected by lines to draw the skeletal structure (13 points for MoveNet, which omits several head landmarks; 17 points for the standard COCO format used by YOLOv8).
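A connectivity table of this kind is just a list of index pairs consumed by a line-drawing loop. The sketch below uses the widely published 17-keypoint COCO skeleton for illustration (the project's 13-point MoveNet table is shorter and is not reproduced here; `draw_skeleton` is a hypothetical helper):

```python
# Standard COCO 17-keypoint skeleton, 0-indexed (shown for illustration;
# the project's MoveNet table omits the facial landmarks).
COCO_SKELETON = [
    (15, 13), (13, 11), (16, 14), (14, 12),   # legs
    (11, 12),                                  # hips
    (5, 11), (6, 12),                          # torso sides
    (5, 6),                                    # shoulders
    (5, 7), (7, 9), (6, 8), (8, 10),           # arms
    (1, 2), (0, 1), (0, 2), (1, 3), (2, 4),    # face
    (3, 5), (4, 6),                            # ears to shoulders
]

def draw_skeleton(keypoints, edges, draw_line):
    """Connect keypoint pairs: `keypoints` holds (x, y) or None for
    suppressed points; `draw_line` is the rendering callback."""
    for a, b in edges:
        if keypoints[a] is not None and keypoints[b] is not None:
            draw_line(keypoints[a], keypoints[b])
```

Edges touching a suppressed (below-threshold) keypoint are simply skipped, so partial skeletons degrade gracefully on screen.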
Global dependencies. main.h centralizes global typedefs, peripheral handle declarations, and external function prototypes.
utils.h provides lightweight, inline helper macros for hardware timing, debug logging, and memory buffer alignment.
Python Deployment Pipeline
Python scripts orchestrating the full model-to-firmware pipeline
This module contains the Python infrastructure executed on the host PC.
Organized by functional role, these scripts handle the entire lifecycle: from dataset parsing and neural network quantization to automated code generation and headless flashing of the STM32 target.
The primary entry point. Utilizing the Hydra configuration framework (via the @hydra.main decorator), it parses the user's YAML configuration and dispatches execution to the appropriate pipeline mode (quantization, evaluation, or deployment).
The strict validation engine. It enforces a fail-fast design by thoroughly validating every field and dependency within user_config.yaml before any processing begins.
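A fail-fast validator of this style checks every field up front and reports all problems at once, rather than dying on the first bad value mid-pipeline. The sketch below illustrates the pattern with hypothetical field names (`mode`, `model_path`, `confidence_threshold` are assumptions, not the project's real schema):

```python
def validate_config(cfg: dict) -> dict:
    """Fail-fast sketch: validate every field before any work starts,
    collecting all errors so the user can fix them in one pass."""
    errors = []
    mode = cfg.get("mode")
    if mode not in {"quantization", "evaluation", "deployment"}:
        errors.append(f"mode must be quantization/evaluation/deployment, got {mode!r}")
    model = cfg.get("model_path", "")
    if not model.endswith((".h5", ".onnx", ".tflite")):
        errors.append(f"model_path must end in .h5/.onnx/.tflite, got {model!r}")
    thr = cfg.get("confidence_threshold", 0.0)
    if not 0.0 <= thr <= 1.0:
        errors.append(f"confidence_threshold must lie in [0, 1], got {thr}")
    if errors:
        raise ValueError("invalid user_config.yaml:\n  " + "\n  ".join(errors))
    return cfg
```

Raising before any model conversion or flashing step runs is what makes the design "fail-fast": a bad configuration can never leave the build half-done.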
The high-level deployment orchestrator. It manages the build sequence: triggering C header generation, invoking the ST Edge AI tools, and coordinating the headless build process via STM32CubeIDE.
The external tool interface. It handles direct command-line invocations for the ST Edge AI Core CLI, the STM32CubeIDE headless builder, the ST Signing Tool, and STM32CubeProgrammer for automated board flashing.
The critical configuration generator. It dynamically reads tensor shapes and model parameters from the input model (e.g., TFLite) and writes app_config.h. This is the primary mechanism by which the Python pipeline dictates the C firmware's runtime behavior.
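The header-generation step can be sketched as a small renderer that turns a dict of model parameters into C `#define` lines. This is an illustrative outline only; the macro names and float-suffix convention below are assumptions, not the actual contents of app_config.h:

```python
def render_app_config(params: dict) -> str:
    """Render an app_config.h-style header from model parameters
    (e.g. tensor shapes read off a TFLite file). Macro names are
    illustrative."""
    lines = [
        "/* Auto-generated by the deployment pipeline - do not edit. */",
        "#ifndef APP_CONFIG_H",
        "#define APP_CONFIG_H",
        "",
    ]
    for name, value in params.items():
        if isinstance(value, float):
            lines.append(f"#define {name} ({value}f)")  # C float literal
        else:
            lines.append(f"#define {name} ({value})")
    lines += ["", "#endif /* APP_CONFIG_H */"]
    return "\n".join(lines)
```

Because the firmware only ever reads these macros, regenerating the header is sufficient to retarget the C post-processing to a new model geometry without touching the C sources.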
Memory allocation utility. It programmatically patches the C project's linker script to ensure the generated model weight binary (model_weights.bin) is correctly placed at 0x70380000 in the target's Octo-SPI Flash memory.
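Patching a linker script programmatically usually amounts to a targeted regex substitution on a memory-region line. The sketch below shows the idea under stated assumptions: the region name `WEIGHTS` and the script's exact syntax are hypothetical, while the 0x70380000 base address comes from the description above:

```python
import re

def patch_linker_script(text: str, base_addr: int = 0x70380000) -> str:
    """Rewrite the ORIGIN of a hypothetical WEIGHTS memory region so
    model_weights.bin lands at `base_addr` in Octo-SPI flash.
    Real linker scripts differ; the region name is illustrative."""
    pattern = r"(WEIGHTS\s*\(r\)\s*:\s*ORIGIN\s*=\s*)0x[0-9A-Fa-f]+"
    return re.sub(pattern, rf"\g<1>0x{base_addr:08X}", text)
```

Editing only the `ORIGIN` field keeps the rest of the script byte-identical, which is important when the same project file is also maintained by hand in STM32CubeIDE.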
Model loading and hardware-in-the-loop interface. It provides utilities for loading quantized models and interfacing with the ST Edge AI runner for direct, on-board evaluation using the stedgeai_n6 target mode.
The Post-Training Quantization (PTQ) engine. It converts floating-point models to INT8 precision, supporting both TFLite pipelines (for .h5 models) and ONNX pipelines (for .onnx models).
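The arithmetic at the heart of PTQ can be shown with a symmetric per-tensor INT8 example. This is a minimal sketch of the underlying math only; real toolchains such as the TFLite and ONNX quantizers add per-channel scales, zero points, and calibration over a representative dataset:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]
    with a single scale = max(|w|) / 127 (illustrative sketch)."""
    scale = float(np.max(np.abs(weights))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is bounded by scale/2."""
    return q.astype(np.float32) * scale
```

The round-trip error per weight is at most half a quantization step, which is why INT8 typically costs only a small accuracy drop while quartering the weight footprint for the NPU.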
Dataset preparation pipeline. It handles image loading, rescaling operations, and the specific normalization steps required for single-image network inputs.
The ground-truth ingestion module. Responsible for parsing complex COCO-format JSON annotations, batch-loading images, and normalizing keypoint coordinates for evaluation.
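The COCO annotation format stores each person's keypoints as a flat `[x1, y1, v1, x2, y2, v2, ...]` array, where v is the visibility flag (0 = unlabeled, 1 = labeled but occluded, 2 = visible). A minimal parsing sketch (the function name and per-image grouping are illustrative, not the module's real API):

```python
import json

def parse_coco_keypoints(annotation_json: str):
    """Group COCO person annotations by image_id, unflattening each
    keypoints array into (x, y, v) triplets."""
    data = json.loads(annotation_json)
    by_image = {}
    for ann in data.get("annotations", []):
        flat = ann["keypoints"]
        triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
        by_image.setdefault(ann["image_id"], []).append(triplets)
    return by_image
```

Keeping the visibility flag attached to every triplet matters downstream: the evaluation metrics only score keypoints whose ground truth is actually labeled.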
The host-side evaluation decoders. These mirror the C firmware's decoding logic in Python, implementing heatmap argmax extraction for MoveNet and Non-Maximum Suppression (NMS) for YOLO models.
The model inference runner. Executes predictions across the test dataset, supporting multiple execution contexts: local CPU (host), ST Edge AI host simulation (stedgeai_host), or hardware-in-the-loop on the NPU (stedgeai_n6).
The statistical evaluation engine. Calculates standardized accuracy metrics, including Object Keypoint Similarity (OKS) for MoveNet and mean Average Precision (mAP@[0.5:0.95]) for YOLOv8, adhering strictly to the COCO evaluation protocol.
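The OKS metric named above follows the COCO definition: a Gaussian similarity per keypoint, averaged over the keypoints labeled in the ground truth. A minimal sketch (the argument layout is an assumption; COCO additionally defines standard per-keypoint k constants):

```python
import math

def oks(pred, gt, area, kappas):
    """Object Keypoint Similarity per the COCO protocol:
    mean over labeled keypoints of exp(-d_i^2 / (2 * s^2 * k_i^2)),
    where d_i is the prediction-to-ground-truth distance, s^2 is the
    object's segment area, and k_i is the per-keypoint constant."""
    num, den = 0.0, 0
    for (px, py), (gx, gy, v), k in zip(pred, gt, kappas):
        if v > 0:  # only keypoints labeled in the ground truth count
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2.0 * area * k ** 2))
            den += 1
    return num / den if den else 0.0
```

OKS plays the role IoU plays for boxes: thresholding it at 0.5 through 0.95 yields the per-threshold APs that are averaged into mAP@[0.5:0.95] for the YOLOv8 evaluation.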