Multidisciplinary Project — A.Y. 2024–2025
Neural Network Deployment
on STM32N6 NPU
A study on deploying CNN and Transformer architectures on the
STM32N6570-DK board and its Neural-ART NPU — exploring
real-time pose estimation, INT8 quantization, and the limits
of embedded AI at the edge.
MoveNet • 94.7% NPU • 18ms
YOLOv8n • 87.9% NPU • 32ms
TinyBERT • 64.4% NPU • >100ms
STM32N6570-DK • 600 GOPS • 800 MHz
Authors
Giacomo Colosio
Sebastiano Colosio
Patrizio Acquadro
Tito Nicola Drugman
What is this repository?
This is the code documentation for our Multidisciplinary Project at
Politecnico di Milano (A.Y. 2024–2025), supervised by
Prof. Cristina Silvano and Dr. Marco Ronzani.
We deployed three neural network models of increasing architectural complexity
on the STM32N6570-DK development board, which features the
Neural-ART NPU — a 600 GOPS hardware accelerator designed for embedded AI.
Our goal was to understand how well a CNN-centric NPU handles both
convolutional models (MoveNet, YOLOv8n) and
Transformer-based models (TinyBERT), and to document the
complete deployment pipeline from a quantized model file to real-time
inference running on the board.
This documentation is designed to be read as a guide: it combines
conceptual explanations, step-by-step deployment instructions, and
fully annotated source code — so that anyone can reproduce our
results or build on them.
How this documentation is organised
The documentation is divided into three parts, accessible from the navigation
menu on the left or from the cards below.
Part 1 — Narrative Explanations
Six chapters that walk you through the theory, the hardware,
the tools, and our findings — no code required.
Start here if you want to understand the project from scratch.
Part 2 — Code Reference
23 fully annotated source files — 13 Python and 10 C/H —
with function signatures, parameter descriptions, call graphs,
and inline comments explaining every non-obvious decision.
Part 3 — Module Groups
Files grouped by layer: Firmware (all .c/.h)
and PythonPipeline (all .py).
Useful when you want to see a specific layer in isolation.
Part 1 — Narrative Explanations contains six chapters.
Introduction explains the context: why Edge AI, what an NPU is,
how CNNs and Transformers differ at the hardware level, and why INT8
quantization matters for embedded deployment.
Hardware describes the STM32N6570-DK board and its Neural-ART NPU
in detail — the 14 functional units, the SRAM banks, the memory map,
and the boot switch procedure.
Toolchain walks through the four software tools (Model Zoo,
ModelZoo Services, ST Edge AI Core, STM32CubeIDE) and how to set them up
from scratch.
Deployment Workflow describes the five-step pipeline from model
selection to on-board validation.
Case Studies presents the three deployments with full profiling
data: epoch counts, NPU offload rates, memory usage per SRAM bank, and
measured inference latency.
Results & Analysis compares the three models head-to-head
and discusses what the numbers tell us about the NPU’s design.
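The INT8 quantization discussed in the Introduction chapter can be illustrated with a minimal sketch. This is not the project's code: it shows the standard affine (scale + zero-point) scheme used by most INT8 toolchains, which trades 4-byte floats for 1-byte integers at the cost of a small rounding error.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine INT8 quantization: map the float range of x onto [-128, 127]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values from the INT8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

# A float32 tensor costs 4 bytes per value; its INT8 version costs 1 byte,
# and the round-trip error stays within one quantization step.
x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, s, zp = quantize_int8(x)
err = float(np.abs(dequantize_int8(q, s, zp) - x).max())
```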
Part 2 — Code Reference contains all 23 source files
documented with Doxygen. Every function has a @brief description,
@param and @return annotations, and inline comments
explaining every non-trivial step. For C functions, Doxygen automatically
generates call graphs (the functions a given function calls) and
caller graphs (the functions that call it) via Graphviz,
making the execution flow immediately visible without reading the code.
Part 3 — Module Groups provides a thematic entry point:
the Firmware group collects all C/H files that run on the board,
and the PythonPipeline group collects all Python files that run
on the host PC. This is useful when you want to understand a single layer
without navigating the full file list.
Recommended reading path
If you are approaching this project for the first time, we suggest following
this order. Each step builds on the previous one, so that by the time you
reach the code you already have the mental model to understand it.
Introduction & Hardware
Part 1 — Ch. 1–2
Understand why MoveNet reaches 94.7% NPU offload while TinyBERT
only reaches 64.4% — the answer is in the NPU architecture.
Toolchain & Deployment Workflow
Part 1 — Ch. 3–4
Learn what each tool does and in what order.
After this, every line in the Python pipeline will make sense.
Python pipeline files
Part 2 — Namespaces
Follow the thread:
stm32ai_main.py → deploy.py → gen_h_file.py → common_deploy.py.
This is the chain that takes a YAML file and produces a flashed board.
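The pattern behind that chain, a configuration file selecting an ordered list of pipeline steps, can be sketched as follows. The keys ("operation_mode", "model", "board") and the step functions are hypothetical stand-ins, not the project's actual schema; see stm32ai_main.py for the real entry point, which parses the YAML itself.

```python
# Illustrative sketch of a config-driven deployment dispatch.
# Step functions and config keys are hypothetical; in the real pipeline
# the dict below would come from parsing the user's YAML file.

def quantize(cfg):      return f"quantized {cfg['model']}"
def generate_code(cfg): return f"generated C for {cfg['model']}"
def flash(cfg):         return f"flashed {cfg['board']}"

PIPELINES = {
    "deployment": [quantize, generate_code, flash],
    "benchmark":  [quantize, generate_code],
}

def run(cfg: dict) -> list:
    """Run, in order, each step of the pipeline selected by operation_mode."""
    return [step(cfg) for step in PIPELINES[cfg["operation_mode"]]]

log = run({"operation_mode": "deployment",
           "model": "movenet_int8.tflite",
           "board": "STM32N6570-DK"})
```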
C firmware files
Part 2 — Files
Start with app_config.h (what was configured),
then main.c (the inference loop),
then app_camerapipeline.c and display_spe.c.
Case Studies & Results
Part 1 — Ch. 5–6
Connect the code to the real numbers:
full profiling tables, NPU epoch breakdowns,
and a cross-model comparison that explains our key finding.
Call graphs — deep dive on any function
Part 2 — any function page
For any function that is not immediately clear, open its page in
Part 2 and scroll to the call graph — it shows exactly
where it is called from and what it calls, without reading
a single line of code.
Supervised by:
Prof. Cristina Silvano and Dr. Marco Ronzani —
Advanced Computer Architectures (088949),
Dipartimento di Elettronica, Informazione e Bioingegneria,
Politecnico di Milano.