STM32N6 NPU Deployment — Politecnico di Milano, version 1.0
Documentation for Neural Network Deployment on the STM32N6 NPU, Politecnico di Milano, 2024-2025
A guided journey from the fundamentals of embedded computing and artificial intelligence to the practical challenge of running a neural network on a microcontroller in real time.
We begin by understanding the field we are working in: Edge Computing. According to Karim Arabi (IEEE DAC, 2014), edge computing encompasses "all computation outside the cloud that occurs at the edges of the network and, more specifically, in applications where real-time data processing is required."
In plain terms: edge computing is computation that happens physically close to where data is generated — inside a sensor, a camera, or a smartphone — rather than being sent to a remote server (the cloud) for processing. The key benefit is immediacy: when the computation happens locally, there is no network delay, no dependency on connectivity, and no privacy risk from transmitting raw data.
What makes edge computing physically possible is the existence of embedded systems — the hardware on which local computation runs. An embedded system is a computing system designed to perform one specific function within a larger apparatus. Unlike a PC — which must be able to do everything — an embedded system is built for a single purpose: controlling a drone, managing ABS braking, or monitoring reactor temperature.
Depending on the complexity of the task, the designer chooses between two main types of processing unit: a Microcontroller (MCU) or a Microprocessor (MPU). Understanding the difference is essential, because the protagonist of our project — the STM32N6570-DK — is an MCU.
A microcontroller integrates an entire computing system on a single silicon chip. Think of it as a tiny city: different districts handle different responsibilities, connected by internal roads (buses) along which data travels. Here are the main components and their roles — described with the analogy we found most useful during our own learning.
| Component | Analogy | Role |
|---|---|---|
| CPU | The Manager | Reads instructions, coordinates all other components, decides what runs when. |
| ALU | The Integer Accountant | Handles integer arithmetic (add, subtract) and logical operations (AND, OR, NOT). |
| FPU | The Decimal Specialist | Handles floating-point numbers with precision. Critical for neural network output decoding (postprocessing). |
| SRAM | The Desk | Fast working memory for data currently in use. Volatile — cleared on power off. On the STM32N6: 4.2 MB split across banks. |
| Flash | The Secure Archive | Non-volatile storage for firmware and model weights. On the STM32N6: 128 MB OctoFlash — holds all three of our models. |
| Cache L1 | The Jacket Pocket | Tiny, ultra-fast memory right next to the CPU. Stores frequently accessed data to avoid going back to SRAM every time. |
| TCM | The Express Lane | Memory directly wired to the CPU, bypassing the AXI bus. Used for interrupt handlers and time-critical code. |
| Peripherals | The Senses | GPIO, MIPI CSI-2 (camera), LTDC (display), XSPI (external flash) — connect the chip to the physical world. |
| NPU | The AI Expert | Built specifically for neural network inference. While the CPU struggles with billions of multiplications per second, the NPU is designed exactly for this. On the STM32N6: 600 GOPS INT8 — this is the star of our project. |
Artificial Intelligence (AI) is the broad field concerned with building systems that can perform tasks that would normally require human intelligence: recognising objects, understanding language, making decisions under uncertainty. The term dates back to 1956 but has seen explosive growth in the last decade, driven almost entirely by one of its subfields: Machine Learning.
Machine Learning (ML) is the subfield of AI where systems learn from data rather than following hand-written rules. Instead of a programmer writing "if the image contains pointy ears and fur, it is a cat", a machine learning system is shown thousands of cat images and learns on its own which patterns matter.
The learning happens through a process called training: the system is repeatedly shown examples, makes a prediction, receives feedback on how wrong it was (the loss), and adjusts its internal parameters to do better next time. After thousands of iterations, the parameters converge to values that make the system perform well on new, unseen data. These final parameter values are what we call a model.
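As a concrete illustration of the predict, measure, adjust loop described above, here is a minimal Python sketch: a one-parameter model fitted by gradient descent on toy data. Everything in it (the data, the learning rate, the squared loss) is illustrative and not part of this project.

```python
# Minimal sketch of the training loop: predict, measure the loss,
# adjust parameters. The "model" is a single weight w; the dataset
# is pairs (x, 2x), so training should converge toward w ≈ 2.
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0            # the model's only parameter, arbitrarily initialised
lr = 0.01          # learning rate: how big each correction step is

for epoch in range(200):
    for x, target in data:
        pred = w * x                 # forward pass: make a prediction
        error = pred - target        # how wrong was it?
        grad = 2 * error * x         # gradient of the squared loss w.r.t. w
        w -= lr * grad               # adjust the parameter

print(round(w, 3))   # converges close to 2.0
```

After enough iterations the parameter settles on a value that fits the data; that final value is, in miniature, what the document calls a model.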
Deep Learning is a branch of Machine Learning that uses artificial neural networks — computational models loosely inspired by the structure of the human brain — as their learning architecture. The word "deep" refers to the many layers these networks are composed of: each layer transforms the data slightly, and together they build up increasingly abstract representations.
Not all neural networks are built the same way. Two architectures dominate modern deep learning — and understanding their structural difference is the key to understanding our results:
Convolutional Neural Networks (CNNs) process data through sliding convolutional filters that scan the input spatially. Each filter detects a local pattern (an edge, a texture, a shape). Layers are stacked to detect increasingly complex features.
Transformers process data through self-attention: every element in the input attends to every other element simultaneously, capturing global dependencies regardless of distance. Originally designed for text, they are now dominant in language models.
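To make the sliding-filter idea concrete, here is a minimal Python sketch of a 2D convolution: a hand-written 3x3 vertical-edge kernel (illustrative values, not taken from any real model) swept across a tiny image.

```python
# Sketch of the "sliding filter" idea behind CNNs: a small kernel is
# swept across the input, producing one output value per position.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
kernel = [        # responds strongly where left and right columns differ
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def conv2d(img, ker):
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the window by the kernel and sum
            out[i][j] = sum(
                img[i + a][j + b] * ker[a][b]
                for a in range(kh) for b in range(kw)
            )
    return out

print(conv2d(image, kernel))
```

The filter fires strongly wherever the image changes from dark to bright, which is exactly the "local pattern detector" behaviour described above; a real CNN learns the kernel values during training instead of having them hand-written.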
Training a neural network from scratch requires enormous datasets, powerful GPUs, and days or weeks of computation. In most real-world applications — and in this project — we skip training entirely and use pretrained models: models that have already been trained by research teams on large datasets (ImageNet, COCO) and whose weights have been made publicly available.
Once a model is trained, using it to make predictions on new data is called inference. Inference is computationally much cheaper than training — there is no backpropagation, no gradient computation, no weight update. The model's weights are frozen; we only run the forward pass. This is what our board does: it receives a camera frame and runs inference to produce keypoint coordinates — all in under 32 ms.
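To make "forward pass only" concrete, here is a toy Python sketch: the weights are fixed constants, there is no gradient or update step, the program just evaluates the layer on new input. The weights and input here are made up for illustration.

```python
# Sketch of inference: the weights are frozen constants baked into the
# program. There is no loss, no backpropagation, no weight update --
# we only evaluate the network on new input (the forward pass).
WEIGHTS = [[0.2, -0.5], [0.8, 0.1], [-0.3, 0.9]]   # frozen: never updated
BIAS = [0.1, -0.2, 0.05]

def forward(x):
    # one dense layer: out[i] = sum_j WEIGHTS[i][j] * x[j] + BIAS[i]
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(WEIGHTS, BIAS)]

print(forward([1.0, 2.0]))
```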
Edge AI is the combination of the two worlds we have described: running AI inference (neural network forward pass) directly on an embedded device at the edge, without sending data to the cloud. The promise is compelling — real-time response, offline operation, data privacy — but realising it requires solving two fundamental problems: size and hardware compatibility with the target accelerator.
When we talk about deployment in this context, we mean the complete process of taking a trained neural network model and making it run correctly and efficiently on a specific piece of embedded hardware. This is not as simple as copying a file: the model must be converted, compressed, compiled into C code, and flashed onto the device. The rest of this documentation describes exactly how.
The models produced by training on a GPU — and the ones you download from repositories like HuggingFace or TensorFlow Hub — store every weight as a 32-bit floating-point number (float32). This format provides excellent numerical precision but has a critical cost: 4 bytes per value.
Consider MoveNet Lightning, a model deliberately designed for edge deployment: in float32 it still weighs ~11.7 MB. And size is only half the problem. The STM32N6 NPU (Neural-ART) only accelerates integer operations: a float32 model cannot use it at all — every layer falls back to the general-purpose CPU, making real-time inference impossible for any non-trivial model.
The solution is quantization: convert the weights from float32 to INT8. The model shrinks 4× and becomes NPU-compatible in a single step.
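The 4x figure follows directly from the storage sizes. As a quick sanity check (the 11.7 MB value is the one quoted above for MoveNet Lightning):

```python
# Back-of-the-envelope check of the 4x claim: float32 stores each
# weight in 4 bytes, INT8 in 1 byte, so the same number of weights
# occupies a quarter of the space.
FLOAT32_BYTES = 4
INT8_BYTES = 1

movenet_f32_mb = 11.7                      # MoveNet Lightning, float32
movenet_int8_mb = movenet_f32_mb * INT8_BYTES / FLOAT32_BYTES
print(movenet_int8_mb)                     # roughly 2.9 MB
```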
Quantization converts a neural network from high-precision floating-point (float32, 4 bytes per value) to low-precision integer format (INT8, 1 byte per value). This is not just compression — it is a change in numerical representation that must preserve the model's behaviour as closely as possible.
The quantization formula maps each float32 value `x` to an INT8 integer `q` using a scale `s` and a zero-point `z`:

`q = clamp(round(x / s) + z, -128, 127)`

Dequantization approximately inverts it: `x ≈ s * (q - z)`.
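A minimal Python sketch of this mapping, using the standard affine scale/zero-point scheme. The scale and zero-point values here are illustrative; real toolchains (the TFLite converter, ONNX quantizers) derive them from calibration data, often per tensor or per channel.

```python
# Affine (scale/zero-point) quantization: map a float32 value to an
# INT8 integer and back, and observe the rounding error.
def quantize(x, scale, zero_point):
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))          # clamp to the INT8 range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zero_point = 0.05, 3                # illustrative values
x = 1.2345
q = quantize(x, scale, zero_point)
x_back = dequantize(q, scale, zero_point)
print(q, x_back, abs(x - x_back))          # error is bounded by scale/2
```

The round trip loses at most half a quantization step, which is why a well-chosen scale keeps the INT8 model's behaviour close to the float32 original.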
Post-Training Quantization (PTQ) is applied after training. A small calibration dataset (a few hundred representative images) is run through the model to measure the range of each activation. Scale and zero-point are computed from these statistics. No retraining is required. The slight accuracy loss is acceptable for most embedded applications.
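A simplified sketch of the calibration step, assuming a basic asymmetric min/max scheme: observe the range of an activation over the calibration set, then derive the scale and zero-point that map that range onto the INT8 interval. Real tools also handle outliers and per-channel ranges.

```python
# Calibration for post-training quantization: derive scale and
# zero-point from the observed min/max of an activation.
def calibrate(samples, qmin=-128, qmax=127):
    lo, hi = min(samples), max(samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)    # the range must contain zero
    scale = (hi - lo) / (qmax - qmin)      # float units per integer step
    zero_point = round(qmin - lo / scale)  # integer that represents 0.0
    return scale, zero_point

# Pretend these are activation values seen while running calibration images.
activations = [-0.4, 0.1, 0.8, 1.9, 0.3]
scale, zp = calibrate(activations)
print(scale, zp)
```

Forcing the range to contain zero guarantees that the float value 0.0 (common in padded inputs and ReLU outputs) maps exactly onto an integer, with no rounding error.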
Quantization-Aware Training (QAT) simulates quantization during training. The model learns to operate under low-precision constraints and compensates for rounding errors. It achieves better accuracy than PTQ but requires access to the full training pipeline and retraining time. Not used in this project.
| Format | Framework | Quantization | Used for |
|---|---|---|---|
| .tflite | TensorFlow Lite | INT8 (PTQ and QAT) | MoveNet Lightning, YOLOv8n-pose |
| .onnx | Open Neural Network Exchange | QDQ nodes | TinyBERT |
| .h5 | Keras / TensorFlow | FP32 only (requires conversion) | Training intermediate |
We now have all the pieces. The objective of this project, stated with full clarity: take pretrained models (MoveNet Lightning, YOLOv8n-pose, TinyBERT), quantize them to INT8, and deploy them on the STM32N6570-DK so that inference runs in real time on the Neural-ART NPU.
With this objective clearly in mind, the next chapter examines the hardware itself in detail — the board, the NPU architecture, and the memory hierarchy that shapes every deployment decision we made.