
Chapter 1 — Introduction

From Edge Computing to Neural Network Deployment

A guided journey from the fundamentals of embedded computing and artificial intelligence to the practical challenge of running a neural network on a microcontroller in real time.

🔔 Before we start — a note for everyone
If you have never heard of microcontrollers, neural networks, or quantization — do not worry. Unlike most official documentation, this guide is designed to be read by anyone curious enough to try. We will build every concept from scratch, combining theory and practice in a single narrative. We will look at the big picture first, then zoom in — all the way to reading the code line by line, understanding what happens, where, and why.

So, without fear — let’s begin.
Part 1 — From Edge Computing to the STM32N6570-DK

1.0 — Edge Computing

We begin by understanding the field we are working in: Edge Computing. According to Karim Arabi (IEEE DAC, 2014), edge computing encompasses "all computation outside the cloud that occurs at the edges of the network and, more specifically, in applications where real-time data processing is required."

In plain terms: edge computing is computation that happens physically close to where data is generated — inside a sensor, a camera, or a smartphone — rather than being sent to a remote server (the cloud) for processing. The key benefit is immediacy: when the computation happens locally, there is no network delay, no dependency on connectivity, and no privacy risk from transmitting raw data.

[Figure: cloud vs edge. Cloud path: camera data → Internet (~100 ms) → cloud server; latency high, result delayed, connection required. Edge path: camera data → MCU + NPU; latency <20 ms, result immediate, no internet needed.]
Cloud computing sends raw data to remote servers across two network hops (high latency); edge computing processes data locally on the device, in a single step (real-time).

1.1 — Embedded Systems

What makes edge computing physically possible is the existence of embedded systems — the hardware on which local computation runs. An embedded system is a computing system designed to perform one specific function within a larger apparatus. Unlike a PC — which must be able to do everything — an embedded system is built for a single purpose: controlling a drone, managing ABS braking, or monitoring reactor temperature.

Depending on the complexity of the task, the designer chooses between two main types of processing unit: a Microcontroller (MCU) or a Microprocessor (MPU). Understanding the difference is essential, because the protagonist of our project — the STM32N6570-DK — is an MCU.

⚡ Microcontroller (MCU)
  • Everything on a single chip: CPU, RAM, Flash, peripherals
  • Runs a simple firmware directly — no operating system needed
  • Very low power consumption (µW to mW range)
  • Deterministic, real-time behaviour
  • Typical RAM: KB to a few MB
  • Our board: STM32N6570-DK
🖥 Microprocessor (MPU)
  • CPU only — needs external RAM, storage, peripherals
  • Runs a full OS (Linux, Android)
  • Higher performance, higher power draw (W range)
  • More flexible — can run multiple applications simultaneously
  • Typical RAM: hundreds of MB to GB
  • Example: Raspberry Pi, STM32MP257

1.2 — Inside a Microcontroller

A microcontroller integrates an entire computing system on a single silicon chip. Think of it as a tiny city: different districts handle different responsibilities, connected by internal roads (buses) along which data travels. Here are the main components and their roles — described with the analogy we found most useful during our own learning.

[Figure: internal architecture of a microcontroller on a single chip. CPU core (ALU, FPU, L1 I+D cache), memory (SRAM, Flash, TCM fast path), and the Neural-ART NPU (CONVACC ×4, pooling/activation units, 600 GOPS INT8) sit on the AXI bus, the internal communication highway, together with the peripherals and I/O: GPIO/UART, MIPI CSI-2 (camera), timers, ADC/DAC (analog), LTDC (display), XSPI (flash/RAM). The pins connect to camera, display, sensors, flash memory, USB.]
Internal architecture of a microcontroller. The AXI bus sits at the centre, connecting compute and memory (above) with peripherals (below). The STM32N6 adds a dedicated NPU (Neural-ART) for AI acceleration.
Component | Analogy | Role
CPU | The Manager | Reads instructions, coordinates all other components, decides what runs when.
ALU | The Integer Accountant | Handles integer arithmetic (add, subtract) and logical operations (AND, OR, NOT).
FPU | The Decimal Specialist | Handles floating-point numbers with precision. Critical for neural network output decoding (postprocessing).
SRAM | The Desk | Fast working memory for data currently in use. Volatile — cleared on power off. On the STM32N6: 4.2 MB split across banks.
Flash | The Secure Archive | Non-volatile storage for firmware and model weights. On the STM32N6: 128 MB OctoFlash — holds all three of our models.
Cache L1 | The Jacket Pocket | Tiny, ultra-fast memory right next to the CPU. Stores frequently accessed data to avoid going back to SRAM every time.
TCM | The Express Lane | Memory directly wired to the CPU, bypassing the AXI bus. Used for interrupt handlers and time-critical code.
Peripherals | The Senses | GPIO, MIPI CSI-2 (camera), LTDC (display), XSPI (external flash) — connect the chip to the physical world.
NPU | The AI Expert | Built specifically for neural network inference. While the CPU struggles with billions of multiplications per second, the NPU is designed exactly for this. On the STM32N6: 600 GOPS INT8 — the star of our project.
Part 2 — From Artificial Intelligence to Neural Networks

2.0 — Artificial Intelligence

Artificial Intelligence (AI) is the broad field concerned with building systems that can perform tasks that would normally require human intelligence: recognising objects, understanding language, making decisions under uncertainty. The term dates back to 1956 but has seen explosive growth in the last decade, driven almost entirely by one of its subfields: Machine Learning.

[Figure: three nested circles. Artificial Intelligence (any technique enabling machines to mimic human intelligence) contains Machine Learning (systems that learn from data without explicit programming), which contains Deep Learning (neural networks with many layers — our focus).]
Deep Learning is a subset of Machine Learning, which is a subset of AI. Our project lives entirely within the Deep Learning circle.

2.1 — Machine Learning

Machine Learning (ML) is the subfield of AI where systems learn from data rather than following hand-written rules. Instead of a programmer writing "if the image contains pointy ears and fur, it is a cat", a machine learning system is shown thousands of cat images and learns on its own which patterns matter.

The learning happens through a process called training: the system is repeatedly shown examples, makes a prediction, receives feedback on how wrong it was (the loss), and adjusts its internal parameters to do better next time. After thousands of iterations, the parameters converge to values that make the system perform well on new, unseen data. These final parameter values are what we call a model.
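
To make the loop concrete, here is a minimal Python sketch: a toy one-parameter "network" invented purely for illustration, not anything from our actual pipeline.

```python
# Toy training loop: learn a single weight w so that w * x matches y_true.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # training examples
y_true = np.array([2.0, 4.0, 6.0, 8.0])   # targets (hidden rule: y = 2x)

w = 0.0      # the model's only parameter
lr = 0.01    # learning rate: how large each correction step is

for step in range(1000):
    y_pred = w * x                               # 1. make a prediction
    loss = np.mean((y_pred - y_true) ** 2)       # 2. measure how wrong it was
    grad = np.mean(2 * (y_pred - y_true) * x)    # 3. gradient of the loss w.r.t. w
    w -= lr * grad                               # 4. adjust the parameter

# w converges to ~2.0; that final value is the "model".
```

A real network does exactly this with millions of parameters, using backpropagation to compute all the gradients at once.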

2.2 — Deep Learning and Neural Networks

Deep Learning is a branch of Machine Learning that uses artificial neural networks — computational models loosely inspired by the structure of the human brain — as its learning architecture. The word "deep" refers to the many layers these networks are composed of: each layer transforms the data slightly, and together they build up increasingly abstract representations.

[Figure: the training loop. Input data (image, text...) → neural network (many layers) → prediction (e.g. "cat 92%") → loss (how wrong?) → weights updated via backpropagation; repeated for thousands of iterations until the loss converges.]

2.2.1 — Types of Neural Networks: CNNs and Transformers

Not all neural networks are built the same way. Two architectures dominate modern deep learning — and understanding their structural difference is the key to understanding our results:

CNN — Convolutional Neural Network

Processes data through sliding convolutional filters that scan the input spatially. Each filter detects a local pattern (an edge, a texture, a shape). Layers are stacked to detect increasingly complex features.

Core operation: convolution (sum of element-wise products)
Strength: spatial data — images, video
NPU suitability: Excellent
Our models: MoveNet Lightning, YOLOv8n-pose
Transformer

Processes data through self-attention: every element in the input attends to every other element simultaneously, capturing global dependencies regardless of distance. Originally designed for text, now dominant in language models.

Core operation: matrix multiplication (Q·Kᵀ), Softmax
Strength: sequential data — text, audio
NPU suitability: Partial
Our model: TinyBERT
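
To see the structural difference in code rather than prose, here is a minimal numpy sketch of the two core operations (toy sizes and random values, purely illustrative; these are not our actual models):

```python
import numpy as np

# --- CNN core op: convolution = sum of element-wise products over a window ---
image = np.random.rand(5, 5)      # tiny grayscale "image"
kernel = np.random.rand(3, 3)     # one 3x3 convolutional filter
conv_out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]          # local 3x3 window
        conv_out[i, j] = np.sum(patch * kernel)  # element-wise product, summed

# --- Transformer core op: scaled dot-product self-attention ---
Q = np.random.rand(4, 8)          # 4 tokens, 8-dim queries
K = np.random.rand(4, 8)          # keys
V = np.random.rand(4, 8)          # values
scores = Q @ K.T / np.sqrt(8)                    # every token vs every other token
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row
attended = weights @ V            # each token becomes a weighted mix of all tokens
```

Note the shape of the work: the convolution only ever touches a local window, a pattern the NPU's convolution accelerators are built for, while attention compares every element against every other one. This is part of why the two architectures fare so differently on this hardware.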

2.2.2 — Pretrained Models and Inference

Training a neural network from scratch requires enormous datasets, powerful GPUs, and days or weeks of computation. In most real-world applications — and in this project — we skip training entirely and use pretrained models: models that have already been trained by research teams on large datasets (ImageNet, COCO) and whose weights have been made publicly available.

Once a model is trained, using it to make predictions on new data is called inference. Inference is computationally much cheaper than training — there is no backpropagation, no gradient computation, no weight update. The model's weights are frozen; we only run the forward pass. This is what our board does: it receives a camera frame and runs inference to produce keypoint coordinates — all in under 32 ms.
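
To show how little is involved, here is a minimal PC-side inference sketch using the TensorFlow Lite interpreter (the same .tflite format our models use; the file name and dummy input are placeholders). On the board, the equivalent forward pass is C code generated by ST Edge AI Core, but the logic is identical:

```python
import numpy as np
import tensorflow as tf

# Load a quantized model: frozen weights, ready for the forward pass only.
interpreter = tf.lite.Interpreter(model_path="movenet_int8.tflite")  # placeholder path
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Stand-in for a camera frame, matching the model's expected shape and dtype.
frame = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()                              # the forward pass: no gradients, no updates
keypoints = interpreter.get_tensor(out["index"])  # e.g. keypoint coordinates
```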

Part 3 — Edge AI: The Intersection

3.0 — What is Edge AI?

Edge AI is the combination of the two worlds we have described: running AI inference (neural network forward pass) directly on an embedded device at the edge, without sending data to the cloud. The promise is compelling — real-time response, offline operation, data privacy — but realising it requires solving two fundamental problems: size and hardware compatibility with the target accelerator.

When we talk about deployment in this context, we mean the complete process of taking a trained neural network model and making it run correctly and efficiently on a specific piece of embedded hardware. This is not as simple as copying a file: the model must be converted, compressed, compiled into C code, and flashed onto the device. The rest of this documentation describes exactly how.

3.1 — Why Pretrained Models Cannot Run on a Microcontroller as They Are

The models produced by training on a GPU — and the ones you download from repositories like HuggingFace or TensorFlow Hub — store every weight as a 32-bit floating-point number (float32). This format provides excellent numerical precision but has a critical cost: 4 bytes per value.

Consider MoveNet Lightning, a model deliberately designed for edge deployment: at 4 bytes per weight, its roughly three million parameters still weigh ~11.7 MB in float32. And size is only half the problem. The STM32N6 NPU (Neural-ART) only accelerates integer operations: a float32 model cannot use it at all — every layer falls back to the general-purpose CPU, making real-time inference impossible for any non-trivial model.

[Figure: size comparison. MoveNet Lightning (float32): ~11.7 MB, ✖ NPU-incompatible, every layer falls back to the CPU. MoveNet Lightning (INT8 quantized): 2.9 MB, ✓ NPU-ready, 600 GOPS hardware acceleration.]
Quantization shrinks the model 4× and unlocks NPU acceleration — the STM32N6 Neural-ART only accelerates INT8 operations.

The solution is quantization: convert the weights from float32 to INT8. The model shrinks 4× and becomes NPU-compatible in a single step.

3.2 — INT8 Quantization

Quantization converts a neural network from high-precision floating-point (float32, 4 bytes per value) to low-precision integer format (INT8, 1 byte per value). This is not just compression — it is a change in numerical representation that must preserve the model's behaviour as closely as possible.

  • 4× smaller model: 4 bytes → 1 byte per weight
  • ~5× faster inference: INT8 vs FP32 arithmetic
  • <1% accuracy loss: with a calibration dataset

The quantization formula maps each float32 value to an INT8 integer:

x_INT8 = round(x_float32 / S + Z)
S = scale factor | Z = zero-point | both determined during calibration on a representative dataset
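
A quick numeric sketch of this mapping, with scale, zero-point, and values all invented for illustration:

```python
import numpy as np

S, Z = 0.05, -3   # scale and zero-point, as determined during calibration
x = np.array([-1.0, 0.0, 0.72, 2.5], dtype=np.float32)

q = np.clip(np.round(x / S + Z), -128, 127).astype(np.int8)   # quantize
x_back = (q.astype(np.float32) - Z) * S                       # dequantize

# x      = [-1.0, 0.0, 0.72, 2.5]
# q      = [-23, -3, 11, 47]        → 1 byte each instead of 4
# x_back = [-1.0, 0.0, 0.70, 2.5]   → 0.72 came back as 0.70: rounding error
```

The round() is where the small accuracy loss comes from: nearby float values collapse onto the same integer step.
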
Post-Training Quantization (PTQ) — used in this project

Applied after training. A small calibration dataset (a few hundred representative images) is run through the model to measure the range of each activation. Scale and zero-point are computed from these statistics. No retraining required. Slight accuracy loss — acceptable for most embedded applications.
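
We rely on the toolchain described in later chapters for this step, but a plausible minimal PTQ recipe with the TensorFlow Lite converter looks roughly like the following (the model path is a placeholder, and the calibration images are random stand-ins; in practice you would use a few hundred real, representative images):

```python
import numpy as np
import tensorflow as tf

# Stand-in calibration set; 192x192x3 matches MoveNet Lightning's input.
calibration_images = np.random.rand(200, 192, 192, 3).astype("float32")

def representative_dataset():
    # The converter runs these through the model to measure activation
    # ranges, from which it derives scale S and zero-point Z per tensor.
    for image in calibration_images:
        yield [image[None, ...]]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # fully-integer I/O, as the NPU requires
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```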

Quantization-Aware Training (QAT)

Simulates quantization during training. The model learns to operate under low-precision constraints and compensates for rounding errors. Better accuracy than PTQ, but requires access to the full training pipeline and retraining time. Not used in this project.

Model formats used in this project

Format | Framework | Quantization | Used for
.tflite | TensorFlow Lite | INT8 (PTQ and QAT) | MoveNet Lightning, YOLOv8n-pose
.onnx | Open Neural Network Exchange | QDQ nodes | TinyBERT
.h5 | Keras / TensorFlow | FP32 only (requires conversion) | Training intermediate
Important: The STM32N6 NPU only accelerates INT8 operations. A float32 model cannot use the NPU hardware at all — every layer would execute on the CPU, making real-time inference impossible for any non-trivial model.
Part 4 — Putting It All Together: Our Objective

We now have all the pieces. Let us restate the objective of this project with full clarity:

Deploy three pretrained neural network models of increasing architectural complexity on the STM32N6570-DK — a microcontroller with a dedicated Neural-ART NPU — and measure how well the NPU handles each architecture. Two models are CNN-based (MoveNet Lightning and YOLOv8n-pose); one is a Transformer (TinyBERT). Each model is quantized to INT8, converted to optimised C code by ST Edge AI Core, compiled and flashed via STM32CubeIDE, and validated with live camera input on the board.

With this objective clearly in mind, the next chapter examines the hardware itself in detail — the board, the NPU architecture, and the memory hierarchy that shapes every deployment decision we made.
