STM32N6 NPU Deployment — Politecnico di Milano, version 1.0
Documentation for Neural Network Deployment on the STM32N6 NPU, Politecnico di Milano, 2024-2025
A guided journey from the fundamentals of embedded computing and artificial intelligence to the practical challenge of running a neural network on a microcontroller in real time.
We begin by understanding the field we are working in: Edge Computing. According to Karim Arabi (IEEE DAC, 2014), edge computing encompasses "all computation outside the cloud that occurs at the edges of the network and, more specifically, in applications where real-time data processing is required."
In plain terms: edge computing is computation that happens physically close to where data is generated — inside a sensor, a camera, or a smartphone — rather than being sent to a remote server (the cloud) for processing. The key benefit is immediacy: when the computation happens locally, there is no network delay, no dependency on connectivity, and no privacy risk from transmitting raw data.
What makes edge computing physically possible is the existence of embedded systems — the hardware on which local computation runs. An embedded system is a computing system designed to perform one specific function within a larger apparatus. Unlike a PC — which must be able to do everything — an embedded system is built for a single purpose: controlling a drone, managing ABS braking, or monitoring reactor temperature.
Depending on the complexity of the task, the designer chooses between two main types of processing unit: a Microcontroller (MCU) or a Microprocessor (MPU). Understanding the difference is essential, because the protagonist of our project — the STM32N6570-DK — is an MCU.
A microcontroller integrates an entire computing system on a single silicon chip. Think of it as a tiny city: different districts handle different responsibilities, connected by internal roads (buses) along which data travels. Here are the main components and their roles — described with the analogy we found most useful during our own learning.
| Component | Analogy | Role |
|---|---|---|
| CPU | The Manager | Reads instructions, coordinates all other components, decides what runs when. |
| ALU | The Integer Accountant | Handles integer arithmetic (add, subtract) and logical operations (AND, OR, NOT). |
| FPU | The Decimal Specialist | Handles floating-point numbers with precision. Critical for neural network output decoding (postprocessing). |
| SRAM | The Desk | Fast working memory for data currently in use. Volatile — cleared on power off. On the STM32N6: 4.2 MB split across banks. |
| Flash | The Secure Archive | Non-volatile storage for firmware and model weights. On the STM32N6: 128 MB OctoFlash — holds all three of our models. |
| Cache L1 | The Jacket Pocket | Tiny, ultra-fast memory right next to the CPU. Stores frequently accessed data to avoid going back to SRAM every time. |
| TCM | The Express Lane | Memory directly wired to the CPU, bypassing the AXI bus. Used for interrupt handlers and time-critical code. |
| Peripherals | The Senses | GPIO, MIPI CSI-2 (camera), LTDC (display), XSPI (external flash) — connect the chip to the physical world. |
| NPU | The AI Expert | Built specifically for neural network inference. While the CPU struggles with billions of multiplications per second, the NPU is designed exactly for this. On the STM32N6: 600 GOPS INT8 — this is the star of our project. |
Artificial Intelligence (AI) is the broad field concerned with building systems that can perform tasks that would normally require human intelligence: recognising objects, understanding language, making decisions under uncertainty. The term dates back to 1956 but has seen explosive growth in the last decade, driven almost entirely by one of its subfields: Machine Learning.
Machine Learning (ML) is the subfield of AI where systems learn from data rather than following hand-written rules. Instead of a programmer writing "if the image contains pointy ears and fur, it is a cat", a machine learning system is shown thousands of cat images and learns on its own which patterns matter.
The learning happens through a process called training: the system is repeatedly shown examples, makes a prediction, receives feedback on how wrong it was (the loss), and adjusts its internal parameters to do better next time. After thousands of iterations, the parameters converge to values that make the system perform well on new, unseen data. These final parameter values are what we call a model.
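As a concrete illustration of the predict, measure, adjust loop described above, here is a minimal Python sketch: a one-parameter model fitted by gradient descent on toy data. Everything in it (the data, the learning rate, the squared loss) is illustrative and not part of this project.

```python
# Minimal sketch of the training loop: predict, measure the loss,
# adjust parameters. The "model" is a single weight w; the dataset
# is pairs (x, 2x), so training should converge toward w ≈ 2.
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0            # the model's only parameter, arbitrarily initialised
lr = 0.01          # learning rate: how big each correction step is

for epoch in range(200):
    for x, target in data:
        pred = w * x                 # forward pass: make a prediction
        error = pred - target        # how wrong was it?
        grad = 2 * error * x         # gradient of the squared loss w.r.t. w
        w -= lr * grad               # adjust the parameter

print(round(w, 3))   # converges close to 2.0
```

After enough iterations the parameter settles on a value that fits the data; that final value is, in miniature, what the document calls a model.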
Deep Learning is a branch of Machine Learning that uses artificial neural networks — computational models loosely inspired by the structure of the human brain — as their learning architecture. The word "deep" refers to the many layers these networks are composed of: each layer transforms the data slightly, and together they build up increasingly abstract representations.
Not all neural networks are built the same way. Two architectures dominate modern deep learning — and understanding their structural difference is the key to understanding our results:
Convolutional Neural Networks (CNNs) process data through sliding convolutional filters that scan the input spatially. Each filter detects a local pattern (an edge, a texture, a shape). Layers are stacked to detect increasingly complex features.
Transformers process data through self-attention: every element in the input attends to every other element simultaneously, capturing global dependencies regardless of distance. Originally designed for text, they are now dominant in language models.
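To make the sliding-filter idea concrete, here is a minimal Python sketch of a 2D convolution: a hand-written 3x3 vertical-edge kernel (illustrative values, not taken from any real model) swept across a tiny image.

```python
# Sketch of the "sliding filter" idea behind CNNs: a small kernel is
# swept across the input, producing one output value per position.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
kernel = [        # responds strongly where left and right columns differ
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def conv2d(img, ker):
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the window by the kernel and sum
            out[i][j] = sum(
                img[i + a][j + b] * ker[a][b]
                for a in range(kh) for b in range(kw)
            )
    return out

print(conv2d(image, kernel))
```

The filter fires strongly wherever the image changes from dark to bright, which is exactly the "local pattern detector" behaviour described above; a real CNN learns the kernel values during training instead of having them hand-written.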
Training a neural network from scratch requires enormous datasets, powerful GPUs, and days or weeks of computation. In most real-world applications — and in this project — we skip training entirely and use pretrained models: models that have already been trained by research teams on large datasets (ImageNet, COCO) and whose weights have been made publicly available.
Once a model is trained, using it to make predictions on new data is called inference. Inference is computationally much cheaper than training — there is no backpropagation, no gradient computation, no weight update. The model's weights are frozen; we only run the forward pass. This is what our board does: it receives a camera frame and runs inference to produce keypoint coordinates — all in under 32 ms.
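To make "forward pass only" concrete, here is a toy Python sketch: the weights are fixed constants, there is no gradient or update step, the program just evaluates the layer on new input. The weights and input here are made up for illustration.

```python
# Sketch of inference: the weights are frozen constants baked into the
# program. There is no loss, no backpropagation, no weight update --
# we only evaluate the network on new input (the forward pass).
WEIGHTS = [[0.2, -0.5], [0.8, 0.1], [-0.3, 0.9]]   # frozen: never updated
BIAS = [0.1, -0.2, 0.05]

def forward(x):
    # one dense layer: out[i] = sum_j WEIGHTS[i][j] * x[j] + BIAS[i]
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(WEIGHTS, BIAS)]

print(forward([1.0, 2.0]))
```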
Edge AI is the combination of the two worlds we have described: running AI inference (neural network forward pass) directly on an embedded device at the edge, without sending data to the cloud. The promise is compelling — real-time response, offline operation, data privacy — but realising it requires solving two fundamental problems: size and hardware compatibility with the target accelerator.
When we talk about deployment in this context, we mean the complete process of taking a trained neural network model and making it run correctly and efficiently on a specific piece of embedded hardware. This is not as simple as copying a file: the model must be converted, compressed, compiled into C code, and flashed onto the device. The rest of this documentation describes exactly how.
The models produced by training on a GPU — and the ones you download from repositories like HuggingFace or TensorFlow Hub — store every weight as a 32-bit floating-point number (float32). This format provides excellent numerical precision but has a critical cost: 4 bytes per value.
Consider MoveNet Lightning, a model deliberately designed for edge deployment: in float32 it still weighs ~11.7 MB. And size is only half the problem. The STM32N6 NPU (Neural-ART) only accelerates integer operations: a float32 model cannot use it at all — every layer falls back to the general-purpose CPU, making real-time inference impossible for any non-trivial model.
The solution is quantization: convert the weights from float32 to INT8. The model shrinks 4× and becomes NPU-compatible in a single step.
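The 4x figure follows directly from the storage sizes. As a quick sanity check (the 11.7 MB value is the one quoted above for MoveNet Lightning):

```python
# Back-of-the-envelope check of the 4x claim: float32 stores each
# weight in 4 bytes, INT8 in 1 byte, so the same number of weights
# occupies a quarter of the space.
FLOAT32_BYTES = 4
INT8_BYTES = 1

movenet_f32_mb = 11.7                      # MoveNet Lightning, float32
movenet_int8_mb = movenet_f32_mb * INT8_BYTES / FLOAT32_BYTES
print(movenet_int8_mb)                     # roughly 2.9 MB
```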
Quantization converts a neural network from high-precision floating-point (float32, 4 bytes per value) to low-precision integer format (INT8, 1 byte per value). This is not just compression — it is a change in numerical representation that must preserve the model's behaviour as closely as possible.
The quantization formula maps each float32 value `x` to an INT8 integer `q` using a scale `s` and a zero-point `z`:

`q = clamp(round(x / s) + z, -128, 127)`

Dequantization approximately inverts it: `x ≈ s * (q - z)`.
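A minimal Python sketch of this mapping, using the standard affine scale/zero-point scheme. The scale and zero-point values here are illustrative; real toolchains (the TFLite converter, ONNX quantizers) derive them from calibration data, often per tensor or per channel.

```python
# Affine (scale/zero-point) quantization: map a float32 value to an
# INT8 integer and back, and observe the rounding error.
def quantize(x, scale, zero_point):
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))          # clamp to the INT8 range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zero_point = 0.05, 3                # illustrative values
x = 1.2345
q = quantize(x, scale, zero_point)
x_back = dequantize(q, scale, zero_point)
print(q, x_back, abs(x - x_back))          # error is bounded by scale/2
```

The round trip loses at most half a quantization step, which is why a well-chosen scale keeps the INT8 model's behaviour close to the float32 original.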
Post-Training Quantization (PTQ) is applied after training. A small calibration dataset (a few hundred representative images) is run through the model to measure the range of each activation. Scale and zero-point are computed from these statistics. No retraining is required. The slight accuracy loss is acceptable for most embedded applications.
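A simplified sketch of the calibration step, assuming a basic asymmetric min/max scheme: observe the range of an activation over the calibration set, then derive the scale and zero-point that map that range onto the INT8 interval. Real tools also handle outliers and per-channel ranges.

```python
# Calibration for post-training quantization: derive scale and
# zero-point from the observed min/max of an activation.
def calibrate(samples, qmin=-128, qmax=127):
    lo, hi = min(samples), max(samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)    # the range must contain zero
    scale = (hi - lo) / (qmax - qmin)      # float units per integer step
    zero_point = round(qmin - lo / scale)  # integer that represents 0.0
    return scale, zero_point

# Pretend these are activation values seen while running calibration images.
activations = [-0.4, 0.1, 0.8, 1.9, 0.3]
scale, zp = calibrate(activations)
print(scale, zp)
```

Forcing the range to contain zero guarantees that the float value 0.0 (common in padded inputs and ReLU outputs) maps exactly onto an integer, with no rounding error.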
Quantization-Aware Training (QAT) simulates quantization during training. The model learns to operate under low-precision constraints and compensates for rounding errors. It achieves better accuracy than PTQ but requires access to the full training pipeline and retraining time. Not used in this project.
| Format | Framework | Quantization | Used for |
|---|---|---|---|
| .tflite | TensorFlow Lite | INT8 (PTQ and QAT) | MoveNet Lightning, YOLOv8n-pose |
| .onnx | Open Neural Network Exchange | QDQ nodes | TinyBERT |
| .h5 | Keras / TensorFlow | FP32 only (requires conversion) | Training intermediate |
We now have all the pieces. The objective of this project, stated with full clarity: take pretrained models (MoveNet Lightning, YOLOv8n-pose, TinyBERT), quantize them to INT8, and deploy them on the STM32N6570-DK so that inference runs in real time on the Neural-ART NPU.
With this objective clearly in mind, the next chapter examines the hardware itself in detail — the board, the NPU architecture, and the memory hierarchy that shapes every deployment decision we made.