STM32N6 NPU Deployment — Politecnico di Milano (v1.0)
Documentation for Neural Network Deployment on STM32N6 NPU - Politecnico di Milano 2024-2025
Chapter 2 — Hardware: The STM32N6570-DK


A detailed tour of the hardware platform — from the physical board to the silicon architecture of the Neural-ART NPU — building the mental model needed to understand every deployment decision in this project.

Cortex-M55 • 800 MHz
Neural-ART NPU • 600 GOPS
4.2 MB SRAM • 128 MB Flash
VFBGA264 • STM32N657X0H3Q
Section 2.0
The STM32N6570-DK Discovery Kit — What It Is

The STM32N6570-DK Discovery Kit is a complete development and demonstration platform built around the STM32N657X0H3Q microcontroller — the first STMicroelectronics MCU to integrate a dedicated Neural Processing Unit (NPU) for on-device AI inference.

A Discovery Kit is more than a bare microcontroller — it is a complete system that surrounds the MCU with all the peripherals needed to explore its capabilities immediately: a camera, a display, external memory, a debugger, and expansion connectors. This means we can run a full pose estimation pipeline — camera input, AI inference, display output — out of the box, without designing any custom hardware.

STM32N6570-DK — Top view
STM32N6570-DK — Bottom view
Source: STMicroelectronics UM3300 User Manual, December 2024
Section 2.1
Board Components — What Is on the Board and Why

Understanding the board layout is important because many of the choices in our firmware — where data lives, how the camera feeds the NPU, why we need two XSPI interfaces — are direct consequences of the physical hardware. Let us walk through the key components.

Annotated board diagram showing the key components and their connections to the MCU.
| Component | Specification | Role in this project |
|---|---|---|
| STM32N657X0H3Q | Cortex-M55 @ 800 MHz, Neural-ART NPU, 4.2 MB SRAM | Runs all inference, camera control, and display output |
| OctoFlash (XSPI2) | 1 Gbit = 128 MB, OPI mode, DTR transfers | Stores all model weights. Mapped at 0x70000000; streamed to the NPU during inference |
| HexaRAM PSRAM (XSPI1) | 256 Mbit = 32 MB, HexaDecaSPI | Stores the LCD framebuffers (800×480×2 bytes each), too large for internal SRAM |
| LCD display | 5", 800×480, capacitive touch, RGB565/ARGB4444 | Shows the camera preview (Layer 1) and the skeleton overlay (Layer 2) simultaneously |
| Camera connector | MIPI CSI-2, 2-lane | Receives frames from the B-CAMS-IMX module and feeds the DCMIPP for dual-pipe processing |
| STLINK-V3EC | Embedded debugger/programmer, USB Virtual COM | Flashes firmware from STM32CubeIDE over USB and provides serial debug output |
| Boot switches | 2-position slide switches | RIGHT = flashing mode (set before deployment); LEFT = run mode (set to start inference) |
Section 2.2
The B-CAMS-IMX Camera Module
B-CAMS-IMX camera module

The B-CAMS-IMX is the camera daughter board included with the STM32N6570-DK. It carries a Sony IMX sensor and connects to the main board via the MIPI CSI-2 2-lane interface — the same interface used in smartphones for high-speed camera data transfer.

MIPI CSI-2 is a serial differential protocol that transmits pixel data at very high bandwidth with very few wires. Once the data arrives at the MCU, it is processed by the DCMIPP — a dedicated hardware block described in Section 2.6 — which handles format conversion, cropping, and routing to memory without CPU involvement.

Why this matters for our project: the camera delivers raw RGB565 frames. The DCMIPP simultaneously routes one copy to the PSRAM framebuffer (for the LCD background) and one cropped copy to the NPU input buffer (for inference) — all in hardware, without the CPU spending a single cycle on memory copies.
Section 2.3
The STM32N657X0H3Q Microcontroller

The heart of the board is the STM32N657X0H3Q — a 264-ball VFBGA package integrating a complete system-on-chip. Let us zoom in from the package all the way to the silicon.

STM32N657X0H3Q internal block diagram. The Neural-ART NPU sits alongside the Cortex-M55 CPU on the same AXI/AHB interconnect. Main blocks: Cortex-M55 (800 MHz, FPU, Helium SIMD, 32 KB I-cache + 32 KB D-cache); Neural-ART NPU (600 GOPS INT8, 4× CONVACC, POOL/ACTIV units, 10× STRENG stream engines); memory subsystem (4.2 MB SRAM, NPU AXI cache, XSPI1/XSPI2 controllers); H264 encoder; NeoChrom 2.5D GPU; LTDC display controller; peripherals (DCMIPP, USB HS, Ethernet TSN, CAN FD, SPI/I2C/UART, ADC/DAC, timers, SDMMC, JTAG/SWD, ETM, GPIO).

Three aspects of this chip are worth highlighting for our project:

Cortex-M55 + Helium

The M55 is the first Cortex-M core with Helium (MVE) — a SIMD vector extension for signal processing and ML. It lets the CPU operate on 16 INT8 values per 128-bit vector instruction, which is critical for the SW epochs that fall back from the NPU (a short intrinsics sketch follows after these highlights).

Neural-ART NPU

Not a GPU. Not a general-purpose accelerator. A fixed-function hardware block designed specifically for CNN inference — with dedicated units for convolution, pooling, and activation. We will dissect it in detail in Section 2.5.

H264 + NeoChrom GPU

Dedicated video encoder and 2.5D graphics accelerator. Not used in this project but relevant context: this chip is designed for complete vision pipelines — capture, encode, analyse, display — all on a single MCU.
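To make the Helium point above concrete, the sketch below shows an INT8 dot product written with the ACLE MVE intrinsics from <arm_mve.h> — the kind of kernel a SW epoch falls back to on the M55. It is a minimal illustration, not code from this project's firmware; the function name and the assumption that the length is a multiple of 16 are ours.

```c
#include <arm_mve.h>   /* ACLE Helium (MVE) intrinsics, available when targeting Cortex-M55 with MVE */
#include <stdint.h>

/* Illustrative helper (not from the project): INT8 dot product of two buffers.
 * Assumes n is a multiple of 16. Each vmladavaq_s8 multiplies 16 int8 lanes
 * pairwise and accumulates them into acc, i.e. 16 INT8 MACs per vector operation. */
static int32_t dot_s8(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vldrbq_s8(&a[i]);   /* load 16 signed bytes */
        int8x16_t vb = vldrbq_s8(&b[i]);
        acc = vmladavaq_s8(acc, va, vb);   /* acc += sum over lanes of va * vb */
    }
    return acc;
}
```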

Section 2.4
Memory Architecture — Where Everything Lives

Understanding the memory hierarchy is not optional — it is the key to understanding why the firmware is written the way it is. Every buffer placement decision in main.c, every GCC section attribute in network_data_params.c, and every cache invalidation call is a direct consequence of this memory map.
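To make this concrete, the fragment below shows the kind of placement the firmware relies on: GCC section attributes pin each buffer to a memory region, and the linker script maps those sections onto physical addresses. It is a sketch, not the project's actual source; the section names (.psram_bss, .npuram4_bss, .npuram5_bss) are illustrative placeholders that must match whatever the real linker script defines.

```c
#include <stdint.h>

/* Sketch only: section names are hypothetical and must match the linker script,
 * which maps each section onto a physical region of the memory map below. */

/* LCD background framebuffer: 800 x 480 RGB565 pixels = 768,000 bytes, far too
 * large for internal SRAM, so it lives in external PSRAM behind XSPI1. */
__attribute__((section(".psram_bss"), aligned(32)))
static uint16_t lcd_bg_buffer[800 * 480];

/* NPU input tensor: 192 x 192 x 3 camera crop, placed in npuRAM4 so the NPU can
 * read it directly over the AXI interconnect. */
__attribute__((section(".npuram4_bss"), aligned(32)))
static uint8_t nn_in[192 * 192 * 3];

/* NPU output tensor: 48 x 48 x 13 float32 heatmaps, placed in npuRAM5. */
__attribute__((section(".npuram5_bss"), aligned(32)))
static float nn_out[48 * 48 * 13];
```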

Memory hierarchy, fastest to slowest: L1 caches (32 KB I + 32 KB D, inside the Cortex-M55, ~1 cycle); internal SRAM, 4.2 MB contiguous (~10 cycles), split into AXISRAM1-2 for CPU activations, npuRAM3-6 (0x34200000 to 0x34350000) for NPU buffers including nn_in (npuRAM4) and nn_out (npuRAM5), and FLEXRAM/TCM for critical code; HexaRAM PSRAM over XSPI1 at 0x90000000 (~20 cycles), holding the LCD framebuffers; OctoFlash over XSPI2 at 0x70000000 (~40 cycles), holding the read-only model weights and firmware. A dedicated NPU AXI cache sits between OctoFlash and the NPU for weight streaming: without it, fetching weights from ~40-cycle Flash would bottleneck the 600 GOPS CONVACC units. Latency increases down the hierarchy, so the firmware minimises cache misses by placing hot data in SRAM.
| Region | Address | Size | Used for (MoveNet) |
|---|---|---|---|
| npuRAM4 | 0x34270000 | 448 KB | nn_in: camera frame, 192×192×3 bytes ≈ 110 KB. Input to the NPU. |
| npuRAM5 | 0x342E0000 | 448 KB | nn_out: heatmaps, 48×48×13 float32 ≈ 120 KB. Output from the NPU. |
| OctoFlash | 0x70380000 | 2.9 MB | Model weights. Read-only; streamed to the CONVACC units via the NPU AXI cache. |
| PSRAM | 0x90000000 | ~1.5 MB | LCD background buffer (800×480×2 bytes) plus double-buffered foreground. |
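For reference, the same map expressed as constants (the macro names are ours, chosen for readability; the project sources may spell them differently):

```c
/* Base addresses from the memory map above; names are illustrative. */
#define NPURAM4_NN_IN_BASE      0x34270000UL  /* nn_in buffer, 448 KB region         */
#define NPURAM5_NN_OUT_BASE     0x342E0000UL  /* nn_out buffer, 448 KB region        */
#define OCTOFLASH_WEIGHTS_BASE  0x70380000UL  /* model weights, read-only, ~2.9 MB   */
#define PSRAM_FRAMEBUFFER_BASE  0x90000000UL  /* LCD background + foreground buffers */
```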
Section 2.5
The Neural-ART NPU — Architecture Deep Dive

This is the most important section of the hardware chapter — and the one that directly explains our experimental results. The Neural-ART NPU is not a general-purpose processor. It is a fixed-function hardware accelerator designed with a very specific computation pattern in mind: the convolution operation that dominates CNN inference.

To understand why it is built this way, we need to understand what a convolution actually does at the hardware level.

What a convolution looks like in hardware

A 2D convolution slides a small filter (e.g. 3×3 pixels) across an input feature map. At each position, it computes a dot product: multiply each filter weight by the corresponding input pixel and sum them all. For a single output pixel with 32 input channels: 3 × 3 × 32 = 288 multiply-accumulate (MAC) operations. For a full 192×192 feature map with 64 output channels: 192 × 192 × 288 × 64 ≈ 680 million MACs per layer.

This is a regular, predictable, data-parallel computation — the exact type that dedicated hardware handles far better than a general CPU. The CONVACC unit is built to execute thousands of these MACs per clock cycle by doing them in parallel.
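As a reference for what the CONVACC units parallelise, the scalar version of this computation is sketched below. It is didactic only (direct convolution, stride 1, no padding, one output channel), not how the NPU or the firmware implements it; the three innermost loops are exactly the 288 MACs per output pixel counted above.

```c
#include <stdint.h>

/* Didactic 3x3 direct convolution producing one output channel.
 * Every iteration of the three inner loops is one multiply-accumulate:
 * 3 * 3 * IN_CH = 288 MACs per output pixel when IN_CH = 32. A CPU executes
 * these one at a time (or 16 per vector op with Helium); the CONVACC units
 * execute thousands of them per clock cycle in parallel. */
#define H      192
#define W      192
#define IN_CH  32
#define K      3

void conv3x3_single_channel(const int8_t in[H][W][IN_CH],
                            const int8_t weights[K][K][IN_CH],
                            int32_t out[H - K + 1][W - K + 1])
{
    for (int y = 0; y < H - K + 1; y++) {
        for (int x = 0; x < W - K + 1; x++) {
            int32_t acc = 0;
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    for (int c = 0; c < IN_CH; c++)
                        acc += (int32_t)in[y + ky][x + kx][c] * weights[ky][kx][c];
            out[y][x] = acc;
        }
    }
}
```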

Neural-ART NPU functional units: 4× CONVACC convolution accelerators (parallel INT8 MACs handling CONV2D, depthwise, and fully-connected layers); 10× STRENG stream engines that prefetch and format weight/activation data from OctoFlash via the NPU AXI cache so the CONVACC units never stall (CNN weight access is sequential, layer by layer, which suits stream engines); plus POOL (MaxPool/AvgPool), ACTIV (ReLU, ReLU6, Sigmoid), and NORM/ADD (BatchNorm, residual add) units. Operations with no matching unit (Softmax, LayerNorm, MatMul, bilinear resize — the Transformer ops) fall back to the CPU as SW epochs. Data flow during inference: OctoFlash (weights) → AXI cache → STRENG → CONVACC → npuRAM (activations), repeated layer by layer from nn_in (npuRAM4) to nn_out (npuRAM5).
✓ Executed on NPU (EC epoch)
  • Conv2D (any kernel size)
  • DepthwiseConv2D
  • PointwiseConv (1×1)
  • MaxPool, AveragePool
  • ReLU, ReLU6, Sigmoid, Tanh
  • BatchNorm, InstanceNorm
  • Element-wise Add (residual)
  • Fully Connected (as Conv)
✗ Falls back to CPU (SW epoch)
  • Softmax — required by attention in Transformers
  • LayerNormalization — used in every Transformer block
  • MatMul — Q·Kᵀ attention product
  • Resize bilinear — used in MoveNet decoder
  • Dequantize (INT8 → float32)
  • Transpose, Reshape (some patterns)
  • 1×1 conv when expressed as Dense
Section 2.6
The DCMIPP — Dual-Pipe Camera Pipeline

The DCMIPP (Digital Camera Memory Interface Pixel Pipeline) is a dedicated hardware block that sits between the MIPI CSI-2 camera interface and the MCU's memory. Its key feature for our project is the ability to run two simultaneous output pipes from a single camera input.

Data path: the B-CAMS-IMX (Sony IMX sensor) sends frames over the 2-lane MIPI CSI-2 link to the DCMIPP (ISP, crop, resize), which drives two output pipes. The display pipe writes lcd_bg_buffer in PSRAM (800×480×2 bytes) in continuous DMA mode, and the LTDC reads it as the LCD background layer. The NN pipe writes nn_in (192×192×3 bytes) in internal SRAM on a snapshot trigger, and the NPU reads it when inference runs via LL_ATON_RT_Main(). Both pipes run simultaneously: the display updates continuously while the NPU processes each frame.
Why two pipes matter: The display pipe runs in continuous mode — the LCD always shows a live camera preview at the native resolution. The NN pipe runs in snapshot mode — one frame is captured and cropped to the neural network input size (192×192 for MoveNet) when triggered by the firmware. This means inference never blocks the display, and the display never blocks inference. Both happen in hardware, with zero CPU involvement for the memory transfers.
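To connect the two pipes to the firmware flow, here is a simplified shape of the application loop. The camera helpers and the frame_ready flag are hypothetical stand-ins for the project's actual HAL calls and DCMIPP interrupt callback; only the CMSIS cache-maintenance call and the overall sequence are meant literally. The D-cache invalidation is needed because the DCMIPP and the NPU write memory directly over AXI, behind the CPU's data cache.

```c
#include <stdint.h>
#include "stm32n6xx.h"                      /* CMSIS device header (SCB cache maintenance) */

/* Hypothetical buffers and helpers; the real ones live in the project firmware. */
extern uint8_t       nn_in[192 * 192 * 3];  /* npuRAM4, written by the DCMIPP NN pipe      */
extern float         nn_out[48 * 48 * 13];  /* npuRAM5, written by the NPU                 */
extern volatile int  frame_ready;           /* set in the DCMIPP frame-complete interrupt  */
void camera_nn_pipe_snapshot(void);         /* hypothetical: trigger one cropped NN frame  */
void run_network(void);                     /* hypothetical wrapper around the Neural-ART  */
                                            /* runtime entry point (LL_ATON_RT_Main)       */
void decode_and_draw_skeleton(const float *heatmaps);

void app_loop(void)
{
    for (;;) {
        /* 1. Trigger one snapshot on the NN pipe; the display pipe keeps running
         *    continuously in the background with no CPU involvement. */
        frame_ready = 0;
        camera_nn_pipe_snapshot();
        while (!frame_ready) { /* wait for the DCMIPP interrupt */ }

        /* 2. The DCMIPP wrote nn_in behind the CPU's back: drop stale cache lines
         *    before anything on the CPU (e.g. a SW epoch) reads that region. */
        SCB_InvalidateDCache_by_Addr((void *)nn_in, sizeof(nn_in));

        /* 3. Run inference: EC epochs on the NPU, SW epochs on the Cortex-M55. */
        run_network();

        /* 4. nn_out was written by the NPU over AXI: invalidate before reading it. */
        SCB_InvalidateDCache_by_Addr((void *)nn_out, sizeof(nn_out));
        decode_and_draw_skeleton(nn_out);
    }
}
```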
Section 2.7
NPU Offload — What the Numbers Mean

Now that we understand the NPU architecture, we can predict — and explain — the results we will measure in the Case Studies chapter. The key metric is the NPU offload rate: the fraction of computational epochs executed on the NPU hardware rather than falling back to the CPU.

ST Edge AI Core divides the model computation into a sequence of epochs — not training epochs, but execution blocks. Each epoch is assigned to either the NPU (EC epoch) or the CPU (SW epoch) depending on whether the NPU hardware supports that operation.

| Model | Architecture | Total epochs | EC epochs (NPU) | SW epochs (CPU) |
|---|---|---|---|---|
| MoveNet Lightning | CNN | 75 | 71 (94.7%) | 4 |
| YOLOv8n-pose | CNN | 149 | 131 (87.9%) | 18 |
| TinyBERT | Transformer | 270 | 174 (64.4%) | 96 (35.6%): Softmax, LayerNorm, MatMul on CPU |

The pattern is clear and directly explained by the architecture:

MoveNet — 94.7%

Pure CNN backbone. Only 4 CPU epochs: 3× bilinear resize in the decoder and 1× dequantize output. The NPU was designed exactly for this.

YOLOv8n — 87.9%

CNN backbone (all NPU) + post-processing head (18 CPU epochs: reshape, softmax, transpose, slice). Heavy convolutions on NPU — real-time achieved.

TinyBERT — 64.4%

96 CPU epochs across every Transformer block: Softmax (attention), LayerNorm, MatMul. These are the core operations — not post-processing. Real-time impossible.

Key architectural insight
The Neural-ART NPU is not just CNN-optimised — it is CNN-designed. The CONVACC units exploit the spatial regularity of convolution: a 3×3 filter slides across the image in a predictable, sequential pattern that the STRENG stream engines can prefetch perfectly. Transformer attention requires global token interactions — every token must attend to every other token simultaneously — a fundamentally different access pattern that defeats the stream engine architecture. This is why 35% of TinyBERT's epochs fall back to the CPU, and why the latency exceeds 100 ms despite having 3× fewer total MACCs per layer than the CNN models.
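To see why the access pattern is so different, compare the loop structure of the attention-score computation, sketched below in plain C. Every query token reads every key token, so there is no small sliding window for a stream engine to prefetch. This is a didactic float sketch of the Q·Kᵀ product for one head only (no scaling, masking, or softmax), not the TinyBERT kernel, and the dimensions are illustrative.

```c
/* Didactic sketch of the Q * K^T attention-score matrix for one head.
 * score[i][j] needs all D values of query i and key j, and every query
 * attends to every key: data reuse is global rather than a local 3x3
 * window, which is exactly what defeats the STRENG prefetch model. */
#define SEQ 128   /* sequence length in tokens (illustrative)    */
#define D    64   /* per-head embedding dimension (illustrative) */

void attention_scores(const float Q[SEQ][D], const float K[SEQ][D],
                      float score[SEQ][SEQ])
{
    for (int i = 0; i < SEQ; i++) {          /* each query token...               */
        for (int j = 0; j < SEQ; j++) {      /* ...interacts with every key       */
            float acc = 0.0f;
            for (int d = 0; d < D; d++)
                acc += Q[i][d] * K[j][d];
            score[i][j] = acc;               /* softmax over j follows on the CPU */
        }
    }
}
```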