
3.3 — ST Edge AI Core

The Python-to-C Bridge

ST Edge AI Core is the converter tool that takes a quantized neural network model (.tflite, .onnx) and produces optimised C source files, assigns each layer to either the NPU or the CPU, and generates a detailed profiling report. This is where Python ends and C begins.

At a glance: ST Edge AI Core v2.1.0 • 75 epochs (71 EC, 4 SW) • 2.924 MB of weights in OctoFlash

What is ST Edge AI Core?

A neural network model file — even a quantized INT8 .tflite — cannot run directly on a microcontroller. It is a portable format that needs to be compiled into something a microcontroller can actually execute. ST Edge AI Core is the tool that performs this translation.

It does four things in one pass:

1. Graph optimization

Fuses adjacent operations (Conv + BatchNorm + ReLU into a single kernel), removes redundant transposes, and reformats tensor layouts for the target hardware (see the folding sketch after this list).

2. NPU epoch assignment

Analyses every layer and assigns it to either an EC epoch (executed on the Neural-ART NPU hardware) or a SW epoch (executed on the Cortex-M55 CPU) based on NPU hardware support.

3. Memory allocation

Places weights in OctoFlash and assigns activation buffers to the correct SRAM banks (npuRAM4, npuRAM5) based on the .mpool memory configuration file for the STM32N6570-DK.

4. C code generation

Produces network.c (layer execution schedule), network_data_params.c (weight arrays in flash), and all supporting headers — ready to compile with STM32CubeIDE.
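
What "fusing Conv + BatchNorm + ReLU" means arithmetically: the BatchNorm scale and shift are folded into the convolution's weights and bias offline, so no separate BatchNorm layer survives into the generated code. The following minimal C sketch shows the folding arithmetic (function name and data layout are our own illustration, not ST Edge AI Core internals):

/* Illustrative Conv + BatchNorm folding: not ST Edge AI Core's actual
 * code, just the arithmetic a graph-optimization pass applies to the
 * float graph before quantization. Per output channel oc:
 *   w'[oc] = w[oc] * gamma[oc] / sqrt(var[oc] + eps)
 *   b'[oc] = (b[oc] - mean[oc]) * gamma[oc] / sqrt(var[oc] + eps) + beta[oc] */
#include <math.h>

void fold_bn_into_conv(float *w, float *b, int out_ch, int w_per_ch,
                       const float *gamma, const float *beta,
                       const float *mean, const float *var, float eps)
{
    for (int oc = 0; oc < out_ch; oc++) {
        float s = gamma[oc] / sqrtf(var[oc] + eps);  /* per-channel BN scale */
        for (int i = 0; i < w_per_ch; i++)
            w[oc * w_per_ch + i] *= s;               /* scale folded into weights */
        b[oc] = (b[oc] - mean[oc]) * s + beta[oc];   /* shift folded into bias */
    }
    /* The trailing ReLU then fuses as a clamp on the conv output, so
     * Conv + BatchNorm + ReLU collapses to a single kernel. */
}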

How it is invoked — the actual command

You never call ST Edge AI Core manually — common_deploy.py calls it automatically during deployment. But understanding the command it runs helps make sense of the output. Here is the exact command used for MoveNet:

stedgeai generate \
  --target stm32n6 \
  -m st_movenet_lightning_heatmaps_192_int8_pc.tflite \
  --st-neural-art default@user_neuralart_STM32N6570-DK.json \
  --input-data-type uint8 \
  --inputs-ch-position chlast \
  --load-mpool stm32n6-app2_STM32N6570-DK.mpool \
  --output experiments_outputs/2025_05_27_14_46_07/ \
  -O3 --Oalt-sched --Ocache-opt --Os \
  --enable-epoch-controller
Argument                             Meaning
--target stm32n6                     Target the STM32N6 NPU backend (Neural-ART)
--st-neural-art                      Loads the NPU configuration JSON for the STM32N6570-DK board
--load-mpool                         Memory pool config: tells the compiler which SRAM banks to use and at which addresses
--input-data-type uint8              Camera frames arrive as uint8 — no CPU dequantization needed at input
--enable-epoch-controller            Generates the epoch schedule that the NPU runtime uses to interleave EC and SW epochs
-O3 --Oalt-sched --Ocache-opt --Os   Full optimisation: alternative scheduling, cache-aware layout, minimum code size

Epoch breakdown — MoveNet Lightning (real data)

The following is extracted directly from the network_generate_report.txt produced by ST Edge AI Core v2.1.0 when we deployed MoveNet Lightning (192×192, INT8, 13 keypoints). 75 total epochs: 71 EC (NPU) and 4 SW (CPU).

Epoch map: epochs 1–58 (the entire backbone) run as EC on the NPU. Overall, 71 of the 75 epochs (94.7%) are EC; the 4 SW epochs are 59, 63, 67 (Resize) and 74 (DequantizeLinear).

The 4 SW epochs — why they fall back to CPU

Epoch      Operation           Why on CPU?
epoch 59   Resize (bilinear)   Bilinear interpolation requires floating-point position arithmetic. The Neural-ART NPU has no bilinear resize unit.
epoch 63   Resize (bilinear)   Same — second decoder upsample stage.
epoch 67   Resize (bilinear)   Same — third decoder upsample stage.
epoch 74   DequantizeLinear    Converts INT8 output heatmaps to float32 for the C postprocessor. A one-time scale+offset operation at the very end of the network.

Key insight: these 4 SW epochs are not in the backbone — they are all in the decoder (upsampling) and output stage. The entire MobileNetV2 backbone (epochs 1–58) runs 100% on the NPU. This is why MoveNet achieves 94.7% offload despite having CPU fallback operations.
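
The floating-point nature of bilinear resize is easiest to see in code. Below is a minimal single-channel sketch of the per-pixel arithmetic (our own illustration, not the ll_sw_resize_bilinear source): every output pixel needs fractional source coordinates and a four-neighbour float blend, which the Neural-ART integer convolution datapath does not provide.

/* Minimal bilinear resize sketch, single channel, half-pixel centres.
 * Illustrative only; the actual ll_sw implementation differs. */
void resize_bilinear_1ch(const float *in, float *out,
                         int in_h, int in_w, int out_h, int out_w)
{
    float sy = (float)in_h / out_h, sx = (float)in_w / out_w;
    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            float fy = (y + 0.5f) * sy - 0.5f;   /* fractional source row */
            float fx = (x + 0.5f) * sx - 0.5f;   /* fractional source col */
            if (fy < 0) fy = 0;
            if (fx < 0) fx = 0;
            int   y0 = (int)fy,  x0 = (int)fx;
            float wy = fy - y0,  wx = fx - x0;   /* float blend weights   */
            int   y1 = y0 + 1 < in_h ? y0 + 1 : y0;
            int   x1 = x0 + 1 < in_w ? x0 + 1 : x0;
            out[y * out_w + x] =
                (1 - wy) * (1 - wx) * in[y0 * in_w + x0] +
                (1 - wy) * wx       * in[y0 * in_w + x1] +
                wy       * (1 - wx) * in[y1 * in_w + x0] +
                wy       * wx       * in[y1 * in_w + x1];
        }
    }
}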

Memory allocation report — real data

ST Edge AI Core places every buffer at a specific physical address. The following is the exact memory layout generated for MoveNet:

Region            Address range           Used       Available     Content
cpuRAM2           0x34100000–0x34200000   864 KB     1 MB          CPU activations (SW epoch intermediate buffers)
npuRAM4           0x34270000–0x342E0000   378 KB     448 KB        nn_in: camera frame 192×192×3 input to NPU
npuRAM5           0x342E0000–0x34350000   432 KB     448 KB        nn_out: heatmaps 48×48×13 output from NPU
octoFlash         0x70380000–0x7066C810   2.924 MB   61 MB         Weights: 2,355,908 bytes of INT8 parameters
npuRAM3, npuRAM6                          0 B        448 KB each   Not used by MoveNet — available for other models
hyperRAM          0x90000000              0 B        16 MB         Not used by NPU — used by LCD framebuffers

Total footprint — MoveNet Lightning 192×192 INT8: weights 2.924 MB (OctoFlash) + activations 1.635 MB (SRAM) = 4.559 MB total. Flash code: 13,342 bytes RT + 330,597 bytes rodata.
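
In the firmware, fixed addresses like these usually surface as dedicated linker sections that the I/O buffers are pinned into. A minimal sketch of that pattern, assuming hypothetical section names (the real placement is driven by the .mpool file, the generated code, and the project's linker script):

/* Hypothetical placement sketch: the section names ".npuRAM4" and
 * ".npuRAM5" are illustrative, not taken from the real linker script. */
#include <stdint.h>

__attribute__((section(".npuRAM4")))
static uint8_t nn_in[192 * 192 * 3];       /* camera frame, NPU input         */

__attribute__((section(".npuRAM5")))
static float nn_out[48 * 48 * 13];         /* float32 heatmaps after epoch 74 */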

Layer-by-layer report — first 10 layers (real data)

The full report contains 87 layer entries. Here are the first 10 to show the format and what it reveals. The model starts with a QUANTIZE conversion layer, then proceeds through the MobileNetV2 backbone with alternating CONV_2D and DEPTHWISE_CONV_2D blocks.

# From network_generate_report.txt — ST Edge AI Core v2.1.0
# model: st_movenet_lightning_heatmaps_192_int8_pc.tflite

m_id layer oshape param/size macc
──────────────────────────────────────────────────────────────────
0     serving_default_input [1,192,192,3]
     conversion_0 (QUANTIZE) [1,192,192,3] 221,184
──────────────────────────────────────────────────────────────────
1     conv2d_1 (CONV_2D) [1,96,96,32] 896/992 7,962,656
     nl_1_nl (CONV_2D) [1,96,96,32] 294,912
──────────────────────────────────────────────────────────────────
2     conv2d_2 (DEPTHWISE) [1,96,96,32] 320/416 2,654,240
     nl_2_nl (DEPTHWISE) [1,96,96,32] 294,912
──────────────────────────────────────────────────────────────────
3     conv2d_3 (CONV_2D) [1,96,96,16] 528/576 4,718,608
──────────────────────────────────────────────────────────────────
... (83 more layers — all EC except epochs 59, 63, 67, 74)
──────────────────────────────────────────────────────────────────
85   conv2d_85 (CONV_2D) [1,48,48,13] 1,261/1,300 2,875,405
86   nl_86 (LOGISTIC/Sigmoid) [1,48,48,13] 299,520
87   conversion_87 (DEQUANTIZE) [1,48,48,13] 59,904 ← SW epoch 74
──────────────────────────────────────────────────────────────────
Total: macc=242,937,509 weights=2,355,908 activations=1,635,000
Reading the table: m_id is the layer index; oshape is the output tensor shape [batch, height, width, channels]; param/size is weights count / bytes (INT8); macc is the number of multiply-accumulate operations — the dominant cost of convolution. The total of 242.9 million MACs completes in 22 ms on the NPU (≈11 GOPS effective throughput, well below the 600 GOPS peak due to memory bandwidth limits).
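
The macc column can be sanity-checked by hand with the standard convolution cost formula, MACs = out_h · out_w · out_ch · k_h · k_w · in_ch. A small worked check for conv2d_1 (3×3 stride-2 convolution, 3 → 32 channels, 96×96 output; our own arithmetic, values taken from the report above):

/* Worked check of the macc and param columns for conv2d_1. */
#include <stdio.h>

int main(void)
{
    long macs   = 96L * 96 * 32 * 3 * 3 * 3;  /* = 7,962,624 MACs              */
    long params = 3 * 3 * 3 * 32 + 32;        /* 864 weights + 32 biases = 896 */
    printf("conv2d_1: %ld MACs, %ld params\n", macs, params);
    return 0;
}

The param count matches the report's 896 exactly; the report's 7,962,656 MACs differ by 32, which appears to be one extra counted operation per output channel.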

What ST Edge AI Core produces — generated files

After a successful run, the output directory contains two categories of files: the C source files that will be compiled into the firmware, and the analysis reports that document what was generated.

C source files (compiled into firmware)
generated/
├── network.c         ← epoch schedule
├── network_data.c    ← weight arrays
├── network_data_params.c
├── Inc/network.h
├── Inc/network_data.h
├── Inc/layers.h
├── Inc/ai_lite.h
├── Inc/ai_lite_interface.h
└── Inc/core_datatypes.h
    (+ 16 more header files)
Analysis reports (for documentation)
output_dir/
├── network_generate_report.txt
    ← layers + epochs + memory
├── network_c_info.json
    ← per-node cycle estimates
├── stm32ai_main.log
├── *_OE_3_2_0.onnx   ← optimised graph
└── *_Q.json         ← quantization params

Inside network.c — what the generated code actually looks like

network.c is 5,882 lines of auto-generated C. You will never edit it — but understanding its three-part structure explains exactly how the NPU executes your model at runtime. This is the level of detail that distinguishes a deployment you truly understand from one you just ran a script for.

Part 1 — Memory pool declarations (the comment header)

The first ~30 lines are comments documenting every memory pool the compiler considered and its decision. Each pool has a score — the compiler's fitness metric — and the pool with the highest score for each tensor wins:

/* global pool 8 is 2.92 MB */
/* name=octoFlash offset=0x70380000 READ_ONLY LATENCY=HIGH cacheable=ON score=50 */

/* global pool 1 is 432.00 KB */
/* name=npuRAM5 offset=0x342e0000 THROUGHPUT=HIGH LATENCY=LOW cacheable=OFF score=94 */

/* global pool 2 is 378.00 KB */
/* name=npuRAM4 offset=0x34270000 THROUGHPUT=HIGH LATENCY=LOW cacheable=OFF score=94 */

npuRAM4/5 score 94 — fast SRAM directly connected to the NPU, zero cache overhead. OctoFlash scores 50 — high latency, but the only option for 2.9 MB of weights. The compiler places activations in npuRAM and weights in flash accordingly.

Part 2 — EC epoch blobs (network_ecblobs.h)

This is the most unusual part. The NPU does not execute layers through C function calls. Instead, each EC epoch is encoded as a binary blob — a sequence of 64-bit words that the NPU epoch controller interprets as microcode, loading them directly into hardware registers:

/* Epoch 1 — first Conv2D block, MobileNetV2 stem */
static const uint64_t _ec_blob_1 [] =
{
  0x0000b686ca057a7aUL, /* CONVACC unit config */
  0x5c00004200802241UL, /* weight address + stride */
  0x0001b000342e0000UL, /* activation buffer (npuRAM5) */
  ... /* 71 blobs total, one per EC epoch */
};

Each blob encodes the complete configuration for one NPU epoch: CONVACC unit selection, weight buffer address in OctoFlash, activation buffer in npuRAM, kernel size, stride, padding. The epoch controller loads these directly into NPU registers with zero CPU involvement — this is why EC epochs are orders of magnitude faster than SW epochs.
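
Conceptually, the resulting 75-epoch schedule is an ordered interleaving of the two epoch kinds. The following is a mental model only, with entirely hypothetical names; the real dispatch is implemented inside the LL_ATON runtime and the hardware epoch controller:

/* Conceptual model of the epoch schedule: not real runtime code.
 * All names below are hypothetical. */
#include <stdint.h>
#include <stdbool.h>

typedef void (*sw_epoch_fn)(void);

extern const uint64_t   *ec_blobs[];   /* per-epoch NPU microcode blobs */
extern const sw_epoch_fn sw_fns[];     /* C kernels for SW epochs       */
extern const bool        is_ec[];      /* EC vs SW flag per epoch       */
extern void npu_run_blob(const uint64_t *blob);  /* hypothetical        */

static void run_schedule(int n_epochs)   /* n_epochs = 75 for MoveNet   */
{
    for (int e = 0; e < n_epochs; e++) {
        if (is_ec[e])
            npu_run_blob(ec_blobs[e]);   /* blob into NPU registers, wait */
        else
            sw_fns[e]();                 /* Cortex-M55 runs the C kernel  */
    }
}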

Part 3 — SW epoch C function calls

For the 4 SW epochs, the compiler generates standard C calls into the ll_sw library, executed by the Cortex-M55 with Helium SIMD (enabled by the --mvei flag in the compiler command):

/* Epochs 59, 63, 67 — Resize bilinear (decoder upsample) */
ll_sw_resize_bilinear(input_ptr, output_ptr,
  in_h, in_w, out_h, out_w, channels);

/* Epoch 74 — DequantizeLinear (INT8 → float32 for postprocessor) */
ll_sw_dequantize(int8_ptr, float32_ptr, scale, zero_point, size);
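
What epoch 74 computes is the affine INT8 mapping, real = scale × (q − zero_point), applied element-wise. A minimal sketch of that operation (our own illustration, not the ll_sw library source):

/* Affine INT8 -> float32 dequantization sketch, per-tensor scale/zero-point. */
#include <stdint.h>
#include <stddef.h>

static void dequantize_int8(const int8_t *q, float *out,
                            float scale, int32_t zero_point, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = scale * (float)(q[i] - zero_point);   /* subtract, then scale */
}

For the 48×48×13 heatmap tensor this is n = 29,952 elements; counted as two operations per element (subtract, multiply), that lines up with the 59,904 the layer report charges to conversion_87.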

After epoch 74, the output is a float32 heatmap tensor in npuRAM5 — ready for the C firmware postprocessor in display_spe.c to decode the argmax coordinates and draw the skeleton on the LCD.

Connection to Part 2 — your handwritten C firmware: network.c is auto-generated and not in the stm32_docs source tree. The handwritten firmware calls LL_ATON_RT_Main() in main.c, which triggers the epoch controller to execute all 75 blobs in sequence. The only generated file your firmware includes directly is app_config.h — documented below and in Part 2.
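
Tying this together, the runtime flow in main.c reduces to a capture → infer → decode loop around LL_ATON_RT_Main(). A heavily hedged sketch (all helper names are hypothetical, and the exact LL_ATON_RT_Main() signature depends on the runtime version; see Part 2 for the real firmware):

/* Hypothetical main-loop sketch. Only LL_ATON_RT_Main() is named in the
 * text above; its signature here is an assumption, and every helper and
 * the NN_Instance_Default handle are illustrative. */
#include <stdint.h>

extern uint8_t nn_in[];                          /* 192×192×3 frame, npuRAM4   */
extern float   nn_out[];                         /* 48×48×13 heatmaps, npuRAM5 */
extern void   *NN_Instance_Default;              /* hypothetical handle        */

extern void system_and_npu_init(void);           /* clocks, DCMIPP, NPU        */
extern void capture_frame_to(uint8_t *buf);
extern void decode_and_draw(const float *heatmaps);
extern void LL_ATON_RT_Main(void *nn_instance);  /* signature assumed          */

int main(void)
{
    system_and_npu_init();
    while (1) {
        capture_frame_to(nn_in);                 /* DCMIPP writes the frame     */
        LL_ATON_RT_Main(&NN_Instance_Default);   /* runs all 75 epochs in order */
        decode_and_draw(nn_out);                 /* argmax decode + skeleton    */
    }
}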

The C_header — app_config.h (real file)

Separately from the network C files, gen_h_file.py generates app_config.h — the configuration header that tells the C firmware everything it needs to know about the model. Here is the actual file generated for our MoveNet deployment:

/* Generated by gen_h_file.py — do not edit manually */

#define POSTPROCESS_TYPE  POSTPROCESS_SPE_MOVENET_UF
#define NN_HEIGHT          (192)
#define NN_WIDTH           (192)
#define NN_BPP             3
#define COLOR_MODE         COLOR_RGB
#define ASPECT_RATIO_MODE ASPECT_RATIO_CROP

/* Post-processing */
#define AI_POSE_PP_CONF_THRESHOLD (0.4)
#define AI_POSE_PP_POSE_KEYPOINTS_NB (13)
#define AI_SPE_MOVENET_POSTPROC_HEATMAP_WIDTH (NN_WIDTH/4)
#define AI_SPE_MOVENET_POSTPROC_HEATMAP_HEIGHT (NN_HEIGHT/4)
#define AI_SPE_MOVENET_POSTPROC_NB_KEYPOINTS (13)
Define                         Value                        Used by
POSTPROCESS_TYPE               POSTPROCESS_SPE_MOVENET_UF   Selects the argmax heatmap decoder in display_spe.c
NN_HEIGHT / NN_WIDTH           192, 192                     DCMIPP crop window and NPU input buffer sizing
AI_POSE_PP_CONF_THRESHOLD      0.4                          Keypoints below 40% confidence are not drawn
AI_POSE_PP_POSE_KEYPOINTS_NB   13                           Selects the 13-keypoint skeleton (no head landmarks)
HEATMAP_WIDTH / HEIGHT         48, 48 (= 192/4)             Output heatmap resolution — argmax decoded per channel
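
These defines drive the final decoding step. As an illustration of the per-channel argmax such a decoder performs, here is a minimal sketch built on the defines above (our own code; the real display_spe.c additionally rescales the 48×48 grid coordinates to the display):

/* Per-channel argmax over channel-last [48][48][13] float heatmaps.
 * Illustrative sketch, not the display_spe.c source. */
#include <stdio.h>

#define CONF_THRESHOLD  (0.4f)   /* AI_POSE_PP_CONF_THRESHOLD      */
#define NB_KEYPOINTS    (13)     /* AI_POSE_PP_POSE_KEYPOINTS_NB   */
#define HM_W            (48)     /* heatmap width  (NN_WIDTH / 4)  */
#define HM_H            (48)     /* heatmap height (NN_HEIGHT / 4) */

void decode_keypoints(const float *hm)
{
    for (int k = 0; k < NB_KEYPOINTS; k++) {
        float best = -1.0f;
        int   bx = 0, by = 0;
        for (int y = 0; y < HM_H; y++)
            for (int x = 0; x < HM_W; x++) {
                float v = hm[(y * HM_W + x) * NB_KEYPOINTS + k];
                if (v > best) { best = v; bx = x; by = y; }
            }
        if (best >= CONF_THRESHOLD)   /* below 40% confidence: not drawn */
            printf("keypoint %d at (%d, %d), conf %.2f\n", k, bx, by, best);
    }
}
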
Where ST Edge AI Core fits in the full pipeline
Zoo: st_movenet_lightning_heatmaps_192_int8_pc.tflite
  ↓ common_deploy.py calls stedgeai generate
ST Edge AI Core: graph opt → epoch assignment → memory alloc → C gen
  ↓ outputs to experiments_outputs/
network.c + network_data.c + app_config.h + network_generate_report.txt
  ↓ STM32CubeIDE compiles and flashes
STM32N6570-DK: 22 ms inference, 94.7% NPU offload