STM32N6 NPU Deployment — Politecnico di Milano
Version 1.0
Documentation for Neural Network Deployment on STM32N6 NPU - Politecnico di Milano 2024-2025
ST Edge AI Core is the converter tool that takes a quantized neural network
model (.tflite or .onnx), produces optimised C source files, assigns each
layer to either the NPU or the CPU, and generates a detailed profiling
report. This is where Python ends and C begins.
A neural network model file — even a quantized INT8 .tflite —
cannot run directly on a microcontroller. It is a portable format that needs
to be compiled into something a microcontroller can actually execute.
ST Edge AI Core is the tool that performs this translation.
It does four things in one pass:
1. Fuses adjacent operations (Conv + BatchNorm + ReLU into a single kernel), removes redundant transposes, and reformats tensor layouts for the target hardware.
2. Analyses every layer and assigns it to either an EC epoch (executed on the Neural-ART NPU hardware) or a SW epoch (executed on the Cortex-M55 CPU), based on NPU hardware support.
3. Places weights in OctoFlash and assigns activation buffers to the correct SRAM banks (npuRAM4, npuRAM5), based on the .mpool memory configuration file for the STM32N6570-DK.
4. Produces network.c (the layer execution schedule), network_data_params.c (weight arrays in flash), and all supporting headers, ready to compile with STM32CubeIDE.
You never call ST Edge AI Core manually; common_deploy.py invokes it
automatically during deployment. But understanding the command it runs helps
make sense of the output. The invocation used for MoveNet is sketched below,
followed by a table explaining each argument.
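A representative reconstruction of that invocation, assuming the standard `stedgeai generate` entry point; the model filename, Neural-ART profile JSON, and .mpool filename are placeholders, and the exact flag syntax can vary between ST Edge AI Core versions:

```
stedgeai generate \
  --model movenet_lightning_int8.tflite \
  --target stm32n6 \
  --st-neural-art default@neural_art.json \
  --load-mpool stm32n6570-dk.mpool \
  --input-data-type uint8 \
  --enable-epoch-controller \
  -O3 --Oalt-sched --Ocache-opt
```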
| Argument | Meaning |
|---|---|
| --target stm32n6 | Target the STM32N6 NPU backend (Neural-ART) |
| --st-neural-art | Loads the NPU configuration JSON for the STM32N6570-DK board |
| --load-mpool | Memory pool config: tells the compiler which SRAM banks to use and at which addresses |
| --input-data-type uint8 | Camera frames arrive as uint8 — no CPU dequantization needed at input |
| --enable-epoch-controller | Generates the epoch schedule that the NPU runtime uses to interleave EC and SW epochs |
| -O3 --Oalt-sched --Ocache-opt | Full optimisation: alternative scheduling, cache-aware layout, minimum code size |
The following is extracted directly from the
network_generate_report.txt produced by ST Edge AI Core v2.1.0
when we deployed MoveNet Lightning (192×192, INT8, 13 keypoints).
75 total epochs: 71 EC (NPU) and 4 SW (CPU).
| Epoch | Operation | Why on CPU? |
|---|---|---|
| epoch 59 | Resize (bilinear) | Bilinear interpolation requires floating-point position arithmetic. The Neural-ART NPU has no bilinear resize unit. |
| epoch 63 | Resize (bilinear) | Same — second decoder upsample stage. |
| epoch 67 | Resize (bilinear) | Same — third decoder upsample stage. |
| epoch 74 | DequantizeLinear | Converts INT8 output heatmaps to float32 for the C postprocessor. A one-time scale+offset operation at the very end of the network. |
Key insight: these 4 SW epochs are not in the backbone — they are all in the decoder (upsampling) and output stage. The entire MobileNetV2 backbone (epochs 1–58) runs 100% on the NPU. This is why MoveNet achieves 94.7% offload despite having CPU fallback operations.
ST Edge AI Core places every buffer at a specific physical address. The following is the exact memory layout generated for MoveNet:
| Region | Address range | Used | Available | Content |
|---|---|---|---|---|
| cpuRAM2 | 0x34100000–0x34200000 | 864 KB | 1 MB | CPU activations (SW epoch intermediate buffers) |
| npuRAM4 | 0x34270000–0x342E0000 | 378 KB | 448 KB | nn_in: camera frame 192×192×3 input to NPU |
| npuRAM5 | 0x342E0000–0x34350000 | 432 KB | 448 KB | nn_out: heatmaps 48×48×13 output from NPU |
| octoFlash | 0x70380000–0x7066C810 | 2.924 MB | 61 MB | Weights: 2,355,908 bytes of INT8 parameters |
| npuRAM3, npuRAM6 | — | 0 B | 448 KB each | Not used by MoveNet — available for other models |
| hyperRAM | 0x90000000 | 0 B | 16 MB | Not used by NPU — used by LCD framebuffers |
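These pools mirror the .mpool file passed via --load-mpool. A simplified sketch of what such a descriptor conveys, using the addresses from the table above; the field names are illustrative, not the exact ST schema:

```json
{
  "mempools": [
    { "name": "cpuRAM2",   "address": "0x34100000", "size": "1MB"   },
    { "name": "npuRAM4",   "address": "0x34270000", "size": "448KB" },
    { "name": "npuRAM5",   "address": "0x342E0000", "size": "448KB" },
    { "name": "octoFlash", "address": "0x70380000", "size": "61MB"  }
  ]
}
```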
The full report contains 87 layer entries. Here are the first 10 to show the format and what it reveals. The model starts with a QUANTIZE conversion layer, then proceeds through the MobileNetV2 backbone with alternating CONV_2D and DEPTHWISE_CONV_2D blocks.
- m_id is the layer index;
- oshape is the output tensor shape [batch, height, width, channels];
- param/size is the weight count / bytes (INT8);
- macc is the number of multiply-accumulate operations, the dominant cost of convolution.
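Illustratively, the rows look like this; layer names follow the description above, and the numeric values are elided placeholders, not figures from the report:

```
 m_id  name                oshape           param/size   macc
    0  QUANTIZE            [1,192,192,3]    ...          ...
    1  CONV_2D             [...]            ...          ...
    2  DEPTHWISE_CONV_2D   [...]            ...          ...
  ...
```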
The total of 242.9 million MACs completes in 22 ms on the NPU:
242.9 M MACs / 0.022 s ≈ 11 × 10⁹ MAC/s, i.e. ≈11 GOPS effective throughput,
well below the 600 GOPS peak due to memory bandwidth limits.
After a successful run, the output directory contains two categories of files: the C source files that will be compiled into the firmware, and the analysis reports that document what was generated.
network.c is 5,882 lines of auto-generated C. You will never edit it —
but understanding its three-part structure explains exactly how the NPU executes
your model at runtime. This is the level of detail that distinguishes a deployment
you truly understand from one you just ran a script for.
The first ~30 lines are comments documenting every memory pool the compiler
considered and its decision. Each pool has a score — the compiler's
fitness metric — and the pool with the highest score for each tensor wins:
npuRAM4/5 score 94 — fast SRAM directly connected to the NPU, zero cache overhead. OctoFlash scores 50 — high latency, but the only option for 2.9 MB of weights. The compiler places activations in npuRAM and weights in flash accordingly.
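For a feel of what that header looks like, a rough illustration; the comment layout and tensor grouping are invented, while the two scores are the ones quoted above:

```c
/* Memory pool decisions (illustrative, not verbatim tool output):
 *   activation buffers -> npuRAM4 / npuRAM5   score 94
 *   weight arrays      -> octoFlash           score 50
 */
```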
This is the most unusual part. The NPU does not execute layers through C function calls. Instead, each EC epoch is encoded as a binary blob — a sequence of 64-bit words that the NPU epoch controller interprets as microcode, loading them directly into hardware registers:
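A sketch of that encoding as it appears in network.c, with the array name and word values invented for illustration:

```c
#include <stdint.h>

/* EC epoch microcode: 64-bit configuration words streamed by the epoch
   controller into Neural-ART registers (all values illustrative). */
static const uint64_t g_epoch12_blob[] = {
    0x0000000034270000ULL,  /* activation buffer base in npuRAM4    */
    0x0000000070380000ULL,  /* weight base in octoFlash             */
    0x0000000000030301ULL,  /* e.g. 3x3 kernel, stride 1, padding 1 */
    /* ... */
};
```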
Each blob encodes the complete configuration for one NPU epoch: CONVACC unit selection, weight buffer address in OctoFlash, activation buffer in npuRAM, kernel size, stride, padding. The epoch controller loads these directly into NPU registers with zero CPU involvement — this is why EC epochs are orders of magnitude faster than SW epochs.
For the 4 SW epochs, the compiler generates standard C calls into the
ll_sw library, executed by the Cortex-M55 with Helium SIMD
(enabled by the --mvei flag in the compiler command):
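A sketch of the call pattern, with a hypothetical function name and signature; the actual ll_sw entry points and the buffer addresses they receive are bound by the generated code at build time:

```c
#include <stdint.h>

/* Hypothetical ll_sw entry point -- not the actual generated API. */
extern void ll_sw_resize_bilinear(const int8_t *in, int8_t *out,
                                  int in_h, int in_w,
                                  int out_h, int out_w, int channels);

static void run_sw_epoch_59(const int8_t *in, int8_t *out)
{
    /* First decoder upsample on the Cortex-M55, Helium SIMD via --mvei;
       the geometry values here are illustrative, not from the report. */
    ll_sw_resize_bilinear(in, out, 24, 24, 48, 48, 13);
}
```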
After epoch 74, the output is a float32 heatmap tensor in npuRAM5 —
ready for the C firmware postprocessor in
display_spe.c to decode
the argmax coordinates and draw the skeleton on the LCD.
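The decode step itself is simple. A self-contained sketch of per-channel argmax over the 48×48×13 float32 heatmap; the HWC buffer layout and the function name are assumptions, not the actual display_spe.c code:

```c
#define HM_W 48   /* heatmap width  (NN_WIDTH / 4)  */
#define HM_H 48   /* heatmap height (NN_HEIGHT / 4) */
#define HM_C 13   /* keypoint channels              */

/* Per-channel argmax over a float32 heatmap in HWC layout:
   hm[(y * HM_W + x) * HM_C + c]. Writes the best cell and its score
   for each keypoint channel. */
static void decode_keypoints(const float *hm,
                             int kp_x[HM_C], int kp_y[HM_C],
                             float kp_conf[HM_C])
{
    for (int c = 0; c < HM_C; c++) {
        float best = -1.0f;   /* scores are confidences in [0, 1] */
        kp_x[c] = 0;
        kp_y[c] = 0;
        for (int y = 0; y < HM_H; y++) {
            for (int x = 0; x < HM_W; x++) {
                float v = hm[(y * HM_W + x) * HM_C + c];
                if (v > best) { best = v; kp_x[c] = x; kp_y[c] = y; }
            }
        }
        kp_conf[c] = best;  /* compared against AI_POSE_PP_CONF_THRESHOLD */
    }
}
```

Since the heatmap is 192/4 on each side, the decoded (x, y) cell scales back to input coordinates by multiplying by 4.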
network.c is auto-generated and not in the stm32_docs source tree.
The handwritten firmware calls LL_ATON_RT_Main() in
main.c, which triggers the epoch controller to
execute all 75 blobs in sequence.
The only generated file your firmware includes directly is
app_config.h — documented below and in Part 2.
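A minimal sketch of that call site, assuming the Default instance naming used by ST's Neural-ART application templates; whether LL_ATON_RT_Main takes the network instance as an argument depends on the runtime version, so verify the symbols against your generated headers:

```c
#include "ll_aton_runtime.h"   /* ST Neural-ART runtime header */

LL_ATON_DECLARE_NAMED_NN_INSTANCE_AND_INTERFACE(Default);

int main(void)
{
    /* ... clock, DCMIPP camera, and LCD initialisation elided ... */
    for (;;) {
        /* One full inference: the epoch controller walks all 75 epoch
           blobs in sequence, EC epochs on the NPU, SW epochs on the M55. */
        LL_ATON_RT_Main(&NN_Instance_Default);
        /* ... postprocess heatmaps, draw the skeleton (display_spe.c) ... */
    }
}
```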
Separately from the network C files, gen_h_file.py generates
app_config.h, the configuration header that tells the C firmware
everything it needs to know about the model. These are the key defines
generated for our MoveNet deployment:
| Define | Value | Used by |
|---|---|---|
| POSTPROCESS_TYPE | POSTPROCESS_SPE_MOVENET_UF | Selects the argmax heatmap decoder in display_spe.c |
| NN_HEIGHT / NN_WIDTH | 192, 192 | DCMIPP crop window and NPU input buffer sizing |
| AI_POSE_PP_CONF_THRESHOLD | 0.4 | Keypoints below 40% confidence are not drawn |
| AI_POSE_PP_POSE_KEYPOINTS_NB | 13 | Selects the 13-keypoint skeleton (no head landmarks) |
| HEATMAP_WIDTH/HEIGHT | 48, 48 (=192/4) | Output heatmap resolution — argmax decoded per channel |
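Reassembled as source, this corresponds to an excerpt like the following; a reconstruction from the table above, so the literal formatting and the full set of defines may differ in the generated file:

```c
/* app_config.h (excerpt, reconstructed from the table above) */
#define POSTPROCESS_TYPE              POSTPROCESS_SPE_MOVENET_UF
#define NN_WIDTH                      192
#define NN_HEIGHT                     192
#define AI_POSE_PP_CONF_THRESHOLD     (0.4f)
#define AI_POSE_PP_POSE_KEYPOINTS_NB  (13)
#define HEATMAP_WIDTH                 (NN_WIDTH / 4)   /* 48 */
#define HEATMAP_HEIGHT                (NN_HEIGHT / 4)  /* 48 */
```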