
Chapter 5 — Case Studies

Three Models, One Board

This chapter presents detailed profiling data for each deployment — real numbers extracted directly from the ST Edge AI Core reports. Each model tells a different story about what the Neural-ART NPU can and cannot accelerate.

MoveNet • CNN • 94.7% NPU
YOLOv8n • CNN+Head • 87.9% NPU
TinyBERT • Transformer • 64.4% NPU
Case Study 1
MoveNet Lightning — The CNN Benchmark

Architecture & why we chose it

MoveNet Lightning is a single-person pose estimation model developed by Google, based on a MobileNetV2 backbone with a Feature Pyramid Network decoder. It was designed specifically for real-time inference on edge devices — making it the natural baseline for our CNN benchmark. The ST variant (st_movenet_lightning_heatmaps_192_int8_pc) was retrained by STMicroelectronics on a custom COCO subset and pre-quantized to INT8 per-channel — meaning it required no quantization step from us, only deployment.

The model outputs heatmaps rather than direct coordinates: a (48×48×13) tensor where each of the 13 channels is a probability map for one keypoint. The C firmware postprocessor in display_spe.c decodes these by finding the argmax of each channel and converting to screen coordinates.
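For reference, the decoding logic reduces to an argmax per channel followed by a grid-to-pixel scaling. The NumPy sketch below illustrates it (the firmware does the same in C inside display_spe.c; the function name and the cell-centre scaling convention here are illustrative, not taken from the project sources):

import numpy as np

def decode_heatmaps(heatmaps, img_w=192, img_h=192):
    """Decode a (48, 48, 13) heatmap tensor into 13 (x, y, score) keypoints."""
    grid_h, grid_w, num_kpts = heatmaps.shape
    keypoints = []
    for k in range(num_kpts):
        channel = heatmaps[:, :, k]
        row, col = np.unravel_index(np.argmax(channel), (grid_h, grid_w))
        x = (col + 0.5) * img_w / grid_w   # grid cell centre -> input-image pixel
        y = (row + 0.5) * img_h / grid_h
        keypoints.append((x, y, float(channel[row, col])))
    return keypoints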

Profiling data — real numbers from ST Edge AI Core v2.1.0

Epoch breakdown
71 EC (94.7%)
4 SW (5.3%)
Total: 75 epochs
SW epochs 59, 63, 67: Resize bilinear
SW epoch 74: DequantizeLinear
All 4 SW in decoder — backbone 100% NPU
Memory allocation
cpuRAM2: 864 KB (84% of pool)
npuRAM4: 378 KB (84% of pool)
npuRAM5: 432 KB (96% of pool)
octoFlash: 2.924 MB (5% of pool)
Total: 4.559 MB
Inference time: 22 ms
NPU offload: 94.7%
OKS (192×192): 57.6%
MACs total: 242M
Why 94.7% and not 100%: The 4 SW epochs are all in the decoder — not the backbone. The entire MobileNetV2 feature extractor (epochs 1–58) runs 100% on the NPU. Bilinear resize and dequantization fall back to the Cortex-M55 + Helium SIMD. This is the best possible result for a CNN on this hardware.
Case Study 2
YOLOv8n-pose — CNN with Detection Head

Architecture & why it is harder than MoveNet

YOLOv8n-pose is a multi-person pose estimation model from Ultralytics — it detects all people in the frame simultaneously and outputs bounding boxes with 17 COCO keypoints per person. Unlike MoveNet, it has two distinct parts: a CNN backbone (CSPDarknet + PAN-FPN) that runs almost entirely on the NPU, and a detection head that includes reshape, transpose, softmax, and NMS operations — many of which are not supported by the NPU.

This model was not in the ST Model Zoo — it required manual quantization via chain_qd. The float PyTorch model was exported to .tflite and quantized with Post-Training Quantization using a COCO calibration subset.
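The sketch below shows a standard TensorFlow Lite post-training quantization pass of the kind described above, assuming the float model is available as a TensorFlow SavedModel; the file names, the calibration array, and the sample count are hypothetical, and the project's actual run went through chain_qd rather than this script:

import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Hypothetical calibration set: pre-resized COCO images saved as a NumPy array.
    for img in np.load("coco_calib_images.npy")[:200]:
        yield [img[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov8n_pose_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full-integer kernels so every layer is a candidate for NPU mapping.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("yolov8n_pose_int8.tflite", "wb") as f:
    f.write(converter.convert())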

Epoch breakdown
131 EC (87.9%)
18 SW (12.1%)
Total: 149 epochs
SW epochs (head): Reshape, Transpose, Softmax, Slice, Dequantize
Backbone 100% NPU — head partially on CPU
Results
Inference time: 32 ms
NPU offload: 87.9%
Keypoints: 17 (COCO)
Multi-person: Yes
NMS: on CPU (SW epoch)
Why 87.9% and not 94.7%: The gap of roughly seven percentage points comes from the detection head. YOLOv8's head applies Non-Maximum Suppression (NMS) — a data-dependent operation where the number of detections varies per frame. The NPU epoch controller requires a fixed, predictable execution schedule, so NMS and the associated reshape/transpose operations must fall back to the CPU. The backbone (all convolutions) is still 100% on the NPU.
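To make the data dependence concrete, here is a minimal NumPy sketch of greedy NMS (illustrative only, not the firmware implementation). The number of loop iterations and the size of the result change with every frame, which is exactly what a fixed epoch schedule cannot express:

import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, score_thr=0.25, iou_thr=0.5):
    """Keep the highest-scoring boxes, suppressing overlapping detections."""
    order = np.argsort(scores)[::-1]
    order = order[scores[order] >= score_thr]   # how many survive depends on the frame
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        order = order[1:][iou(boxes[best], boxes[order[1:]]) < iou_thr]
    return keep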
Case Study 3
TinyBERT — The Transformer Challenge

Architecture & why it is fundamentally different

TinyBERT is a compressed version of BERT — a Transformer-based language model. We deployed it as a proof-of-concept to test the limits of the Neural-ART NPU: what happens when you run a Transformer, not a CNN, on an NPU designed for CNNs?

The answer is visible in the epoch breakdown: 96 out of 270 epochs fall back to the CPU. The reason is architectural — Transformer blocks contain operations that are fundamentally incompatible with the CONVACC-centric design of the Neural-ART NPU.

SW epoch analysis — what falls back and why (real data)

The following is extracted directly from the network_generate_report.txt of the TinyBERT deployment (run 2025_07_16_22_43_20). Each Transformer block generates the same pattern of SW epochs — repeated 4 times (one per BERT layer):

# Pattern repeated per Transformer block (4 blocks total):
epoch N    EC   # Q/K/V linear projection (Conv) — NPU
epoch N+1  -SW- Transpose         # Q·Kᵀ requires reshape before MatMul
epoch N+3  EC   # MatMul partial — NPU
epoch N+4  -SW- Conv ×4           # attention heads: non-standard conv pattern
epoch N+9  -SW- Softmax           # attention weights — no NPU Softmax unit
epoch N+10 -SW- Split             # multi-head split — data-dependent
epoch N+11 EC   # value projection — NPU
epoch N+12 -SW- Conv ×4           # output projection: non-standard
...
epoch N+K  -SW- DequantizeLinear   # LayerNorm requires float32
epoch N+K+1 -SW- Reciprocal         # 1/std for LayerNorm normalisation
epoch N+K+2 -SW- QuantizeLinear      # re-quantize after LayerNorm
epoch N+K+3 -SW- Mul                # scale step of LayerNorm
epoch N+K+4 -SW- QuantizeLinear      # second re-quantize
epoch N+K+5 -SW- Sub                # mean subtraction of LayerNorm

Epochs are listed in compiler-emission order, which does not match the mathematical LayerNorm formula y = γ·(x−μ)/σ + β: kernel fusion and quantization boundaries reorder the primitive ops (Sub, Mul, Reciprocal) so the sequence emitted to the NPU scheduler no longer mirrors the formula.
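As an illustration of why the whole normalisation block lands on the CPU, the NumPy sketch below emulates the fallback chain for one LayerNorm: dequantize to float32, compute the statistics, apply scale and shift, requantize. Parameter names are illustrative, and the op order follows the textbook formula rather than the compiler-emission order discussed above:

import numpy as np

def layernorm_int8_fallback(x_q, in_scale, in_zp, gamma, beta,
                            out_scale, out_zp, eps=1e-5):
    """x_q: INT8 activations of shape (seq_len, hidden); gamma/beta: float32."""
    x = (x_q.astype(np.float32) - in_zp) * in_scale      # DequantizeLinear
    mu = x.mean(axis=-1, keepdims=True)                  # mean per token
    var = x.var(axis=-1, keepdims=True)                  # variance per token
    inv_std = 1.0 / np.sqrt(var + eps)                   # Reciprocal
    y = gamma * (x - mu) * inv_std + beta                # Sub, Mul
    y_q = np.round(y / out_scale) + out_zp               # QuantizeLinear
    return np.clip(y_q, -128, 127).astype(np.int8)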

Epoch breakdown
174 EC (64.4%)
96 SW (35.6%)
Total: 270 epochs
SW types: Softmax, Transpose, Split, Conv (attention), DequantizeLinear, Reciprocal, QuantizeLinear, Mul, Sub
These are the core Transformer ops — not post-processing
Memory allocation
cpuRAM2: 0 KB (0% of pool)
npuRAM5: 251 KB (56% of pool)
octoFlash: 2.062 MB (3% of pool)
Total: 2.308 MB
Smaller than MoveNet despite more epochs — BERT is parameter-efficient per layer. cpuRAM2 = 0 because SW epochs use npuRAM5 for all intermediate buffers.
Inference time: >100 ms (real-time impossible)
NPU offload: 64.4% (35.6% on CPU)
Total epochs: 270 (3.6× more than MoveNet)
Why 64.4% and not 87.9%: This is a structural incompatibility, not a tuning gap. LayerNormalization computes the mean and variance of each token's hidden vector, then applies a learned scale and shift. In the quantized graph this expands to DequantizeLinear → Reciprocal → QuantizeLinear → Mul → QuantizeLinear → Sub — a chain of six SW epochs per layer, repeated across all four BERT layers. Softmax requires an exponential followed by a normalisation — also not in the NPU instruction set. These are not peripheral operations like bilinear resize in MoveNet — they are the core of every Transformer block. No amount of compiler optimisation can map them to CONVACC.
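The attention softmax follows the same dequantize, compute, requantize pattern; a minimal NumPy sketch with illustrative parameter names:

import numpy as np

def softmax_int8_fallback(logits_q, in_scale, in_zp, out_scale, out_zp):
    """Attention softmax on the CPU: no exponential unit exists in CONVACC."""
    x = (logits_q.astype(np.float32) - in_zp) * in_scale    # DequantizeLinear
    x = x - x.max(axis=-1, keepdims=True)                   # numerical stability
    e = np.exp(x)
    p = e / e.sum(axis=-1, keepdims=True)
    p_q = np.round(p / out_scale) + out_zp                  # QuantizeLinear
    return np.clip(p_q, -128, 127).astype(np.int8)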

Cross-model comparison

[Chart: NPU offload rate per model — MoveNet Lightning 71 EC / 4 SW epochs (94.7% NPU), YOLOv8n-pose 131 EC / 18 SW (87.9% NPU), TinyBERT 174 EC / 96 SW (64.4% NPU); SW epochs run on the CPU]
Metric | MoveNet | YOLOv8n | TinyBERT
Architecture | CNN (MobileNetV2) | CNN + Det. Head | Transformer (BERT)
Total epochs | 75 | 149 | 270
EC epochs (NPU) | 71 | 131 | 174
SW epochs (CPU) | 4 | 18 | 96
NPU offload | 94.7% | 87.9% | 64.4%
Weights (octoFlash) | 2.924 MB | ~3.2 MB | 2.062 MB
Activations (SRAM) | 1.635 MB | ~1.8 MB | 251 KB
Inference time | 22 ms | 32 ms | >100 ms
SW epoch cause | Resize, Dequantize | NMS, Reshape | Softmax, LayerNorm, Attention