EulerNPU

NPU Inference Composition & Simulation Stack

Define inference graphs with 123 operators across 13 groups and 10 data types. Compile spec.yaml to .npuart artifacts and simulate or deploy to Zynq-7020 FPGA hardware — all from a single CLI.

Open Source

Operator Set & Compilation Pipeline

123 operators, 13 groups, 10 data types — from spec to deployment artifact

Operator Groups (13)

A comprehensive operator set covering all common inference operations, organized into 13 logical groups.

Operators: 123 operators in 13 groups (arithmetic, activation, reduce, normalization, pooling, convolution, recurrent, attention, elementwise, shape, quantization, custom, control)
Data Types: 10 dtypes (float32, float16, bfloat16, int8, uint8, int16, int32, int64, bool, complex64)
Spec Format: spec.yaml, a declarative graph definition with typed edges and operator parameters
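As a sketch of what such a declarative definition could look like, here is an illustrative spec.yaml fragment. The field names (graph, nodes, edges, op, dtype) are assumptions for illustration only, not the authoritative schema; the schema enforced by eulernpu validate is the source of truth.

```yaml
# Illustrative sketch only — field names are assumed, not EulerNPU's
# authoritative schema. A small two-node graph with typed edges.
graph:
  name: tiny-mlp
  nodes:
    - id: fc1
      op: matmul            # arithmetic group
      params:
        transpose_b: true   # example operator parameter
    - id: act1
      op: relu              # activation group
  edges:
    - from: input
      to: fc1
      dtype: float16        # one of the 10 supported dtypes
    - from: fc1
      to: act1
      dtype: float16
```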

Compilation Pipeline

From YAML specification to deployable hardware artifact in a validated, reproducible pipeline.

Validate: schema and operator-compatibility checks
Compile: spec.yaml → IR → optimization passes → .npuart
Simulate: cycle-accurate simulation with profiling data
Deploy: Zynq-7020 FPGA target with board-smoke verification

Design Principles

Deterministic, reproducible, and auditable inference at every step

Declarative Specification

All inference graphs are defined in spec.yaml — human-readable, version-controllable, and diffable. No hidden state or implicit configuration.

Bit-Exact Reproducibility

Simulation results are bit-exact across runs. The same spec.yaml always produces the same .npuart artifact and the same inference outputs.
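One way to audit this property in CI is to compile the same spec.yaml twice and compare the resulting artifacts byte for byte. A minimal sketch, independent of EulerNPU itself (the function names here are illustrative helpers, not part of the tool):

```python
import hashlib
from pathlib import Path


def artifact_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a build artifact's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def builds_are_bit_exact(first: str, second: str) -> bool:
    """True if two .npuart artifacts are byte-identical."""
    return artifact_digest(first) == artifact_digest(second)
```

Compiling to two different output paths and asserting builds_are_bit_exact on them turns the reproducibility guarantee into a regression test.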

Hardware-First Validation

Board-smoke tests verify hardware compatibility before deployment. Calibration and profiling ensure real-world performance matches simulation.

CLI Reference

Single entry point eulernpu — 11 subcommands cover the entire workflow

validate

Validate spec.yaml schema, operator compatibility, and dtype constraints.

compile

Compile spec.yaml to .npuart artifact through the optimization pipeline.

run

Execute a compiled .npuart artifact with input data and produce outputs.

sim

Cycle-accurate simulation of the inference graph with timing data.

profile

Generate per-operator latency, memory, and throughput profiling reports.

explain

Human-readable summary of the graph structure, operator count, and data flow.

board-smoke

Run hardware compatibility smoke tests on the target FPGA board.

calibrate

Calibrate quantization parameters using representative input data.
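EulerNPU's calibration internals are not documented here; as background, a common scheme derives an asymmetric int8 scale and zero-point from the min/max range of representative data. A generic sketch of that scheme (not necessarily EulerNPU's actual algorithm):

```python
def calibrate_int8(samples):
    """Derive (scale, zero_point) for asymmetric int8 quantization
    from representative float values — a common min/max scheme."""
    lo, hi = min(samples), max(samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # range must contain zero
    scale = (hi - lo) / 255.0 or 1.0       # 255 = qmax(127) - qmin(-128)
    zero_point = round(-128 - lo / scale)  # maps lo onto qmin
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float value to int8 using the calibrated parameters."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))          # clamp to int8 range
```

The representative inputs fed to the calibrate subcommand would play the role of samples here: the wider and more realistic their range, the less clipping at the int8 boundaries.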

benchmark

Run throughput and latency benchmarks on compiled artifacts.

replay

Replay a recorded inference session for debugging and validation.

compress-cache

Compress and manage the compilation cache for faster rebuilds.

Tutorials

Step-by-step guides to get started with EulerNPU quickly

Tutorials coming soon.

Installation & Getting Started

Install EulerNPU and compile your first inference graph

Installation

# Install from a local checkout of the repository
pip install -e ".[dev]"

# Validate and compile
eulernpu validate spec.yaml
eulernpu compile spec.yaml -o model.npuart

Requirements

Python 3.12+

Zynq-7020 FPGA board (for hardware deployment)

Start NPU Inference Development with EulerNPU

From spec.yaml to hardware deployment, in a single CLI.

Get Started on GitHub
Contact Us