KULeuven-MICAS/stream

🌊 Stream

Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.


✨ Key Features

Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).

Layer-fused scheduling across the whole system of cores.

TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.

Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.

ONNX workloads with auto-generated or hand-written mappings.

AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.

Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.

The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
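As a rough mental model of such a stage chain, the sketch below threads a context through five stages in order. Everything in it is hypothetical: `make_stage`, `run_pipeline`, the dict-based context, and the stage names are illustrative stand-ins, not Stream's actual stage classes.

```python
from typing import Callable

# Hypothetical stand-ins: Stream's real stage classes and context type differ.
Context = dict
Stage = Callable[[Context], Context]


def make_stage(name: str) -> Stage:
    """Return a stage that records its own name in the context it receives."""
    def stage(ctx: Context) -> Context:
        done = ctx.get("completed", [])
        return {**ctx, "completed": [*done, name]}
    return stage


def run_pipeline(stages: list[Stage], ctx: Context) -> Context:
    """Feed the context through each stage in order; each stage sees prior results."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


# Stage names mirror the chain described above (labels only, no real work is done).
STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]
pipeline = [make_stage(name) for name in STAGE_NAMES]
result = run_pipeline(pipeline, {"workload": "model.onnx"})
print(result["completed"])
```

The point of the pattern is that later stages read what earlier stages stored, which is also how results such as total_latency end up retrievable from the returned context.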


🚀 Installation

Python >=3.12 is required.

Full install with MCP server support (from the repo root):

pip install -e ".[mcp]"

Base install (no MCP server):

pip install -e .

The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).

AIE code generation

AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:

pip install -e .       # or, once published: pip install stream-dse
stream-setup-aie       # installs the AIE toolchain into the current environment

stream-setup-aie --dry-run prints exactly what it will install without making changes.

⚠️ Platform caveat: the AIE toolchain supports Linux x86_64 only (manylinux wheels), on CPython 3.12 or 3.13.

💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
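One way to choose a backend name defensively is to check whether gurobipy is importable before asking for Gurobi. The helper below is a hypothetical sketch (`pick_backend` is not part of Stream); it only uses the backend names mentioned above and cannot verify that a license is valid.

```python
import importlib.util


def pick_backend(prefer_gurobi: bool = True) -> str:
    """Return "gurobi" only if gurobipy is importable, else the OR-Tools default.

    Importability does not prove a valid license: a missing or expired
    license would still surface as an error at solve time.
    """
    if prefer_gurobi and importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    return "ortools_gscip"


print(pick_backend())
```

The returned string is meant to be passed wherever a backend argument is accepted.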

Optional pre-commit setup:

pre-commit install

⚡ Quick Start

Run the constraint-optimization (CO) pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping; the run takes approximately 11 seconds:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Alternatively, run the task-runner recipe just co-2conv (this repo uses just as its task runner; the recipe defaults to tpu_like_quad_core, see the matrix below). Because --mapping is omitted, the pipeline auto-generates the mapping; the hardware is a TPU-like quad-core system.

Expected output:

Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)

A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.


🧩 Hardware and Core Types

An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.

Hardware and mapping files are organized as follows:

  • stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
  • stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
  • stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
  • stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.

A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.
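Given the layout above, the available system-level hardware definitions can be discovered with a non-recursive glob. This is a minimal sketch run from the repo root; `hardware_stems` is a hypothetical helper, not part of Stream.

```python
from pathlib import Path


def hardware_stems(hardware_dir: str = "stream/inputs/examples/hardware") -> list[str]:
    """List system-level hardware YAML stems, skipping the cores/ subdirectory."""
    root = Path(hardware_dir)
    if not root.is_dir():
        return []
    # glob("*.yaml") is non-recursive, so per-core files under cores/ are excluded.
    return sorted(p.stem for p in root.glob("*.yaml"))


print(hardware_stems())
```

Each returned stem corresponds to a file `<stem>.yaml` that can be passed to --hardware.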


📊 Workload × Hardware Matrix

The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).

Workloads - committed test fixtures under stream/inputs/testing/workload/. Weight values are cleared because only tensor shapes matter for cost estimation, which keeps the ONNX files tiny; just gen-workloads regenerates them via the builders:

  • 2-conv - two chained Conv layers (make_2_conv.py).
  • swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).

| Hardware (stream/inputs/examples/hardware/) | Description | 2-conv | swiglu |
| --- | --- | --- | --- |
| eyeriss_like_single_core | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| eyeriss_like_dual_core | two Eyeriss-like compute cores | ✓ | ✓ |
| eyeriss_like_quad_core | four Eyeriss-like compute cores | ✓ | ✓ |
| tpu_like_quad_core | four TPU-like compute cores | ✓ | ✓ |
| simba_small | small Simba chiplet mesh | ✓ | ✓ |
| simba | 36-core Simba chiplet mesh | ✓ | ✓ |
| fusemax | FuseMax array + vector + DRAM | ✓ | ✓ |
| meta_prototype_dual_core_simd_offchip | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |

✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.

Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):

just co-2conv fusemax           # 2-conv on an architecture
just co-swiglu simba_small      # swiglu on an architecture

Equivalently, the raw entry-point call:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/fusemax.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:

just matrix          # parse + 2-conv + swiglu over all 8 architectures (incl. simba)

🖥️ Command-Line Entry Points

All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).

| Script | Purpose |
| --- | --- |
| scripts/main_stream_co.py | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE). |
| scripts/main_gemm.py | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| scripts/main_swiglu.py | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| scripts/main_swiglu_dse_single.py | Single-mapping SwiGLU DSE evaluation (AIE). |
| scripts/main_swiglu_dse.py | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| scripts/main_aie_co.py | CO allocation for a hard-coded single AIE tile workload (no args; run as python scripts/main_aie_co.py). |
| scripts/main_gemm_codegen.py | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); --M/--N/--K. |

scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.

Full scripts/main_stream_co.py CLI syntax:

python scripts/main_stream_co.py \
  --hardware PATH_TO_HW_YAML \
  --workload PATH_TO_ONNX \
  [--mapping PATH_TO_MAPPING_YAML]  # omit for auto-generated mapping
  [--output OUTPUT_DIR]             # default: "outputs"
  [--experiment-id ID]
  [--skip-if-exists]

🐍 Public API

The public API lives in stream/api.py.

The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0 (the 2-conv ONNX it references is produced by just gen-workloads):

import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic

configure_logging()

with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))

Expected output: total_latency: 14344.0.

The other two public functions:

  • optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
  • optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.

All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").
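A short report can be built from those keys. The sketch below is hypothetical (`latency_report` is not part of Stream) and rests on an assumption not confirmed above: that group_latencies holds plain numbers, either as a sequence or keyed by group id. A plain dict stands in for the StageContext, since both expose `.get(key)`.

```python
from collections.abc import Mapping


def latency_report(ctx) -> list[str]:
    """Format total and per-group latency from any object exposing .get(key)."""
    total = ctx.get("total_latency")
    groups = ctx.get("group_latencies") or []
    # Assumption: group latencies are numbers, stored as a sequence or keyed by group id.
    items = groups.items() if isinstance(groups, Mapping) else enumerate(groups)
    lines = [f"Total latency: {total}"]
    for group_id, latency in items:
        share = 100.0 * latency / total if total else 0.0
        lines.append(f"  Group {group_id}: {latency} ({share:.1f}%)")
    return lines


# Stand-in for a real StageContext; the values mirror the Quick Start run.
demo = {"total_latency": 14344.0, "group_latencies": [14344]}
print("\n".join(latency_report(demo)))
```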


🤖 MCP Server (for AI agents)

Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. Requires the [mcp] extra (pip install -e ".[mcp]").

⚠️ Install caveat: [mcp] does not currently resolve against the pinned PyPI xdsl 0.29.1 - fastmcp's dependency tree needs newer typing-extensions/pydantic than xdsl 0.29.1 permits. For now it installs only in the dev environment that uses the git build of xdsl; a clean fix awaits the xdsl upgrade.

Launch command (from the repo root):

python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"

The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.

The 6 tools:

| Tool | Purpose |
| --- | --- |
| run_optimization(hardware, workload, mapping, output_path, backend, ...) | Submit a TETRA CO job; returns a job_id immediately; solve runs in the background. |
| poll_optimization(job_id) | Check job status (pending / running / complete / failed / not_found). |
| get_workload_ir(workload=None, experiment_id=None) | Return the workload DAG as WorkloadIR JSON. |
| get_accelerator_ir(hardware=None, experiment_id=None) | Return the hardware model as AcceleratorIR JSON. |
| get_allocation_ir(job_id) | Return the TETRA allocation result as AllocationIR JSON (3 persona views). |
| get_solve_stats(job_id) | Return MILP solve statistics (objective, time, gap, node count, backend). |

Run / poll / inspect flow:

  1. run_optimization(...) returns {"job_id": "...", "status": "pending"}.
  2. Poll poll_optimization(job_id) until {"status": "complete"}.
  3. Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
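The polling step above can be sketched as a loop that is independent of any particular MCP client. Here `call_tool` is a hypothetical callable `(tool_name, arguments) -> dict` that you would wire to your client of choice; the tool name and status values come from the table above, while `wait_for_job` and `scripted_tool` are illustrative names. A scripted fake stands in for the server so the example runs on its own.

```python
import time
from typing import Callable

# Statuses after which polling should stop (taken from the poll_optimization row above).
TERMINAL = {"complete", "failed", "not_found"}


def wait_for_job(
    call_tool: Callable[[str, dict], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
) -> dict:
    """Poll poll_optimization until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = call_tool("poll_optimization", {"job_id": job_id})
        if reply.get("status") in TERMINAL:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {reply.get('status')!r} after {timeout_s}s")
        time.sleep(interval_s)


def scripted_tool(statuses: list[str]) -> Callable[[str, dict], dict]:
    """Fake tool caller that replays a fixed status sequence (for demonstration only)."""
    replies = iter(statuses)

    def call_tool(name: str, arguments: dict) -> dict:
        return {"job_id": arguments["job_id"], "status": next(replies)}

    return call_tool


final = wait_for_job(scripted_tool(["pending", "running", "complete"]), "demo-job", interval_s=0.0)
print(final)
```

Treating failed and not_found as terminal avoids polling forever on a job that will never complete; the caller then decides whether to inspect or resubmit.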

🧠 Working in This Repo (AI agents)

Programmatic / IR API for structured JSON output:

from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR

# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))

workload_data = workload_ir.model_dump()      # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()

AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.
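Since model_dump() yields plain dicts, persisting them for an agent to read later takes only the standard library. The helper below is a hypothetical sketch (`write_ir_json` is not part of Stream), and the dict passed to it is a placeholder rather than real IR content.

```python
import json
import tempfile
from pathlib import Path


def write_ir_json(data: dict, path: str) -> Path:
    """Write an IR dict to disk as indented JSON and return the resulting path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # default=str is a fallback for values json cannot encode natively (an assumption
    # about what a dump may contain; plain dict/list/str/number data is unaffected).
    out.write_text(json.dumps(data, indent=2, default=str))
    return out


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder data; in practice pass workload_data, hardware_data, or allocation_data.
    target = write_ir_json({"nodes": ["conv0", "conv1"]}, f"{tmp}/ir/workload.json")
    print(target.read_text())
```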


📚 Further Documentation

About

Multi-core HW accelerator mapping optimization framework for layer-fused ML workloads.
