APIC MVP #1349

@c0d1f1ed

Description

Unified graph capture infrastructure that records Warp operations (kernel
launches, memory copies, memsets, allocations) during wp.ScopedCapture and
supports two consumers:

  1. APIC serialization — Serialize the captured graph to a portable .wrp
    binary format and replay it from Python or standalone C++ without the
    original Python program.
  2. CPU graph replay — Replay captured operations from the APIC byte stream,
    eliminating Python per-kernel dispatch overhead.

Both share a single C++ recording layer (the APIC byte stream in
APICStateInternal) and the same public API (capture_begin / capture_end /
capture_launch / capture_save / capture_load).

Implemented

  • Record wp.launch, wp.copy, array.zero_(), and allocations during graph
    capture on both CPU and CUDA devices
  • APIC: serialize to .wrp + _modules/ (compiled kernels as .cubin/.ptx
    for CUDA, .o for CPU)
  • APIC: load and replay in Python via capture_load() and in standalone C++
    via the wp_apic_* C API
  • APIC: named input/output bindings for supplying new data without rebuilding
    the graph (set_param / get_param)
  • APIC: wp.Mesh serialization with handle pointer remapping (wp.handle type)
  • CPU: replay via byte-stream interpretation (wp_apic_cpu_replay_state /
    wp_apic_cpu_replay_graph) with no Python involvement per operation
  • Backward-compatible with existing wp.capture_* APIs
  • Standalone C++ examples: CUDA OpenGL visualization
    (02_apic_visualization), CPU-only OpenGL visualization
    (03_apic_visualization_cpu)
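
The named input/output bindings in the list above can be pictured with a small pure-Python model: recorded operations refer to named slots, so a caller can bind a new buffer to a slot and replay without re-recording. ToyGraph and its method signatures are hypothetical; the issue names set_param / get_param but does not give their real signatures.

```python
class ToyGraph:
    """Recorded ops refer to named slots rather than to concrete buffers."""

    def __init__(self, ops, inputs, outputs):
        self.ops = ops                      # list of (fn, out_name, in_names)
        self.slots = {**inputs, **outputs}  # name -> buffer currently bound

    def set_param(self, name, buf):
        # Rebind a named slot; the recorded ops themselves are untouched.
        if name not in self.slots:
            raise KeyError(f"no binding named {name!r}")
        self.slots[name] = buf

    def get_param(self, name):
        return self.slots[name]

    def launch(self):
        # Resolve slot names at replay time, then write results in place.
        for fn, out_name, in_names in self.ops:
            args = [self.slots[n] for n in in_names]
            self.slots[out_name][:] = [fn(*xs) for xs in zip(*args)]


a = [1.0, 2.0]
b = [0.0, 0.0]
graph = ToyGraph(
    ops=[(lambda x: 2.0 * x, "b", ("a",))],
    inputs={"a": a},
    outputs={"b": b},
)

graph.launch()
first = list(graph.get_param("b"))

graph.set_param("a", [5.0, 6.0])  # new input data, same recorded graph
graph.launch()
second = list(graph.get_param("b"))
```

The point of the indirection is that the recorded operations never change; only the table that maps names to buffers does.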

Not yet supported

  • array.fill_() during CPU capture (wp_memtile_host not recorded)
  • wp.Volume / wp.BVH serialization
  • wp.capture_if / wp.capture_while (conditional and loop graph nodes)
  • Stream event nodes (record_event, wait_event)
  • Texture array copies
  • Multi-GPU support
  • Compilation recording (deferred graph construction)
  • wp.Tape backward pass during capture (access violation on CPU: the forward
    pass is not executed during deferred capture)

Coverage

Operations supported during graph capture:

| Operation | CPU | CUDA |
| --- | --- | --- |
| wp.launch | Yes | Yes |
| wp.copy | Yes | Yes |
| array.zero_() | Yes | Yes |
| array.fill_() | No (memtile not recorded) | Yes (CUDA kernel) |
| wp.zeros / wp.empty (in-capture alloc) | Yes | Yes |
| wp.Mesh queries | Yes | Yes |
| wp.Mesh serialization | Yes | Yes |

Tests: 55 total (28 CPU+CUDA graph tests, 16 APIC round-trip tests, 6 mesh
serialization tests, 5 additional tests enabled on CPU).
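
The handle pointer remapping mentioned for wp.Mesh serialization can be illustrated in general terms: a handle that is only valid in the capturing process is stored as a stable index, and at load time the index is mapped to the handle of the recreated object. This is a minimal sketch of that general technique, not the implementation in this issue; the function names and handle values are made up.

```python
def save_handles(live_handles, references):
    """Replace process-specific handles with stable indices before saving."""
    index_of = {handle: i for i, handle in enumerate(live_handles)}
    return [index_of[h] for h in references]


def load_handles(saved_indices, recreated_handles):
    """Map stored indices to the handles of objects recreated at load time."""
    return [recreated_handles[i] for i in saved_indices]


# Made-up handle values for two objects alive at capture time.
capture_time_handles = [0x7F00AA10, 0x7F00BB20]
# Handles as they appear in recorded kernel arguments.
kernel_args = [0x7F00BB20, 0x7F00AA10, 0x7F00BB20]

saved = save_handles(capture_time_handles, kernel_args)

# After loading, the same two objects exist at different addresses.
load_time_handles = [0x55001000, 0x55002000]
remapped = load_handles(saved, load_time_handles)
```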

Design docs

  • Analysis — per-example catalog of captured operations and coverage matrix
  • v1 Design — Python recording approach (superseded)
  • v2 Design — C++ byte stream recording, architecture, implementation notes

Usage

import warp as wp

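# Sketch only: my_kernel, n, a, and b are assumed to be defined elsewhere,
# with a and b allocated on the device used by each capture.

# CPU graph capture (CUDA graph capture was already supported)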
with wp.ScopedCapture(device="cpu") as capture:
    wp.launch(my_kernel, dim=n, inputs=[a, b], device="cpu")
wp.capture_launch(capture.graph)

# APIC serialization
with wp.ScopedCapture(device="cuda:0", apic=True) as capture:
    wp.launch(my_kernel, dim=n, inputs=[a, b], device="cuda:0")
wp.capture_save(capture.graph, "my_graph",
                inputs={"a": a}, outputs={"b": b})

# Load and replay from Python (also available through the standalone C API)
loaded = wp.capture_load("my_graph", device="cuda:0")
wp.capture_launch(loaded)

Labels: feature request