Description
Unified graph capture infrastructure that records Warp operations (kernel
launches, memory copies, memsets, allocations) during wp.ScopedCapture and
supports two consumers:
- APIC serialization: serialize the captured graph to a portable .wrp binary
  format and replay it from Python or standalone C++ without the original
  Python program.
- CPU graph replay: replay captured operations from the APIC byte stream,
  eliminating Python per-kernel dispatch overhead.
Both share a single C++ recording layer (the APIC byte stream in
APICStateInternal) and the same public API (capture_begin / capture_end /
capture_launch / capture_save / capture_load).
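
To illustrate the record-then-interpret idea behind a shared byte stream, here is a minimal pure-Python sketch. It is not Warp's APIC encoding: the opcodes, the fixed 13-byte record layout, and the `Recorder` / `replay` names are all hypothetical and chosen only for illustration.

```python
import struct

# Hypothetical opcodes and record layout; NOT the actual APIC format.
OP_MEMSET = 1
OP_COPY = 2
RECORD = struct.Struct("<BIII")  # opcode, arg0, arg1, element count


class Recorder:
    """Appends operations to a flat byte stream instead of executing them."""

    def __init__(self):
        self.stream = bytearray()

    def memset(self, buf_id, value, count):
        self.stream += RECORD.pack(OP_MEMSET, buf_id, value, count)

    def copy(self, dst_id, src_id, count):
        self.stream += RECORD.pack(OP_COPY, dst_id, src_id, count)


def replay(stream, buffers):
    """Interprets each fixed-size record against a table of buffers."""
    for offset in range(0, len(stream), RECORD.size):
        op, arg0, arg1, count = RECORD.unpack_from(stream, offset)
        if op == OP_MEMSET:
            buffers[arg0][:count] = [arg1] * count
        elif op == OP_COPY:
            buffers[arg0][:count] = buffers[arg1][:count]
        else:
            raise ValueError(f"unknown opcode {op}")


# Record once, then replay without re-running the recording code.
rec = Recorder()
rec.memset(0, 7, 4)   # fill buffer 0 with 7
rec.copy(1, 0, 2)     # copy the first two elements of buffer 0 into buffer 1

buffers = {0: [0] * 4, 1: [0] * 4}
replay(bytes(rec.stream), buffers)
```

Because the stream is plain bytes, the same recording could in principle be written to disk or interpreted in a loop, which is the property the two consumers above rely on.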
Implemented
- Record wp.launch, wp.copy, array.zero_(), and allocations during graph
  capture on both CPU and CUDA devices
- APIC: serialize to .wrp + _modules/ (compiled kernels as .cubin/.ptx for
  CUDA, .o for CPU)
- APIC: load and replay in Python via capture_load() and in standalone C++ via
  the wp_apic_* C API
- APIC: named input/output bindings for supplying new data without graph
  rebuild (set_param / get_param)
- APIC: wp.Mesh serialization with handle pointer remapping (wp.handle type)
- CPU: replay via byte-stream interpretation (wp_apic_cpu_replay_state /
  wp_apic_cpu_replay_graph) with no Python involvement per operation
- Backward-compatible with existing wp.capture_* APIs
- Standalone C++ examples: CUDA OpenGL visualization (02_apic_visualization),
  CPU-only OpenGL visualization (03_apic_visualization_cpu)
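
The handle remapping mentioned for wp.Mesh can be pictured with a small pure-Python sketch. It assumes that recorded kernel arguments carry a type tag and that the loader builds a map from recorded handle values to handles of the objects it recreated. The `remap_handles` helper, the tuple layout, and the example addresses are hypothetical and are not taken from the implementation.

```python
def remap_handles(kernel_args, handle_map):
    """Return a copy of kernel_args with recorded handles replaced.

    kernel_args: list of (kind, value) pairs; only kind == "handle" is remapped.
    handle_map:  recorded handle value -> handle valid in the loading process.
    """
    remapped = []
    for kind, value in kernel_args:
        if kind == "handle":
            if value not in handle_map:
                raise KeyError(f"no object recreated for recorded handle {value:#x}")
            value = handle_map[value]
        remapped.append((kind, value))
    return remapped


# Arguments as they might look at capture time (illustrative values only).
recorded_args = [("int", 128), ("handle", 0x7F00AA10), ("float", 0.5)]

# After loading, the mesh lives at a different address in the new process.
handle_map = {0x7F00AA10: 0x55001230}

new_args = remap_handles(recorded_args, handle_map)
```

Tagging handles with their own kind is what lets a loader tell them apart from ordinary 64-bit integers that happen to hold the same bits, which is presumably the role a dedicated handle type plays.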
Not yet supported
- array.fill_() during CPU capture (wp_memtile_host not recorded)
- wp.Volume / wp.BVH serialization
- wp.capture_if / wp.capture_while (conditional and loop graph nodes)
- Stream event nodes (record_event, wait_event)
- Texture array copies
- Multi-GPU support
- Compilation recording (deferred graph construction)
- wp.Tape backward pass during capture (access violation on CPU: forward
  arrays are not executed during deferred capture)
Coverage
Operations supported during graph capture:
| Operation | CPU | CUDA |
|---|---|---|
| wp.launch | Yes | Yes |
| wp.copy | Yes | Yes |
| array.zero_() | Yes | Yes |
| array.fill_() | No (memtile not recorded) | Yes (CUDA kernel) |
| wp.zeros / wp.empty (in-capture alloc) | Yes | Yes |
| wp.Mesh queries | Yes | Yes |
| wp.Mesh serialization | Yes | Yes |

Tests: 55 total (28 CPU+CUDA graph tests, 16 APIC round-trip tests, 6 mesh
serialization tests, 5 additional tests enabled on CPU).
Design docs
- Analysis — per-example catalog of captured operations and coverage matrix
- v1 Design — Python recording approach (superseded)
- v2 Design — C++ byte stream recording, architecture, implementation notes
Usage
import warp as wp

# CPU graph capture (already worked on CUDA)
with wp.ScopedCapture(device="cpu") as capture:
    wp.launch(my_kernel, dim=n, inputs=[a, b], device="cpu")
wp.capture_launch(capture.graph)

# APIC serialization
with wp.ScopedCapture(device="cuda:0", apic=True) as capture:
    wp.launch(my_kernel, dim=n, inputs=[a, b], device="cuda:0")
wp.capture_save(capture.graph, "my_graph",
                inputs={"a": a}, outputs={"b": b})

# Load and replay (Python, also available as standalone C API)
loaded = wp.capture_load("my_graph", device="cuda:0")
wp.capture_launch(loaded)