Skip to content

Warp determinism#1355

Draft
mmacklin wants to merge 109 commits into
NVIDIA:mainfrom
mmacklin:warp-deterministic
Draft

Warp determinism#1355
mmacklin wants to merge 109 commits into
NVIDIA:mainfrom
mmacklin:warp-deterministic

Conversation

@mmacklin

@mmacklin mmacklin commented Apr 10, 2026

Copy link
Copy Markdown
Collaborator

Description

Add deterministic execution mode for supported atomic patterns via wp.config.deterministic, with module-level and unique-kernel overrides through the existing module options system.

When enabled, floating-point accumulation atomics (atomic_add, atomic_sub, atomic_min, atomic_max) are transparently redirected through a deterministic scatter-sort-reduce path, and counter / allocator patterns that consume the atomic return value use an automatic two-pass count-scan-execute path. This provides bit-exact reproducible CUDA results across runs without requiring users to manually rewrite kernels.

This PR also adds targeted fixes and coverage for deterministic launch edge cases:

  • suppress side effects during the Phase 0 counting pass
  • separate scatter buffers by target and reduction op
  • improve scatter buffer capacity accounting
  • preserve wp.launch(..., record_cmd=True) support for deterministic kernels

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Test plan

Verified with:

uv run warp/tests/test_deterministic.py
uvx pre-commit run --files warp/_src/codegen.py warp/_src/context.py warp/_src/deterministic.py warp/tests/test_deterministic.py
uvx pre-commit run --files design/deterministic-execution.md warp/_src/context.py

The deterministic test module covers:

  • reproducibility for float accumulation atomics
  • += lowering to deterministic atomics
  • float64 support
  • multi-array and 2D indexing cases
  • counter / allocator reproducibility and correctness
  • mixed counter + accumulation kernels
  • per-kernel module option override
  • phase-0 side-effect suppression
  • mixed reduce ops on the same array
  • record_cmd=True support for deterministic launches

New feature / enhancement

import numpy as np
import warp as wp

wp.init()
wp.config.deterministic = True

@wp.kernel
def scatter_add(values: wp.array(dtype=wp.float32),
                indices: wp.array(dtype=wp.int32),
                out: wp.array(dtype=wp.float32)):
    tid = wp.tid()
    wp.atomic_add(out, indices[tid], values[tid])

n = 1024
values_np = np.ones(n, dtype=np.float32)
indices_np = np.arange(n, dtype=np.int32) % 16

values = wp.array(values_np, dtype=wp.float32, device="cuda")
indices = wp.array(indices_np, dtype=wp.int32, device="cuda")

results = []
for _ in range(5):
    out = wp.zeros(16, dtype=wp.float32, device="cuda")
    wp.launch(scatter_add, dim=n, inputs=[values, indices], outputs=[out], device="cuda")
    results.append(out.numpy())

# Bit-exact reproducibility across runs
for i in range(1, len(results)):
    np.testing.assert_array_equal(results[0], results[i])

Summary by CodeRabbit

  • New Features

    • Added a deterministic execution mode for atomics with global/module/kernel toggles, configurable record limits and debug diagnostics; supports ordered scatter reductions and two-pass counter/allocator semantics with GPU implementation and CPU fallbacks.
  • Documentation

    • Added a detailed design specification and changelog entry describing modes, supported patterns, limitations, and configuration semantics.
  • Tests

    • Added extensive deterministic tests covering reproducibility, correctness, overrides, capacity/overflow, and capture/replay.
  • Chores

    • Build system extended to include native deterministic sources for CPU/GPU.

Closes #1443

@copy-pr-bot

copy-pr-bot Bot commented Apr 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 10, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an opt-in deterministic execution mode for supported atomic operations: new global/module/kernel config flags, codegen interception to emit scatter or two‑phase counter patterns, deterministic metadata and buffer management, CUDA device/CPU stubs for deterministic sort‑reduce, build updates, and extensive tests and benchmarks.

Changes

Cohort / File(s) Summary
Configuration & Design
CHANGELOG.md, design/deterministic-execution.md, warp/config.py
Introduces wp.config.deterministic and deterministic_debug, adds changelog entry and a design doc describing determinism modes, supported atomic patterns, limits, and debug behavior.
Codegen & Deterministic Metadata
warp/_src/codegen.py, warp/_src/deterministic.py
Adds interceptable atomic allowlist and order-dependent classification, DeterministicMeta and target dataclasses, codegen paths for Pattern A (scatter sort‑reduce) and Pattern B (two‑pass counter), assign-time return-use tracking, store suppression in phase 0, and helpers for target deduplication and buffer sizing.
Context, Kernel Options & Launch Runtime
warp/_src/context.py
Adds kernel decorator params deterministic and deterministic_max_records, per-kernel/module resolution and hashing to populate det_meta, DeterministicLaunch and _launch_deterministic to allocate buffers, orchestrate phase runs, update counters, and invoke post‑kernel sort/reduce; augments generated kernel signatures with hidden deterministic parameters and includes new ctypes binding.
Native Device Support & Headers
warp/native/deterministic.cu, warp/native/deterministic.h, warp/native/deterministic.cpp, warp/native/warp.h
Adds CUDA implementation for deterministic scatter/sort + segmented reduction with scalar-type dispatch, device scatter helper, header API (wp::deterministic::scatter), public entry wp_deterministic_sort_reduce_device, and CPU stub for non-CUDA builds.
Build System
build_lib.py
Adds native/deterministic.cpp to CPU build units and native/deterministic.cu to CUDA build units so deterministic native code is compiled into libraries.
Tests & Suite Integration
warp/tests/test_deterministic.py, warp/tests/unittest_suites.py, warp/tests/test_unique_module.py
Adds comprehensive deterministic tests covering scatter, counters, mixed ops, float64, capacity/overflow, per-kernel/module overrides, CUDA graph/recording/graph capture, and module-hashing tests; integrates TestDeterministic into the default suite.
Benchmarks
asv/benchmarks/atomics.py
Adds deterministic benchmark variants and sizes, new deterministic kernels, captures CUDA graphs for repeated timed replays to measure determinism overhead.

Sequence Diagram(s)

sequenceDiagram
    participant User as User Kernel
    participant Python as Python Runtime
    participant CUDA as CUDA Device
    participant ScatterBuf as Scatter Buffers
    participant SortReduce as Sort-Reduce Kernel
    participant Dest as Destination Array

    User->>CUDA: Launch kernel (scatter pattern)
    CUDA->>ScatterBuf: wp::deterministic::scatter (pack key+value, inc counter)
    CUDA->>Python: Kernel returns
    Python->>ScatterBuf: Read record count
    Python->>SortReduce: Call device sort-reduce entry
    SortReduce->>ScatterBuf: Radix sort keys/values
    SortReduce->>Dest: Deterministic segment-wise reduce into dest
    SortReduce->>Python: Complete
    Python->>User: Return results
Loading
sequenceDiagram
    participant User as User Kernel
    participant Python as Python Runtime
    participant Phase0 as Phase 0 (Count)
    participant Scan as Prefix Scan
    participant Phase1 as Phase 1 (Execute)
    participant Counter as Counter Array

    User->>Python: Launch deterministic kernel (counter pattern)
    Python->>Phase0: Run kernel with _wp_det_phase=0 (suppress side effects)
    Phase0->>Python: Accumulated per-counter counts
    Python->>Scan: Compute deterministic prefix sums (offsets)
    Scan->>Python: Offsets ready
    Python->>Phase1: Run kernel with _wp_det_phase=1 (use offsets)
    Phase1->>Counter: Writeback using deterministic offsets
    Phase1->>Python: Complete
    Python->>User: Return results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.02% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Warp determinism' is vague and generic. While it references determinism (a real part of the changeset), it is too broad and non-descriptive to clearly convey the primary change to someone scanning pull request history. Consider a more descriptive title that clarifies the main feature, such as 'Add deterministic execution mode for atomic operations' or 'Implement configurable deterministic atomics with scatter-sort-reduce'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps Bot commented Apr 10, 2026

Copy link
Copy Markdown

Greptile Summary

This PR introduces a deterministic execution mode (wp.config.deterministic) for Warp kernels, providing bit-exact reproducible results across runs (RUN_TO_RUN) or across GPU architectures (GPU_TO_GPU) for supported atomic patterns. The implementation spans ~5000 lines across new C++ device code, Python codegen infrastructure, and launch orchestration.

  • Scatter-sort-reduce path: Accumulation atomics with discarded return values (e.g. wp.atomic_add used for side effect only) are redirected to a temporary scatter buffer keyed by (dest_index << 32 | thread_id), CUB radix-sorted, then segmented-reduced in fixed order, replacing the original non-deterministic in-place atomics.
  • Two-pass counter path: Consumed-return atomics (e.g. slot = wp.atomic_add(counter, 0, 1)) use a phase-0 counting pass with stores suppressed, followed by CUB prefix scan, then a phase-1 replay that substitutes deterministic slot values — enabling reproducible allocator patterns.
  • Graph capture support: record_cmd=True returns a DeterministicLaunch object; Graph._deterministic_buffer_refs retains scratch buffers alive for the full graph replay lifetime.

Confidence Score: 4/5

PR is safe to merge with awareness of two known open issues tracked in prior review threads (integer atomic double-execution in @wp.func helpers within counter kernels, and _refresh_deterministic_launch_metadata thread safety outside _codegen_lock).

Large, well-structured new feature with comprehensive test coverage across scatter, counter, backward, graph capture, and module options paths. Previously flagged issues (float16 dispatch, backward det params, buffer lifetime, sentinel key handling) have all been addressed. Two issues from prior review rounds remain open but are edge-case concerns rather than correctness-breaking for the primary use cases. No new P0 or P1 issues were found in this review pass.

warp/_src/deterministic.py and warp/_src/context.py — these house the two open issues from prior threads (helper integer atomic double-execution and _refresh_deterministic_launch_metadata lock absence).

Important Files Changed

Filename Overview
warp/_src/deterministic.py New 2019-line core implementation: ScatterTarget/CounterTarget/DeterministicMeta/DeterministicRegistry for per-build stable helper naming, codegen helpers (emit_deterministic_atomic, _det_wrap_slot_store, etc.), and the Python-side launch orchestration. Overflow detection fires after partial scatter results are committed (P2 concern), and integer atomic_add in @wp.func helpers can double-execute in counter kernels when has_consumed_atomic=False (flagged in prior threads).
warp/native/deterministic.cu New 1382-line CUDA implementation: BinnedFloatAccumulator (Ahrens-Demmel-Nguyen GPU_TO_GPU reproducible summation), CUB-based radix sort + segmented reduce-by-key, CUB prefix scan, and counter writeback. Bounds checks correctly guard dest < 0 sentinel records. All 8 scalar types dispatched correctly including wp::half for float16 (type_id=0).
warp/native/deterministic.h New 300-line device-side header with det_scatter_buf_t, det_counter_buf_t, det_ctx structs; device functions; and 10 phase-gating macros (CUDA and CPU variants). Logic is correct; counter_add has appropriate overflow guards and two-pass phase branching.
warp/_src/codegen.py ~266 lines changed to thread DeterministicMeta/DeterministicRegistry through Adjoint, add det_ctx args/locals to kernel/function codegen, and gate atomic emissions via emit_deterministic_atomic. The _det_atomic_return_discarded pre-mark in emit_Expr correctly determines whether to route through scatter or counter path.
warp/_src/context.py ~290 lines changed: adds ctypes structs, module options, Graph._deterministic_buffer_refs for buffer lifetime during graph replay, _refresh_deterministic_launch_metadata for cache-hit metadata repopulation. _refresh_deterministic_launch_metadata is called outside _codegen_lock (thread-safety concern flagged in prior threads).
warp/config.py Adds DeterministicMode enum (NOT_GUARANTEED=0, RUN_TO_RUN=1, GPU_TO_GPU=2), three new config fields, and type validation. Clean implementation with no issues found.
warp/tests/deterministic/test_deterministic_scatter.py New test file covering float accumulation atomics, += lowering, float64, multi-array, and 2D indexing determinism.
warp/tests/deterministic/test_deterministic_counter.py New test file covering counter/allocator reproducibility, mixed scatter+counter patterns, phase-0 suppression, and integer accumulation alongside consumed atomics.
warp/tests/deterministic/test_deterministic_backward.py New test file verifying backward pass gradient correctness under deterministic mode.
warp/tests/deterministic/test_deterministic_graph_capture.py New test file for record_cmd=True (DeterministicLaunch) and CUDA graph capture/replay with buffer lifetime verification.
warp/tests/test_unique_module.py 123 lines added: tests module option identity, cache-hit det_meta repopulation, and global config capture at module creation time.

Sequence Diagram

sequenceDiagram
    participant User as Python User
    participant Launch as launch()
    participant LD as launch_deterministic()
    participant Kernel as GPU Kernel
    participant Sort as wp_deterministic_sort_reduce
    participant Scan as wp_deterministic_counter_scan
    participant WB as wp_deterministic_counter_writeback

    User->>Launch: "wp.launch(kernel, deterministic=RUN_TO_RUN)"
    Launch->>Launch: detect scatter/counter targets in adj.det_meta

    alt Scatter-only kernel (accumulation atomics, discarded return)
        Launch->>LD: "launch_deterministic(phase=1)"
        LD->>Kernel: "launch with phase=1, scatter_buf"
        Kernel->>Kernel: "WP_DET_SCATTER_OR_FALLBACK scatter(key=dest<<32|tid, val)"
        LD->>Sort: wp_deterministic_sort_reduce_device(buf, dest_array)
        Sort->>Sort: CUB radix sort by key (dest_idx, thread_id)
        Sort->>Sort: segmented reduce-by-key deterministic accumulation
        Sort-->>LD: results written to dest_array
    else Counter kernel (consumed atomic_add return)
        Note over LD,WB: Phase 0 - counting pass
        Launch->>LD: "launch_deterministic(phase=0)"
        LD->>Kernel: "launch with phase=0, counter_buf"
        Kernel->>Kernel: "WP_DET_COUNTER_OR_FALLBACK counter_add phase=0"
        Kernel->>Kernel: WP_DET_STORE_IF_ACTIVE suppresses array stores
        Kernel->>Kernel: WP_DET_SIDE_EFFECT_IF_ACTIVE suppresses integer accum atomics
        LD->>Scan: wp_deterministic_counter_scan_device(counter_buf)
        Scan->>Scan: "CUB prefix scan prefix[slot] = deterministic base offset"
        Note over LD,WB: Phase 1 - replay with deterministic slots
        LD->>Kernel: "launch with phase=1, counter_buf (prefixes filled)"
        Kernel->>Kernel: "WP_DET_COUNTER_OR_FALLBACK counter_add phase=1 return prefixes[slot]"
        LD->>WB: wp_deterministic_counter_writeback_device(counter_buf, user_counter_array)
        WB-->>LD: user counter updated to deterministic total
    end
Loading

Reviews (21): Last reviewed commit: "Clean up naming of internal helpers in d..." | Re-trigger Greptile

Comment thread warp/_src/deterministic.py Outdated
Comment thread warp/tests/test_deterministic.py Outdated
Comment thread warp/_src/context.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (1)
warp/native/deterministic.h (1)

41-49: Clarify the non-CUDA branch comment.

Line 41 says “direct accumulation,” but Lines 42-48 are a no-op. Rewording this would prevent confusion during maintenance.

✏️ Suggested comment tweak
-    // CPU path: direct accumulation (CPU kernels are sequential).
+    // Non-CUDA path: no-op for this helper (CPU execution does not use scatter buffers).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/native/deterministic.h` around lines 41 - 49, The comment "CPU path:
direct accumulation (CPU kernels are sequential)" is misleading because the
non-CUDA branch simply voids a set of unused variables (keys, values, counter,
capacity, dest_flat_idx, thread_id, value); change the comment near that void
list in deterministic.h to clearly state these variables are intentionally
unused in the non-CUDA build (e.g., "Non-CUDA build: no per-thread accumulation
— explicitly mark these kernel-specific variables as unused") so maintainers
understand why the (void)XXX lines are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 7-11: Update the CHANGELOG entry (mentioning
wp.config.deterministic) to be API-level and include a GitHub issue/PR
reference: remove internal implementation details like "scatter-sort-reduce" and
"two-pass execution with prefix-sum-based slot assignment", instead describe the
user-visible change (e.g., "Added a deterministic execution mode for atomic
operations via wp.config.deterministic = True that makes atomic accumulations
reproducible across runs"), append a short note about scope
(global/module/kernel) and add the GH issue/PR number (e.g., "See `#1234`") and
affected version/release tag.

In `@design/deterministic-execution.md`:
- Around line 108-109: The doc currently mixes exclusive and inclusive scan
conventions: update the design to use a single convention for
wp.utils.array_scan(writeback rule) — either always document it as exclusive
(inclusive=False) and state that the total is computed as prefix[last] +
contrib[last], or document it as inclusive (inclusive=True) and state that the
total is prefix[last]; make the descriptions at the wp.utils.array_scan example
near "Prefix sum" (previously showing inclusive=False) and the section that
references "total comes from the last element" (lines ~159-160) consistent by
choosing one convention and adjusting the text to reference
wp.utils.array_scan(contrib, prefix, inclusive=<True/False>) and the
corresponding rule for computing the thread-local total.

In `@warp/_src/codegen.py`:
- Around line 42-44: The frozenset _DET_ORDER_DEPENDENT_ATOMICS (containing
"atomic_cas" and "atomic_exch") is defined but unused; either remove it or
implement the promised "warn but don't intercept" behavior: locate the code path
that dispatches/handles atomic ops (e.g., the function that processes atomic
intrinsics or emits atomics such as the atomic handling/emit function in
codegen.py), and when an atomic op name is in _DET_ORDER_DEPENDENT_ATOMICS, emit
a clear warning (use warnings.warn or the module logger) stating that
order-dependent atomics are not intercepted and will run with native ordering,
then continue normal processing without interception; if you prefer removal,
delete the _DET_ORDER_DEPENDENT_ATOMICS constant and any related comment and add
a brief unit test or code comment documenting the choice.
- Around line 1881-1882: The fallback that sets flat_idx_expr = "0" when ndim >
4 is unsafe because it silently writes everything to index 0; replace this
silent fallback with an explicit error: detect the unsupported ndim case in the
same code path in codegen.py (the branch that currently assigns flat_idx_expr)
and raise a clear exception (e.g., ValueError or RuntimeError) that includes the
invalid ndim value and a message stating Warp arrays support up to 4 dimensions;
do not assign "0" as a default index.

In `@warp/_src/context.py`:
- Around line 7477-7480: set_param_at_index_from_ctype() and
set_params_from_ctypes() mutate self.params but do not keep self.fwd_args in
sync, so _launch_deterministic() can replay using stale array objects; update
those methods to mirror the logic in set_param_at_index(): when adjoint is False
and the target index is within range(len(self.fwd_args)), assign the new value
into self.fwd_args[index] (or for bulk updates, update the corresponding
slice/indices) — or refactor those methods to call set_param_at_index(index,
value, adjoint) for each changed param so fwd_args stays consistent with params
for deterministic replay.
- Around line 7589-7602: The two-pass path must preserve the caller stream and
keep the counter's initial value: ensure the intermediate scans
(warp._src.utils.array_scan calls), the warp.copy that writes the total back
into the counter array, and the subsequent run_sort_reduce invocation are
executed on the same stream passed into this path (propagate the local stream
object into those calls or use stream-aware variants) and when writing the
counter combine the existing counter value with this-launch total (read the
current counter_arr[0], add inclusive_out[dim_size-1] and write the sum back)
instead of overwriting; update references around array_scan, inclusive_out,
counter_arr (found via kernel.adj.args and fwd_args), warp.copy and
run_sort_reduce so they all use the caller stream and perform an atomic/ordered
add of the previous counter value plus the new total.

In `@warp/_src/deterministic.py`:
- Around line 140-161: The mapping and helpers currently claim support for
warp.float16 but the native reducer entrypoint
wp_deterministic_sort_reduce_float_device (and the C++ reinterpret-casts in
native/deterministic.cu) do not handle 16-bit halves; update _WARP_TO_CTYPE and
the conversion helpers to fail fast for half: remove or change the "wp::half"
mapping so warp_type_to_ctype raises for warp.float16 (or explicitly check and
raise in is_float_type/warp_type_to_ctype), and add a clear ValueError
mentioning warp.float16 and wp_deterministic_sort_reduce_float_device so any
attempt to use half reductions immediately errors until a real half reducer is
implemented.

In `@warp/tests/test_deterministic.py`:
- Around line 789-790: Remove the call to wp.clear_kernel_cache() from this test
module (including the __main__ block) because wp.clear_kernel_cache() is
disallowed in test files; simply delete the line invoking
wp.clear_kernel_cache() so no cache-clearing is performed here.

In `@warp/tests/unittest_suites.py`:
- Line 142: The default_suite() function imports TestDeterministic but never
includes it in the test_classes list, so deterministic tests are skipped; update
the test_classes array/variable inside default_suite() to include
TestDeterministic (alongside the other classes), ensuring the symbol
TestDeterministic is added to the list used to build the suite returned by
default_suite().

---

Nitpick comments:
In `@warp/native/deterministic.h`:
- Around line 41-49: The comment "CPU path: direct accumulation (CPU kernels are
sequential)" is misleading because the non-CUDA branch simply voids a set of
unused variables (keys, values, counter, capacity, dest_flat_idx, thread_id,
value); change the comment near that void list in deterministic.h to clearly
state these variables are intentionally unused in the non-CUDA build (e.g.,
"Non-CUDA build: no per-thread accumulation — explicitly mark these
kernel-specific variables as unused") so maintainers understand why the
(void)XXX lines are present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2e7bc7a6-3bed-4194-bd96-d94feaed2cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 53a7bf5 and 5ec9b25.

📒 Files selected for processing (13)
  • CHANGELOG.md
  • build_lib.py
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/deterministic.h
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
  • warp/tests/unittest_suites.py

Comment thread CHANGELOG.md Outdated
Comment thread design/deterministic-execution.md Outdated
Comment thread warp/_src/codegen.py Outdated
Comment thread warp/_src/codegen.py Outdated
Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/deterministic.py Outdated
Comment thread warp/tests/test_deterministic.py Outdated
Comment thread warp/tests/unittest_suites.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
warp/_src/deterministic.py (1)

140-143: ⚠️ Potential issue | 🔴 Critical

Fail fast for warp.float16 deterministic reductions.

warp.float16 is still wired through the deterministic scatter/reduce path, but the native entry points only reinterpret buffers as float* or double*. Half buffers will therefore be read with the wrong element size and produce corrupted reductions. Please reject this path until a real half reducer exists.

Minimal safe guard
-        if target.value_ctype in ("float", "wp::half"):
+        if target.value_ctype == "float":
             fn = runtime.core.wp_deterministic_sort_reduce_float_device
+        elif target.value_ctype == "wp::half":
+            raise RuntimeError("Deterministic float16 atomics are not supported yet.")
         elif target.value_ctype == "double":
             fn = runtime.core.wp_deterministic_sort_reduce_double_device

Also applies to: 159-161, 177-180, 219-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 140 - 143, The mapping
_WARP_TO_CTYPE currently includes warp.float16 but the deterministic
scatter/reduce path does not support half-precision and will read buffers with
wrong element size; update deterministic.py to fail fast whenever warp.float16
is encountered in the deterministic reduction code paths by removing or guarding
the warp.float16 entry in _WARP_TO_CTYPE and adding an explicit check that
raises a clear exception (or returns an error) when a reducer or conversion
function (the code paths that reference _WARP_TO_CTYPE) sees warp.float16;
ensure the exception message names warp.float16 and the deterministic reduction
path so callers get a clear rejection until a proper half reducer is
implemented.
warp/tests/test_deterministic.py (1)

955-956: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is disallowed in tests and can crash parallel CI runs.

Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 955 - 956, Remove the
disallowed kernel-cache clear from the module-level __main__ block: delete the
call to wp.clear_kernel_cache() found in the if __name__ == "__main__": section
of warp/tests/test_deterministic.py; do not replace it with any cache-clearing
call (wp.clear_lto_cache or similar) and ensure no other module-scope or
__main__-scoped cache-clear calls remain.
design/deterministic-execution.md (1)

111-112: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the doc.

Lines 111-112 describe an exclusive scan, but Lines 170-172 still say the total comes from the last element of an inclusive scan. Please make the writeback rule consistent with the convention the implementation actually uses.

Also applies to: 170-172

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 111 - 112, The doc is
inconsistent about scan convention: wp.utils.array_scan(contrib, prefix,
inclusive=False) is described as exclusive at lines 111-112 but later (lines
170-172) claims the total comes from the last element of an inclusive scan; pick
one convention and make the writeback rule consistent with the actual
implementation. Update the description of wp.utils.array_scan, the
example/notation for "prefix" and the writeback rule (the statement about where
the total/last-offset is read) so they all use the same convention (either
inclusive or exclusive) and mention the inclusive flag (inclusive=False/True) in
the writeback explanation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@warp/_src/codegen.py`:
- Around line 1647-1650: Pattern B (two-pass deterministic atomic interception)
only runs when _det_in_assign is set during emit_Assign(), so atomic calls
nested in subscripts, call arguments, or larger expressions (e.g.,
output[wp.atomic_add(...)]) never use adj._emit_deterministic_atomic and remain
nondeterministic; update the compiler to propagate the
"deterministic-assignment" context beyond plain Assign RHS evaluation by
checking _det_in_assign (or equivalent flag) within expression emitters that can
produce targets/indices/args — specifically modify emit_Subscript, emit_Call
(and other expression emitters referenced around lines 3336-3345) to consult
adj._emit_deterministic_atomic for funcs in _DET_INTERCEPTABLE_ATOMICS (using
func.is_builtin() and func.key) and call
adj._emit_deterministic_atomic(bound_args, return_type, output, output_list)
when the flag is active so atomics inside subscripts/args follow the two-pass
deterministic path.
- Around line 1834-1849: The Pattern B counter/allocator block that runs when
return_is_consumed currently assumes an add-style atomic and unconditionally
emits contrib += value and prefix returns; restrict this path to only add-style
atomics: inside the branch guarded by return_is_consumed (the block that calls
get_or_create_counter_target and writes _wp_det_contrib and _wp_det_prefix via
adj.add_forward), inspect the counter target's atomic/op kind (from the target
returned by get_or_create_counter_target — e.g., target.op, target.atomic_op, or
whatever property names the CounterTarget uses) and only emit Pattern B when
that property indicates an atomic_add; otherwise fail fast (raise a clear
NotImplementedError or emit an error) explaining that consumed-return semantics
for atomic_sub/atomic_max are not implemented. Ensure the check is placed before
emitting the adj.add_forward code so non-add atomics do not get incorrect
rewrites.

In `@warp/_src/deterministic.py`:
- Around line 191-200: The scratch buffers in allocate_counter_buffers always
use warp.int32 which truncates non-32-bit counters; update
allocate_counter_buffers to inspect each CounterTarget.value_ctype and allocate
contrib/prefix with the matching dtype (e.g., warp.int64 or warp.uint64) so the
generated ABI names (_wp_det_contrib/_wp_det_prefix) match the counter width, or
if supporting only 32-bit is preferred, raise an explicit error when a
CounterTarget has a non-32-bit value_ctype; modify allocate_counter_buffers to
perform this dtype-selection/check using the CounterTarget.value_ctype before
creating warp.zeros/warp.empty.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 111-112: The doc is inconsistent about scan convention:
wp.utils.array_scan(contrib, prefix, inclusive=False) is described as exclusive
at lines 111-112 but later (lines 170-172) claims the total comes from the last
element of an inclusive scan; pick one convention and make the writeback rule
consistent with the actual implementation. Update the description of
wp.utils.array_scan, the example/notation for "prefix" and the writeback rule
(the statement about where the total/last-offset is read) so they all use the
same convention (either inclusive or exclusive) and mention the inclusive flag
(inclusive=False/True) in the writeback explanation.

In `@warp/_src/deterministic.py`:
- Around line 140-143: The mapping _WARP_TO_CTYPE currently includes
warp.float16 but the deterministic scatter/reduce path does not support
half-precision and will read buffers with wrong element size; update
deterministic.py to fail fast whenever warp.float16 is encountered in the
deterministic reduction code paths by removing or guarding the warp.float16
entry in _WARP_TO_CTYPE and adding an explicit check that raises a clear
exception (or returns an error) when a reducer or conversion function (the code
paths that reference _WARP_TO_CTYPE) sees warp.float16; ensure the exception
message names warp.float16 and the deterministic reduction path so callers get a
clear rejection until a proper half reducer is implemented.

In `@warp/tests/test_deterministic.py`:
- Around line 955-956: Remove the disallowed kernel-cache clear from the
module-level __main__ block: delete the call to wp.clear_kernel_cache() found in
the if __name__ == "__main__": section of warp/tests/test_deterministic.py; do
not replace it with any cache-clearing call (wp.clear_lto_cache or similar) and
ensure no other module-scope or __main__-scoped cache-clear calls remain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4ca0283f-84b8-4939-90c1-9a656d3d0098

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec9b25 and 72e8e3f.

📒 Files selected for processing (8)
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cu
  • warp/native/deterministic.h
  • warp/tests/test_deterministic.py
✅ Files skipped from review due to trivial changes (2)
  • warp/native/deterministic.h
  • warp/native/deterministic.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • warp/_src/context.py

Comment thread warp/_src/codegen.py
Comment thread warp/_src/codegen.py Outdated
Comment thread warp/_src/deterministic.py Outdated
@greptile-apps

greptile-apps Bot commented Apr 10, 2026

Copy link
Copy Markdown

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Comment thread warp/_src/deterministic.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
warp/tests/test_deterministic.py (1)

1143-1145: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is explicitly disallowed in test files and can break parallel test runs.

Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 1143 - 1145, Remove the call
to wp.clear_kernel_cache() from the module-level __main__ block; locate the if
__name__ == "__main__": block (where wp.clear_kernel_cache() is called before
unittest.main) and delete that wp.clear_kernel_cache() invocation so tests no
longer call kernel-cache clearing from the test module.
design/deterministic-execution.md (1)

114-115: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the design.

The Pattern B section documents array_scan(..., inclusive=False), but the writeback section still says the total comes from the last element of an inclusive scan. Please pick one convention and make the total-count rule match it everywhere.

Also applies to: 173-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 114 - 115, The docs use two
different scan conventions; standardize on one and make all references
consistent: choose whether wp.utils.array_scan(contrib, prefix, inclusive=True)
or inclusive=False is the canonical API, then update Pattern B and the writeback
section (and the other occurrence at lines ~173-175) so the phrase "total comes
from the last element" matches that convention (e.g., if using inclusive=False,
state that total = prefix[-1] + contrib[-1] or if inclusive=True, state total =
prefix[-1]); update any explanatory text and examples to reference
wp.utils.array_scan(..., inclusive=...) and the single total-count rule
consistently.
🧹 Nitpick comments (1)
warp/tests/test_deterministic.py (1)

692-753: Assert the actual slot order, not just counts/permutations.

counter_kernel should yield output == data_np, and conditional_counter_kernel should yield output[:expected_count] == data_np[data_np > threshold]. Right now a stable but wrong permutation would still pass because these tests only check counts, sorting, and cross-run reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 692 - 753, Update the two
tests to assert actual slot order rather than just counts/permutations: in
test_counter_correctness replace the sorted/permutation check with a direct
equality assertion that output.numpy() (or output.numpy().tolist()) exactly
equals data_np (ensuring dtype/shape match) to verify counter_kernel produces
output == data_np; in test_conditional_counter replace the permutation-based
checks with a direct equality assertion that output.numpy()[:expected_count]
equals data_np[data_np > threshold] to verify conditional_counter_kernel
preserves the original relative order of selected elements; keep the existing
cross-run determinism checks but change the comparisons to elementwise equality
rather than sorted/comparison of permutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 114-115: The docs use two different scan conventions; standardize
on one and make all references consistent: choose whether
wp.utils.array_scan(contrib, prefix, inclusive=True) or inclusive=False is the
canonical API, then update Pattern B and the writeback section (and the other
occurrence at lines ~173-175) so the phrase "total comes from the last element"
matches that convention (e.g., if using inclusive=False, state that total =
prefix[-1] + contrib[-1] or if inclusive=True, state total = prefix[-1]); update
any explanatory text and examples to reference wp.utils.array_scan(...,
inclusive=...) and the single total-count rule consistently.

In `@warp/tests/test_deterministic.py`:
- Around line 1143-1145: Remove the call to wp.clear_kernel_cache() from the
module-level __main__ block; locate the if __name__ == "__main__": block (where
wp.clear_kernel_cache() is called before unittest.main) and delete that
wp.clear_kernel_cache() invocation so tests no longer call kernel-cache clearing
from the test module.

---

Nitpick comments:
In `@warp/tests/test_deterministic.py`:
- Around line 692-753: Update the two tests to assert actual slot order rather
than just counts/permutations: in test_counter_correctness replace the
sorted/permutation check with a direct equality assertion that output.numpy()
(or output.numpy().tolist()) exactly equals data_np (ensuring dtype/shape match)
to verify counter_kernel produces output == data_np; in test_conditional_counter
replace the permutation-based checks with a direct equality assertion that
output.numpy()[:expected_count] equals data_np[data_np > threshold] to verify
conditional_counter_kernel preserves the original relative order of selected
elements; keep the existing cross-run determinism checks but change the
comparisons to elementwise equality rather than sorted/comparison of
permutations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a316d310-4647-410d-a32d-1acc2968e8e5

📥 Commits

Reviewing files that changed from the base of the PR and between 72e8e3f and 48e7207.

📒 Files selected for processing (8)
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • warp/native/deterministic.cpp
  • warp/_src/context.py
  • warp/_src/codegen.py
  • warp/_src/deterministic.py

@shi-eric shi-eric added this to the 1.14.0 milestone Apr 10, 2026
Comment thread warp/_src/codegen.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (5)
warp/_src/context.py (2)

7628-7641: ⚠️ Potential issue | 🟠 Major

Keep the count/scan/reduce fixups on stream, and accumulate the existing counter value.

array_scan(), warp.copy(), and run_sort_reduce() all omit stream here, so an explicit non-current stream or graph capture can reorder or escape the deterministic pass. Also, Line 7641 writes only this launch's total into counter_arr[0]; a counter that started at N will end at total instead of N + total.

Also applies to: 7670-7670

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7628 - 7641, The code omits the stream
argument and overwrites existing counter values; update calls to
warp._src.utils.array_scan(contrib, prefix, inclusive=False) and the inclusive
scan warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well
as warp.copy and any run_sort_reduce invocations to pass the current non-default
stream (preserve graph capture/determinism), and change the write into
counter_arr (found by iterating kernel.adj.args for ct.array_var_label and
obtained from fwd_args[j]) to add the inclusive_out last-element to the existing
counter value instead of replacing it (i.e., read counter_arr[0], add
inclusive_out[dim_size-1], and write the sum back using warp.copy on the given
stream); apply the same stream+accumulate fix to the similar block around the
other occurrence near the 7670 location.

7505-7508: ⚠️ Potential issue | 🟠 Major

Raw-ctype array updates are still unsafe for deterministic replays.

Line 7507 only keeps fwd_args aligned for set_param_at_index(). The inherited set_param_at_index_from_ctype() / set_params_from_ctypes() can still replace an array descriptor while _launch_deterministic() later uses the stale Python array object for warp.copy() and run_sort_reduce(). Recorded deterministic launches can therefore replay into the wrong array, or fail once fwd_args and self.params diverge. If these APIs stay exposed, DeterministicLaunch needs to reject array-ctype updates or carry the owning Warp array alongside the packed descriptor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7505 - 7508, The deterministic replay bug:
set_param_at_index only updates fwd_args but inherited
set_param_at_index_from_ctype / set_params_from_ctypes can replace raw C-type
array descriptors in self.params leaving fwd_args stale, which breaks
_launch_deterministic and later warp.copy/run_sort_reduce replay; to fix, either
(A) make DeterministicLaunch reject/raise when set_param_at_index_from_ctype or
set_params_from_ctypes is used with array-ctypes, or (B) ensure those
ctype-updating paths also update the corresponding Python owning Warp array
stored in fwd_args (and keep self.params and fwd_args synchronized) so that
set_param_at_index_from_ctype / set_params_from_ctypes maintain alignment with
fwd_args before _launch_deterministic runs.
design/deterministic-execution.md (1)

114-115: ⚠️ Potential issue | 🟡 Minor

Use one scan convention in the design doc.

Line 114 documents wp.utils.array_scan(contrib, prefix, inclusive=False), but Lines 176-178 say the total comes from the last element of an inclusive scan. Please make both sections describe the same writeback rule. If the scan stays exclusive, the total is prefix[-1] + contrib[-1].

Also applies to: 176-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 114 - 115, The doc currently
mixes scan conventions: update both occurrences to use the same convention for
wp.utils.array_scan(contrib, prefix, inclusive=False) and the writeback rule;
either change the function call to inclusive=True and state the total is
prefix[-1], or keep inclusive=False and change the total calculation text (the
section that currently says the total comes from the last element of an
inclusive scan) to the exclusive rule: total = prefix[-1] + contrib[-1]; make
this consistent for the wp.utils.array_scan(contrib, prefix, inclusive=False)
mention and the later total computation paragraph.
warp/tests/test_deterministic.py (1)

1190-1192: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is explicitly disallowed in test files and can crash parallel test runs. unittest.main() is enough here.

🧹 Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 1190 - 1192, Remove the
explicit wp.clear_kernel_cache() call from the module-level __main__ block in
this test file; keep only the unittest.main(verbosity=2) invocation so the test
can run without clearing kernel caches (do not add wp.clear_lto_cache() either).
Locate the __main__ guard containing wp.clear_kernel_cache() and delete that
call, ensuring the block now simply calls unittest.main(...) and nothing else.
warp/_src/deterministic.py (1)

228-231: ⚠️ Potential issue | 🟠 Major

Counter scratch buffers still hard-code wp.int32.

CounterTarget.value_ctype is tracked, but contrib and prefix are always allocated as wp.int32. Any deterministic wp.int64/wp.uint64 counter will silently wrap in Phase 0/1. Please make these scratch buffers follow value_ctype, or fail fast for non-32-bit counters until the wider ABI is wired through end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 228 - 231, The loop that allocates
counter scratch buffers currently hardcodes dtype=warp.int32 for contrib and
prefix; change the allocation to use each CounterTarget's value_ctype (e.g., use
dtype=target.value_ctype or map that C type to the corresponding warp dtype) so
contrib and prefix match CounterTarget.value_ctype and avoid silent wrapping, or
add an explicit fast-fail that raises if target.value_ctype is not a 32-bit
integer type (when full 64-bit ABI isn't supported). Update the allocations in
the for _target in counter_targets block (the contrib and prefix
warp.zeros/warp.empty calls) to derive dtype from the target's value_ctype and
keep shape= (dim_size,) and device=device.
🧹 Nitpick comments (2)
warp/tests/test_deterministic.py (1)

850-879: test_module_option_override() doesn't prove the override is active on CUDA.

A single approximate-sum assertion will pass whether this kernel took the deterministic path or the normal atomic path, especially on CPU where both paths are deterministic. Please mirror test_kernel_decorator_override() here: run the kernel several times on CUDA and assert bit-exact equality across runs, then keep the sum check as a secondary sanity check if you want.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 850 - 879,
test_module_option_override currently only checks approximate sums so it doesn't
verify the per-module deterministic=True override on CUDA; update the test
(function test_module_option_override) to mirror test_kernel_decorator_override
by: when device is CUDA (check device == "cuda" or via wp.get_device_name/device
backend), run the kernel per_kernel_det multiple times (e.g., 3+ runs) into
separate outputs and assert bit-exact equality of the outputs across runs (use
output.numpy() equality checks) to prove determinism, while retaining the
existing sum/assert_allclose sanity check as a secondary assertion; ensure you
still toggle wp.config.deterministic = False around the launch to verify the
per-module override takes effect.
warp/_src/deterministic.py (1)

193-193: Annotate device with DeviceLike.

These helper signatures take device untyped, which diverges from the repo's device-parameter convention and makes this launch path harder to type-check consistently.

As per coding guidelines, "Use DeviceLike type annotation (from warp._src.context) for device parameters. Import under TYPE_CHECKING to avoid circular imports."

Also applies to: 222-222, 235-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` at line 193, The device parameter in
allocate_scatter_buffers should be annotated with DeviceLike: import DeviceLike
from warp._src.context inside a TYPE_CHECKING block (from typing import
TYPE_CHECKING) to avoid circular imports, then change the function signature to
accept device: DeviceLike; apply the same change to the other two nearby helper
functions in this module that take a device parameter so all device-typed helper
signatures use DeviceLike consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@warp/_src/context.py`:
- Around line 7510-7526: Before calling _launch_deterministic in launch, guard
the deterministic path with the same null checks used by the regular CUDA path:
verify self.kernel is not None and that self.hooks.forward exists; if either
check fails raise a clear Python-side RuntimeError (or appropriate exception)
with a descriptive message instead of calling _launch_deterministic with a null
kernel handle. Apply the same change to DeterministicLaunch.launch to keep
behavior consistent.
- Around line 2198-2205: ModuleHasher.hash_kernel currently only hashes
kernel.key and kernel.adj which allows reuse of stale cached modules when
per-kernel overrides (e.g., kernel.options like deterministic or
deterministic_max_records) change; update ModuleHasher.hash_kernel to
incorporate kernel.options (or at minimum the deterministic and
deterministic_max_records flags/values) into the computed hash so that
Module.load and the unique-module cache distinguish kernels with different
per-kernel deterministic overrides, and ensure any code paths that call
hash_kernel (e.g., Module.load) will therefore generate distinct hashes when
those option values differ.

In `@warp/_src/deterministic.py`:
- Around line 68-74: CounterTarget is currently keyed only by array_var_label
causing get_or_create_counter_target() to collapse all logical counters (e.g.,
wp.atomic_add(counter, bucket, 1) across different bucket indices) into one
global sequence; fix by including the logical index in the CounterTarget
key/metadata (add a field for the logical counter index and propagate it from
the codegen caller into get_or_create_counter_target()), or alternatively
enforce deterministic mode by detecting non-constant-zero indices in
wp.atomic_add (and reject/raise an error) so Phase 1 cannot hand out global
offsets for variable indices; update all related uses (CounterTarget class,
get_or_create_counter_target(), and the call sites that build Pattern B
metadata) to reflect the new keying or the rejection behavior.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 114-115: The doc currently mixes scan conventions: update both
occurrences to use the same convention for wp.utils.array_scan(contrib, prefix,
inclusive=False) and the writeback rule; either change the function call to
inclusive=True and state the total is prefix[-1], or keep inclusive=False and
change the total calculation text (the section that currently says the total
comes from the last element of an inclusive scan) to the exclusive rule: total =
prefix[-1] + contrib[-1]; make this consistent for the
wp.utils.array_scan(contrib, prefix, inclusive=False) mention and the later
total computation paragraph.

In `@warp/_src/context.py`:
- Around line 7628-7641: The code omits the stream argument and overwrites
existing counter values; update calls to warp._src.utils.array_scan(contrib,
prefix, inclusive=False) and the inclusive scan
warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well as
warp.copy and any run_sort_reduce invocations to pass the current non-default
stream (preserve graph capture/determinism), and change the write into
counter_arr (found by iterating kernel.adj.args for ct.array_var_label and
obtained from fwd_args[j]) to add the inclusive_out last-element to the existing
counter value instead of replacing it (i.e., read counter_arr[0], add
inclusive_out[dim_size-1], and write the sum back using warp.copy on the given
stream); apply the same stream+accumulate fix to the similar block around the
other occurrence near the 7670 location.
- Around line 7505-7508: The deterministic replay bug: set_param_at_index only
updates fwd_args but inherited set_param_at_index_from_ctype /
set_params_from_ctypes can replace raw C-type array descriptors in self.params
leaving fwd_args stale, which breaks _launch_deterministic and later
warp.copy/run_sort_reduce replay; to fix, either (A) make DeterministicLaunch
reject/raise when set_param_at_index_from_ctype or set_params_from_ctypes is
used with array-ctypes, or (B) ensure those ctype-updating paths also update the
corresponding Python owning Warp array stored in fwd_args (and keep self.params
and fwd_args synchronized) so that set_param_at_index_from_ctype /
set_params_from_ctypes maintain alignment with fwd_args before
_launch_deterministic runs.

In `@warp/_src/deterministic.py`:
- Around line 228-231: The loop that allocates counter scratch buffers currently
hardcodes dtype=warp.int32 for contrib and prefix; change the allocation to use
each CounterTarget's value_ctype (e.g., use dtype=target.value_ctype or map that
C type to the corresponding warp dtype) so contrib and prefix match
CounterTarget.value_ctype and avoid silent wrapping, or add an explicit
fast-fail that raises if target.value_ctype is not a 32-bit integer type (when
full 64-bit ABI isn't supported). Update the allocations in the for _target in
counter_targets block (the contrib and prefix warp.zeros/warp.empty calls) to
derive dtype from the target's value_ctype and keep shape= (dim_size,) and
device=device.

In `@warp/tests/test_deterministic.py`:
- Around line 1190-1192: Remove the explicit wp.clear_kernel_cache() call from
the module-level __main__ block in this test file; keep only the
unittest.main(verbosity=2) invocation so the test can run without clearing
kernel caches (do not add wp.clear_lto_cache() either). Locate the __main__
guard containing wp.clear_kernel_cache() and delete that call, ensuring the
block now simply calls unittest.main(...) and nothing else.

---

Nitpick comments:
In `@warp/_src/deterministic.py`:
- Line 193: The device parameter in allocate_scatter_buffers should be annotated
with DeviceLike: import DeviceLike from warp._src.context inside a TYPE_CHECKING
block (from typing import TYPE_CHECKING) to avoid circular imports, then change
the function signature to accept device: DeviceLike; apply the same change to
the other two nearby helper functions in this module that take a device
parameter so all device-typed helper signatures use DeviceLike consistently.

In `@warp/tests/test_deterministic.py`:
- Around line 850-879: test_module_option_override currently only checks
approximate sums so it doesn't verify the per-module deterministic=True override
on CUDA; update the test (function test_module_option_override) to mirror
test_kernel_decorator_override by: when device is CUDA (check device == "cuda"
or via wp.get_device_name/device backend), run the kernel per_kernel_det
multiple times (e.g., 3+ runs) into separate outputs and assert bit-exact
equality of the outputs across runs (use output.numpy() equality checks) to
prove determinism, while retaining the existing sum/assert_allclose sanity check
as a secondary assertion; ensure you still toggle wp.config.deterministic =
False around the launch to verify the per-module override takes effect.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f76535da-266c-45ce-9df4-48647bb5e6f1

📥 Commits

Reviewing files that changed from the base of the PR and between 48e7207 and 3170255.

📒 Files selected for processing (4)
  • design/deterministic-execution.md
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/tests/test_deterministic.py

Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/deterministic.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
warp/_src/context.py (3)

7532-7535: ⚠️ Potential issue | 🟠 Major

Keep fwd_args synchronized for ctype-based param mutators too.

Line 7532 only syncs self.fwd_args for set_param_at_index(). Calls through set_param_at_index_from_ctype() / set_params_from_ctypes() can still leave fwd_args stale, so deterministic replay can post-process the wrong arrays.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7532 - 7535, The fwd_args list is only
updated in set_param_at_index, so calls that mutate params via ctype helpers
leave fwd_args stale; modify set_param_at_index_from_ctype and
set_params_from_ctypes to also synchronize self.fwd_args (either by delegating
to set_param_at_index for each change or by applying the same
index-check-and-assign logic used there), honoring the adjoint flag so adjoint
updates do not overwrite fwd_args.

7541-7553: ⚠️ Potential issue | 🟡 Minor

Preserve the forward-hook null guard in deterministic launch routing.

These deterministic branches call _launch_deterministic() without the regular forward-hook existence check. If hooks.forward is missing, this can pass a null kernel handle to CUDA launch instead of raising the clearer Python-side error.

Also applies to: 7825-7854

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7541 - 7553, The deterministic launch
branches call _launch_deterministic(...) without checking for hooks.forward,
which can allow a null kernel handle to reach CUDA; add the same forward-hook
null guard used in the non-deterministic path: verify hooks.forward (or
equivalent forward kernel handle) is present before invoking
_launch_deterministic, and if missing raise the same Python-side error/exception
used elsewhere so we fail early and clearly; apply the same fix to the other
deterministic branch around the second occurrence noted (near the 7825-7854
block).

7657-7670: ⚠️ Potential issue | 🟠 Major

Two-pass deterministic path still breaks explicit-stream ordering and counter continuity.

Line 7657/Line 7664 scans, Line 7670 counter writeback, and Line 7699 sort-reduce are not stream-bound in this path. On non-current streams/capture this can reorder work. Also, Line 7670 overwrites the counter with this-launch total instead of preserving and adding the pre-existing counter value.

Also applies to: 7699-7699

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7657 - 7670, The two-pass deterministic
path uses array_scan and warp.copy without binding them to the current
stream/capture and overwrites the counter rather than adding to the pre-existing
value; update the logic around warp._src.utils.array_scan(contrib, ...), the
inclusive_out buffer, and the writeback that finds counter_arr via
kernel.adj.args and ct.array_var_label so that (1) all scans, the inclusive_out
usage, the warp.copy writeback, and the later sort-reduce are submitted/bound to
the same current stream/capture to preserve ordering, and (2) when writing the
total into counter_arr you first read the existing counter value from
counter_arr and add the inclusive_out last element (the this-launch total) to it
(or perform an atomic-add equivalent) instead of blindly overwriting; ensure the
same stream-binding fix is applied to the sort-reduce at the other location
(around the existing sort-reduce call).
design/deterministic-execution.md (1)

129-130: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the design doc.

Pattern B says the prefix pass uses inclusive=False, but the writeback section says the total comes from “the last element of the inclusive scan.” Those are different rules, so the doc currently describes two incompatible implementations.

Also applies to: 191-193

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 129 - 130, The doc
inconsistently specifies scan conventions: Pattern B states the prefix pass uses
wp.utils.array_scan(contrib, prefix, inclusive=False) while the writeback
section refers to “the last element of the inclusive scan”; pick one convention
and update all references to match. Concretely, decide whether prefix scans are
inclusive or exclusive, then change mentions in Pattern B, the writeback
section, and other occurrences (e.g., lines ~191-193) so they all reference the
same wp.utils.array_scan(..., inclusive=...) behavior and explain how the total
is derived (either from the last element of the inclusive scan or from
last+contrib for exclusive).
warp/_src/deterministic.py (2)

283-293: ⚠️ Potential issue | 🟠 Major

Match the counter scratch buffers to the counter width.

CounterTarget.value_ctype is tracked, but allocate_counter_buffers() always allocates wp.int32 for both scratch arrays. That will truncate deterministic int64/uint64 counters unless they are rejected earlier. Either allocate these buffers from value_ctype, or fail fast for non-32-bit counters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 283 - 293, The counter scratch
buffers are always allocated as warp.int32 in allocate_counter_buffers, which
will truncate 64-bit counters; update allocate_counter_buffers to use each
CounterTarget's value_ctype (CounterTarget.value_ctype) when creating contrib
and prefix (i.e., pass dtype=_target.value_ctype to warp.zeros/warp.empty) or
alternatively raise/assert if _target.value_ctype is not a 32-bit type so
non-32-bit counters fail fast; ensure you reference the function
allocate_counter_buffers and the CounterTarget.value_ctype field when making the
change.

189-200: ⚠️ Potential issue | 🟠 Major

Don't merge every logical counter in an array into one deterministic target.

get_or_create_counter_target() only keys on array_var_label. If deterministic mode ever sees wp.atomic_add(counter, bucket, 1), every bucket in that array will reuse one contrib/prefix stream and Phase 1 will hand out slots from a single global sequence. Please include the logical counter index in the target identity, or reject non-constant-zero counter indices before creating a CounterTarget.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 189 - 200,
get_or_create_counter_target currently deduplicates targets solely by
array_var_label which causes all array elements to share one deterministic
counter; update get_or_create_counter_target to include the logical counter
index in the identity (e.g., use a tuple of (array_var_label, logical_index) or
add a logical_index field to CounterTarget) so each distinct bucket gets its own
CounterTarget, and before creating a new CounterTarget validate the provided
index expression is a constant zero (or reject/non-deterministic indices) if
your design only allows index 0; ensure you update the lookup loop to compare
the new key (array_var_label plus logical index) and append the new
CounterTarget with the logical index populated when creating it.
🧹 Nitpick comments (2)
warp/_src/context.py (1)

1339-1341: Validate deterministic_max_records early for type/range.

This value is currently accepted broadly and coerced later in launch. Adding an upfront int + non-negative validation in option ingestion would fail fast and avoid silent coercion paths.

Also applies to: 2583-2585

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 1339 - 1341, Validate
deterministic_max_records immediately before assigning into kernel_options:
ensure the value is an int (or can be safely converted) and is >= 0, raising a
TypeError for non-int types and a ValueError for negative values so callers fail
fast; then assign kernel_options["deterministic_max_records"] =
deterministic_max_records only after validation. Apply the same validation logic
at the other ingestion site referenced around lines 2583-2585 to keep behavior
consistent.
warp/tests/test_unique_module.py (1)

170-205: Keep the hash assertion runnable on CPU-only jobs.

Line 172 skips the whole test, but the module.name comparison at Lines 189-193 is device-independent. Splitting this into a CPU-safe hashing assertion plus a CUDA-only launch check would preserve coverage for the unique-module hashing change on non-CUDA runners.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_unique_module.py` around lines 170 - 205, The test currently
skips the entire test when CUDA is unavailable, but the module name hashing
assertion in test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@asv/benchmarks/atomics.py`:
- Around line 242-243: The class attribute params (and param_names) in the
benchmark classes are defined as lists which creates mutable class state; change
them to tuples so they are immutable (e.g., replace params = (["normal",
"deterministic"], [1, 65536], DETERMINISTIC_BENCHMARK_SIZES) with a
tuple-of-tuples and param_names likewise) and apply the same change to the
second occurrence around lines 317-318; update the definitions referenced as
params and param_names in asv/benchmarks/atomics.py so both benchmark classes
use immutable tuples instead of lists.

In `@warp/_src/codegen.py`:
- Around line 1785-1788: The early-return that skips codegen for integer atomics
(the condition using is_float_type(scalar_dtype) and return_is_consumed) is
unsafe when a counting pass exists; modify that condition to also detect whether
the current kernel contains any consumed/counter atomic (i.e., only return None
when not is_float_type(scalar_dtype) AND not return_is_consumed AND there is no
consumed/counter atomic in the kernel). Use the existing kernel/AST context or
add a predicate (e.g., has_consumed_counter_atomic) to check for consumed
atomics before returning so phase 0 will not double-apply native integer atomics
in mixed kernels.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 129-130: The doc inconsistently specifies scan conventions:
Pattern B states the prefix pass uses wp.utils.array_scan(contrib, prefix,
inclusive=False) while the writeback section refers to “the last element of the
inclusive scan”; pick one convention and update all references to match.
Concretely, decide whether prefix scans are inclusive or exclusive, then change
mentions in Pattern B, the writeback section, and other occurrences (e.g., lines
~191-193) so they all reference the same wp.utils.array_scan(..., inclusive=...)
behavior and explain how the total is derived (either from the last element of
the inclusive scan or from last+contrib for exclusive).

In `@warp/_src/context.py`:
- Around line 7532-7535: The fwd_args list is only updated in
set_param_at_index, so calls that mutate params via ctype helpers leave fwd_args
stale; modify set_param_at_index_from_ctype and set_params_from_ctypes to also
synchronize self.fwd_args (either by delegating to set_param_at_index for each
change or by applying the same index-check-and-assign logic used there),
honoring the adjoint flag so adjoint updates do not overwrite fwd_args.
- Around line 7541-7553: The deterministic launch branches call
_launch_deterministic(...) without checking for hooks.forward, which can allow a
null kernel handle to reach CUDA; add the same forward-hook null guard used in
the non-deterministic path: verify hooks.forward (or equivalent forward kernel
handle) is present before invoking _launch_deterministic, and if missing raise
the same Python-side error/exception used elsewhere so we fail early and
clearly; apply the same fix to the other deterministic branch around the second
occurrence noted (near the 7825-7854 block).
- Around line 7657-7670: The two-pass deterministic path uses array_scan and
warp.copy without binding them to the current stream/capture and overwrites the
counter rather than adding to the pre-existing value; update the logic around
warp._src.utils.array_scan(contrib, ...), the inclusive_out buffer, and the
writeback that finds counter_arr via kernel.adj.args and ct.array_var_label so
that (1) all scans, the inclusive_out usage, the warp.copy writeback, and the
later sort-reduce are submitted/bound to the same current stream/capture to
preserve ordering, and (2) when writing the total into counter_arr you first
read the existing counter value from counter_arr and add the inclusive_out last
element (the this-launch total) to it (or perform an atomic-add equivalent)
instead of blindly overwriting; ensure the same stream-binding fix is applied to
the sort-reduce at the other location (around the existing sort-reduce call).

In `@warp/_src/deterministic.py`:
- Around line 283-293: The counter scratch buffers are always allocated as
warp.int32 in allocate_counter_buffers, which will truncate 64-bit counters;
update allocate_counter_buffers to use each CounterTarget's value_ctype
(CounterTarget.value_ctype) when creating contrib and prefix (i.e., pass
dtype=_target.value_ctype to warp.zeros/warp.empty) or alternatively
raise/assert if _target.value_ctype is not a 32-bit type so non-32-bit counters
fail fast; ensure you reference the function allocate_counter_buffers and the
CounterTarget.value_ctype field when making the change.
- Around line 189-200: get_or_create_counter_target currently deduplicates
targets solely by array_var_label which causes all array elements to share one
deterministic counter; update get_or_create_counter_target to include the
logical counter index in the identity (e.g., use a tuple of (array_var_label,
logical_index) or add a logical_index field to CounterTarget) so each distinct
bucket gets its own CounterTarget, and before creating a new CounterTarget
validate the provided index expression is a constant zero (or
reject/non-deterministic indices) if your design only allows index 0; ensure you
update the lookup loop to compare the new key (array_var_label plus logical
index) and append the new CounterTarget with the logical index populated when
creating it.

---

Nitpick comments:
In `@warp/_src/context.py`:
- Around line 1339-1341: Validate deterministic_max_records immediately before
assigning into kernel_options: ensure the value is an int (or can be safely
converted) and is >= 0, raising a TypeError for non-int types and a ValueError
for negative values so callers fail fast; then assign
kernel_options["deterministic_max_records"] = deterministic_max_records only
after validation. Apply the same validation logic at the other ingestion site
referenced around lines 2583-2585 to keep behavior consistent.

In `@warp/tests/test_unique_module.py`:
- Around line 170-205: The test currently skips the entire test when CUDA is
unavailable, but the module name hashing assertion in
test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 493e06fd-9309-43dc-ad53-7cc5a97aa0b8

📥 Commits

Reviewing files that changed from the base of the PR and between 3170255 and e34a93c.

📒 Files selected for processing (11)
  • asv/benchmarks/atomics.py
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
  • warp/tests/test_unique_module.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • warp/native/warp.h
  • warp/native/deterministic.cpp
  • warp/config.py

Comment thread asv/benchmarks/atomics.py Outdated
Comment thread warp/_src/codegen.py Outdated
@eric-heiden

eric-heiden commented Apr 16, 2026

Copy link
Copy Markdown
Member

This determinism mode doesn't yet support atomics inside Warp functions, which is used in MJWarp, for example in smooth.py:

  • _accumulate_jac_chain()
  • _accumulate_jac_dot_chain()

Minimal repro:

import warp as wp

wp.init()

a = wp.zeros(1, dtype=wp.float32, device="cuda")

@wp.func
def add_one(x: wp.array(dtype=wp.float32)):
    wp.atomic_add(x, 0, 1.0)

@wp.kernel
def k(x: wp.array(dtype=wp.float32)):
    add_one(x)

wp.launch(k, dim=1, inputs=[a], device="cuda")
wp.synchronize_device("cuda")
print(a.numpy())

With this PR's deterministic path enabled, atomic_add inline in a @wp.kernel works, but the same atomic inside a @wp.func helper called from a kernel fails NVRTC compilation.

The generated helper function body gets rewritten to use deterministic scatter / debug plumbing, but its signature does not receive the hidden deterministic params that are appended for kernel entry points (_wp_scatter_*, _wp_det_debug, _idx, etc.).

Comment thread warp/_src/context.py Outdated
stevenwman added a commit to stevenwman/jax-learning that referenced this pull request Apr 27, 2026
…nostic

scripts/check_tdmpc2_determinism.py — three subchecks (--check init |
update | env). Each runs in <30s; run twice in separate processes and
diff the printed SHA-256 hashes to localize where non-determinism enters.

.context/lessons/determinism.md — empirical results + root cause +
upstream fix-in-flight tracking. TL;DR:
  - JAX/XLA algo: bit-ID across processes IFF
    XLA_FLAGS=--xla_gpu_deterministic_ops=true
  - mujoco_warp env: NOT bit-ID across processes; XLA flag has no effect
    on Warp CUDA kernels. wp.set_device("cpu") is the only workaround
    today. Fix in NVIDIA/warp#1355 (Warp 1.14, ~Jun 2026) +
    mujoco_warp#1281, #1300.
  - Therefore end-to-end training trajectories cannot be byte-identical
    across two GPU processes. Same-process re-runs appear deterministic.

Rationale for adding now: the TD-MPC2 eval-key isolation fix (b5a8c70)
couldn't be empirically verified by diffing two-process trajectories
because env non-determinism dominates; needed the within-process
key-inspection workaround. Future repeatability questions for any algo
in this repo should consult this lesson.

LESSONS.md updated with index entry.
@eric-heiden eric-heiden force-pushed the warp-deterministic branch from 1744236 to f207179 Compare May 1, 2026 23:33
Comment thread warp/_src/deterministic.py Outdated
Comment thread warp/_src/codegen.py
@mar-yan24

Copy link
Copy Markdown

@mmacklin Hi, I am currently working on playing around with determinism in mujoco_warp and this implementation in Warp itself with base level atomics would be pretty helpful in expanding and simplifying that. I just made a PR to your branch, if you want to take a look, addressing some of the problems here.

@eric-heiden

Copy link
Copy Markdown
Member

Thanks @mar-yan24, I took a closer look at your PR against mmacklin/warp-deterministic.

There are definitely useful fixes in there that we should incorporate selectively:

  • replacing cub::TransformInputIterator with thrust::make_transform_iterator fixes the CUDA 13 build failure in warp/native/deterministic.cu
  • the sliced/view array atomic handling looks relevant for MuJoCo Warp patterns like wp.atomic_add(arr[i], j, value)
  • the backward-kernel deterministic-parameter issue is real and should be fixed
  • the explicit graph-capture error for consumed-return atomic counter/allocator patterns, e.g. slot = wp.atomic_add(counter, 0, 1), is useful because that path currently requires the two-pass count/scan/execute transform and is not capture-safe

I do not think we can merge the PR wholesale, though, because it appears to roll back some of the wp.Struct target handling added here: in particular the attr_path-based target identity, unique helper naming, and struct-field regression tests. We need to preserve support for struct member arrays such as writer.counter / writer.output while adding the sliced-array and backward fixes.

So the plan is to adapt the useful pieces from your PR into this branch while keeping the struct-field support intact. We’ll use a focused integration commit with a Co-authored-by: Mark Yang ... trailer so your contribution is acknowledged in the commit history.

Comment thread warp/_src/codegen.py
@eric-heiden eric-heiden force-pushed the warp-deterministic branch 2 times, most recently from 79c9a14 to a925a72 Compare May 8, 2026 20:20
Comment thread warp/_src/context.py Outdated
@eric-heiden eric-heiden self-assigned this May 8, 2026
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
…stic

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>

# Conflicts:
#	warp/_src/codegen.py
#	warp/_src/context.py
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
@eric-heiden eric-heiden force-pushed the warp-deterministic branch from d0284f4 to ff16c40 Compare June 9, 2026 23:37
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
@johnnynunez

Copy link
Copy Markdown
Contributor

Hi @mmacklin — following up on @mar-yan24's report about mujoco_warp compatibility. I tested the current branch against mjwarp on an RTX PRO 6000 Blackwell (sm_120, CUDA 13.3) and have good news plus a fix to offer.

The sliced atomic pattern (wp.atomic_add(arr[worldid], adr, val)) that was reported broken is fixed on the current branch — all the determinism tests pass here including the 2D/3D slice cases, and mjwarp's forward dynamics runs correctly with DeterministicMode.RUN_TO_RUN outside graph capture.

The remaining blocker was the documented conditional-body-graph limitation (design/deterministic-execution.md: "Deterministic kernels are not supported inside CUDA conditional body graphs"). mjwarp hits it on every step: its solver iterates via wp.capture_while, and launch_deterministic records stream-ordered allocations for scatter/counter buffers and CUB workspaces, which CUDA rejects inside conditional bodies ("Conditional body graph contains an unsupported operation (memory allocation)").

I implemented the workspace approach the design doc anticipates, on top of this branch — PR into your fork: mmacklin#3

  • During capture, deterministic buffer allocations are redirected to a dedicated non-capturing stream under a temporarily relaxed thread capture mode (cudaThreadExchangeStreamCaptureMode), with the allocation stream synchronized before captured work consumes the memory. The captured graph contains zero allocation nodes; lifetime is still tied to the graph via _deterministic_buffer_refs.
  • Since allocation-time initialization no longer replays with the graph, explicit fill_/zero_ resets are recorded on the capturing stream so each replay starts clean.
  • Adds wp_cuda_thread_exchange_capture_mode native API + two capture_while regression tests (scatter and consumed-return counter paths) modeled on the mjwarp solver pattern.

Verified: full determinism suite passes (87 tests incl. the 2 new ones), 647 tests across CodeGen/Launch/Graph/Stream/Array suites pass, and mjwarp now steps correctly under wp.ScopedCapture + capture_while with RUN_TO_RUN, with bit-identical graph replays.

Two smaller observations from mjwarp stress testing, happy to file separately:

  1. Contact-rich scenes overflow the static scatter-record estimate in mjwarp's CCD kernel (Deterministic scatter buffer overflow in kernel 'ccd_kernel'); deterministic_max_records=64 works around it.
  2. With contacts/constraints bitwise-stable, long mjwarp rollouts still diverge after ~100 steps from 1-ulp differences in solver internals (wp.tile_reduce paths) — expected per the mjwarp-side roadmap, just noting it's the next frontier.

cc @mar-yan24 @thowell — with this, the warp-side determinism should be testable end-to-end from mujoco_warp again.

@johnnynunez

Copy link
Copy Markdown
Contributor

Follow-up with deeper measurements on the remaining long-rollout drift I mentioned, since it's relevant to the scope of this PR's guarantees.

wp.tile_reduce is bit-exact run-to-run and is NOT the drift source. I tested it in isolation with adversarial wide-dynamic-range inputs over 200 launches: bitwise identical every time, as expected from the fixed shuffle tree + index-ordered partial combination in tile_reduce.h. One caveat worth documenting: the result is only fixed per launch configuration — changing block_dim (64 → 128) changes the reduction tree and gives different rounding (maxdiff ~0.125 on my adversarial input). That's irrelevant for RUN_TO_RUN but means tile reductions will need a canonical-order strategy if GPU_TO_GPU is ever extended to cover them, since occupancy/block_dim differ across architectures.

Single-step determinism through mujoco_warp is fully bitwise. Replaying one identical mjw.step() 20× from a frozen state: every array in Data (contacts → constraint rows → solver → integration) is bit-identical. The deterministic atomics in this PR are doing their job end-to-end.

The mjwarp long-rollout divergence (~step 110-125 in a 20-body contact-rich pile) is not attributable to this PR. It reproduces identically with NOT_GUARANTEED (~same onset step), survives zeroing all wp.empty allocations and mjwarp's per-step solver context workspaces, and persists with the island solver disabled. Two replays from a bit-identical full-Data reset stay bitwise equal for 100+ steps before splitting inside the solver (efc.Ma/qacc first, with identical efc_J/efc_force/contact inputs and identical solver_niter that step). So the residual nondeterminism is in mjwarp solver internals that don't pass through interceptable wp.atomic_* calls — being chased on the mjwarp side (google-deepmind/mujoco_warp#1281 thread), not a Warp issue.

Net: from mjwarp's perspective, the only blockers to using this PR were the conditional-body-graph allocation issue (fix in mmacklin#3) and the CCD scatter-record sizing mentioned above. The determinism machinery itself checks out under heavy stress testing on sm_120 / CUDA 13.3.

@eric-heiden

Copy link
Copy Markdown
Member

Thanks, @johnnynunez! I've looped in @nvlukasz to take a look at your PR since he implemented the graph conditional feature. This special treatment might be a bit out of scope for now since it touches other parts of the way we do memory allocations in CUDA graphs here. So we could consider adding it later (I want to get this PR merged very soon).

For now, in Newton's SolverMuJoCo, you can simply disable graph conditionals with this setting:

solver.mjw_model.opt.graph_conditional = False

Generally the automatic determinism mode is quite slow for MuJoCo because of the high kernel count (I noticed 10-30x slowdowns in some scenarios I tested). So a better option is to implement custom code for the Newton solvers that will implement a more optimized way of determinism where only a subset of kernels need to be modified.

@johnnynunez

Copy link
Copy Markdown
Contributor

Thanks @eric-heiden — that makes sense, and no objection to deferring mmacklin#3 if it helps land this PR sooner. The branch is there for @nvlukasz whenever graph-conditional support becomes in-scope; opt.graph_conditional = False is a workable interim for Newton's SolverMuJoCo.

Your 10-30x observation matches my measurements. Benchmarked on a 20-body contact pile (graph-captured, sm_120): automatic RUN_TO_RUN costs 3x at 1 world and collapses to ~100x at 256+ worlds (per-launch sort-reduce dominates, and the scatter buffers hit the int32 allocation limit at 1024 worlds × njmax=2048). The custom deterministic kernels in google-deepmind/mujoco_warp#1281 cost only ~9-12% on the same scenes — so fully agreed that per-solver custom kernels are the right architecture for MuJoCo/Newton, with this PR's automatic mode as the correctness baseline to validate against. Posted the full comparison table on that PR.

One small thing that may be worth a docs note or better error before merge: with many worlds the deterministic scatter buffer allocation fails with a raw ValueError: Array shapes must not exceed ... signed 32-bit integer from wp.empty — a dedicated error suggesting deterministic_max_records (or per-kernel opt-out) would make the failure mode clearer.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Implement the Ahrens-style binned summation method for scalar float RUN_TO_RUN reductions. This keeps the fast CUB sort/reduce structure while reducing binned accumulators instead of raw floats, so CUB's reduction tree no longer changes the final result from run to run.

On repro_scalar_failure.py with n=42802 and seed=42925, RUN_TO_RUN produced 1 unique output across 50 runs. The previous timing runs were about 1.39-1.63 ms for RUN_TO_RUN versus 16.5-20.1 ms for GPU_TO_GPU over 500 timed runs.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add deterministic execution mode for atomic operations

6 participants