Skip to content

fix: request admission edge cases#685

Merged
eric-tramel merged 3 commits into
scheduling-yolofrom
codex/fix-yolo-request-admission
May 19, 2026
Merged

fix: request admission edge cases#685
eric-tramel merged 3 commits into
scheduling-yolofrom
codex/fix-yolo-request-admission

Conversation

@nabinchha

Copy link
Copy Markdown
Contributor

Summary: requires exact request admission leases on release; pins AIMD to one multiplicative cut per rate-limit cascade; enables request-pressure advisory in DatasetBuilder async scheduler wiring; updates docs to remove public ThrottleConfig guidance. Tests: request admission controller, async builder integration, RunConfig, full async scheduler, and git diff --check.

@nabinchha nabinchha requested a review from a team as a code owner May 19, 2026 21:12
@github-actions

github-actions Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

MkDocs preview: https://cbed668e.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-685.docs.buildwithfern.com/nemo/datadesigner​

Notebook tutorials are rendered without execution outputs in previews.

@nabinchha nabinchha changed the title Fix request admission edge cases fix: request admission edge cases May 19, 2026
@greptile-apps

greptile-apps Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR tightens two AIMD edge cases in AdaptiveRequestAdmissionController and wires up request-pressure advisory in the async scheduler. The core logic changes are in controller.py and are well-covered by three new focused tests.

  • Stale-lease guard (controller.py line 365): the guard now compares the full frozen-dataclass lease for equality (active != lease) instead of only the item.resource field, so any caller that passes back a modified copy of the lease (e.g., via dataclasses.replace) is correctly rejected without corrupting in-flight counters.
  • AIMD decrease gate (_apply_outcome): replaces first_in_cascade = (consecutive_rate_limits == 0) with should_decrease = (admitted_adaptive_limit <= prev_limit), using the limit snapshotted at acquisition time. This fixes a latent bug where, after the first burst cascade, a fresh rate-limit on a newly-admitted lease would never re-trigger a multiplicative decrease because consecutive_rate_limits was never 0 again. The rate_limit_ceiling is likewise anchored to the admitted limit rather than prev_limit, making the ceiling reflect the concurrency level the provider actually saw when it rate-limited.
  • dataset_builder.py: adds request_pressure_advisory=True to the AsyncTaskScheduler constructor; a new monkeypatch integration test confirms the kwarg is forwarded.

Confidence Score: 5/5

Safe to merge — all three changed code paths have direct, passing tests and the logic is consistent with the documented intent.

Both controller fixes (stale-lease equality check and AIMD decrease gate) are provably correct: the frozen-dataclass equality comparison is exact by construction because the stored lease object is the same instance returned to the caller, and the new admitted_adaptive_limit <= prev_limit guard correctly handles the previously-broken case where consecutive_rate_limits was never reset between cascades. The dataset-builder change is a single kwarg addition with a dedicated integration test.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/models/request_admission/controller.py Two related fixes: stale-lease guard is tightened to full dataclass equality (was resource-only), and AIMD decrease condition replaces first_in_cascade with admitted_adaptive_limit <= prev_limit so subsequent cascades can still decrease the limit after the first burst has already cut it.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Single-line change enabling request_pressure_advisory=True in the AsyncTaskScheduler constructor wiring; straightforward and now covered by test.
packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py Adds three new tests covering exact-lease stale rejection, single-decrease-per-burst, and re-decrease after a fresh cascade; all cases align with the new controller semantics.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py Adds a monkeypatch-based integration test confirming request_pressure_advisory=True is forwarded to AsyncTaskScheduler by _prepare_async_run; approach is sound.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant RAC as AdaptiveRequestAdmissionController
    participant State as AdaptiveRequestLimitState

    Caller->>RAC: try_acquire(item)
    RAC->>RAC: _acquire_locked(item, now)
    Note over RAC: snapshot current_limit → lease.current_adaptive_limit
    RAC-->>Caller: "RequestAdmissionLease (lease_id, current_adaptive_limit=N)"

    Caller->>RAC: release(lease, outcome)
    RAC->>RAC: "active = _active_leases.pop(lease_id)"
    alt "active != lease (stale: any field mismatch)"
        RAC->>RAC: "_active_leases[lease_id] = active  (restore)"
        RAC-->>Caller: "ReleaseResult(released=False, reason=stale_lease)"
    else exact match
        RAC->>RAC: _apply_outcome(state, resource, active.current_adaptive_limit, outcome, now, events)
        alt "outcome == rate_limited"
            RAC->>State: "consecutive_rate_limits += 1"
            RAC->>RAC: "should_decrease = (admitted_adaptive_limit <= state.current_limit)"
            alt should_decrease
                RAC->>State: "current_limit = floor(current_limit * MD_factor)"
                RAC->>State: "rate_limit_ceiling = min(ceiling, admitted_adaptive_limit)"
            end
        else "outcome == success (cooldown cleared)"
            RAC->>State: "consecutive_rate_limits = 0"
            RAC->>State: "success_streak += 1 → maybe increase current_limit"
        end
        RAC-->>Caller: "ReleaseResult(released=True, reason=released)"
    end
Loading

Reviews (2): Last reviewed commit: "revert request admission docs changes" | Re-trigger Greptile

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
@eric-tramel eric-tramel merged commit 70974fd into scheduling-yolo May 19, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants