
fix: restore_snapshot stateful generation hangs — continuation_ids and _stateful_generate flag dropped in upstream merge reconstruction #19

@KHAEntertainment

Description

Problem

POST /restore_snapshot with create_new_request=True and continuation_ids hangs indefinitely. The HTTP future in tokenizer_manager.restore_snapshot blocks on snapshot_restore_result_queue.get() and never resolves. Response always returns rid=null, output_text=null.
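
The blocking behaviour can be reproduced in isolation. The sketch below is a toy, not the tokenizer_manager code: it assumes the result queue behaves like an asyncio.Queue, and the helper name wait_for_restore_result and the timeout value are hypothetical. It shows that a get() on a queue that nothing ever writes to never resolves by itself, and that bounding the wait turns the hang into an observable failure while debugging.

```python
import asyncio


async def wait_for_restore_result(result_queue: asyncio.Queue, timeout_s: float):
    """Wait for one restore result, giving up after timeout_s seconds.

    Hypothetical helper: the real tokenizer manager awaits the queue without
    a bound, which is why the HTTP future never resolves.
    """
    try:
        return await asyncio.wait_for(result_queue.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No producer ever put a result on the queue.
        return None


async def main():
    # Nothing is ever put on this queue, mirroring the scheduler that never
    # sends a result back for the stateful request.
    queue: asyncio.Queue = asyncio.Queue()
    return await wait_for_restore_result(queue, timeout_s=0.1)


result = asyncio.run(main())
```

A bounded wait like this is only a diagnostic aid; it does not replace restoring the missing scheduler path.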

Affects all models. Step 5 (stateful recall) is BLOCKED in every compat protocol run.

Root Cause

The working implementation existed on the A100 (Phase 8: 4/4 PASS, commit 3917c1231 in the runpod backup). It had two components:

1. scheduler.py handle_restore_snapshot — when continuation_ids is present:

  • Appends continuation tokens to origin_input_ids
  • Sets new_req._stateful_generate = True
  • Applies recv_req.max_new_tokens to SamplingParams
  • Returns None (deferred — generation completes async)

2. scheduler_output_processor_mixin.py — on request finish:

  • Detects req._stateful_generate == True
  • Sends RestoreSnapshotReqOutput(success=True, rid=..., output_ids=[...]) via send_to_tokenizer
  • Tokenizer manager detokenizes output_ids → output_text and returns it to the HTTP caller

During the upstream merge (PR #15/#16), Phase 8 was reconstructed from a lost A100 session. The mixin (part 2) was ported correctly: scheduler_output_processor_mixin.py:1023 has the _stateful_generate check. But scheduler.py (part 1) was reconstructed without the continuation_ids path, so _stateful_generate is never set to True, the mixin's output routing is dead code, and the queue never unblocks.
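
Why the mixin check is dead code can be seen in a small model of the finish hook. This is a sketch under assumptions, not the contents of scheduler_output_processor_mixin.py: ToyReq, ToyRestoreOutput and on_request_finished are hypothetical stand-ins, and only the _stateful_generate attribute and the success/rid/output_ids fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyReq:
    """Hypothetical stand-in for a scheduler request."""
    rid: str
    output_ids: List[int] = field(default_factory=list)
    _stateful_generate: bool = False


@dataclass
class ToyRestoreOutput:
    """Hypothetical stand-in for RestoreSnapshotReqOutput."""
    success: bool
    rid: Optional[str]
    output_ids: List[int]


def on_request_finished(req: ToyReq, send_to_tokenizer: Callable) -> bool:
    """Send a restore result only for requests flagged as stateful."""
    if getattr(req, "_stateful_generate", False):
        send_to_tokenizer(ToyRestoreOutput(True, req.rid, list(req.output_ids)))
        return True
    # Flag never set: nothing is sent, so the waiting caller stays blocked.
    return False


sent = []
# Current behaviour: the scheduler never sets the flag, so nothing is routed.
on_request_finished(ToyReq("a", [1, 2]), sent.append)
# Intended behaviour: the flag is set, so the result reaches the tokenizer.
on_request_finished(ToyReq("b", [3, 4], _stateful_generate=True), sent.append)
```

Only the second request produces a message, which matches the observation that the check is present but never triggered.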

Evidence

  • Working code source: /home/jeanclawdai/runpod-backup/restore/repo at commit 857dd02a6 (latest Phase 8 branch tip)
  • Current scheduler.py: grep continuation_ids → no results
  • Current mixin scheduler_output_processor_mixin.py:1023: _stateful_generate check present but never triggered
  • test/registered/radix_cache/test_mamba_stateful_inference.py — all 4 tests hang at restore_snapshot call
  • Confirmed broken across: granite-tiny, granite-small, Nemotron-Cascade-2-30B, Qwen3-Coder-Next

Fix

Restore the create_new_request=True path in handle_restore_snapshot (scheduler.py):

  1. Read recv_req.continuation_ids, append to origin_input_ids when present
  2. Apply recv_req.max_new_tokens to SamplingParams if provided
  3. Set new_req._stateful_generate = stateful_generate (bool flag)
  4. Return None when stateful_generate=True (deferred), RestoreSnapshotReqOutput otherwise
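
The four steps could look roughly like the sketch below. It is a self-contained toy, not the scheduler.py implementation: every Toy* class, the snapshot_ids and waiting_queue parameters, and the default of 128 tokens are hypothetical, and deriving stateful_generate from the presence of continuation_ids is an assumption based on the Root Cause description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToySamplingParams:
    max_new_tokens: int = 128  # hypothetical default


@dataclass
class ToyRestoreInput:
    """Hypothetical stand-in for RestoreSnapshotReqInput."""
    rid: str
    create_new_request: bool = False
    continuation_ids: Optional[List[int]] = None
    max_new_tokens: Optional[int] = None


@dataclass
class ToyReq:
    rid: str
    origin_input_ids: List[int]
    sampling_params: ToySamplingParams
    _stateful_generate: bool = False


@dataclass
class ToyRestoreOutput:
    success: bool
    rid: str


def handle_restore_snapshot(recv_req, snapshot_ids, waiting_queue):
    """Toy version of the create_new_request=True path."""
    input_ids = list(snapshot_ids)
    # Assumption: "present" means a non-empty list of continuation tokens.
    stateful_generate = bool(recv_req.continuation_ids)
    if stateful_generate:
        input_ids.extend(recv_req.continuation_ids)      # step 1
    params = ToySamplingParams()
    if recv_req.max_new_tokens is not None:
        params.max_new_tokens = recv_req.max_new_tokens  # step 2
    new_req = ToyReq(recv_req.rid, input_ids, params)
    new_req._stateful_generate = stateful_generate       # step 3
    waiting_queue.append(new_req)
    if stateful_generate:
        return None  # step 4: deferred, result is sent when generation ends
    return ToyRestoreOutput(True, new_req.rid)


queue = []
deferred = handle_restore_snapshot(
    ToyRestoreInput("r1", True, [7, 8], 16), [1, 2, 3], queue
)
immediate = handle_restore_snapshot(ToyRestoreInput("r2", True), [1, 2, 3], queue)
```

In the first call the handler returns None and leaves the reply to the output processor; in the second it answers immediately because there is nothing to generate.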

Also verify RestoreSnapshotReqInput (in io_struct.py) has continuation_ids and max_new_tokens fields.
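
One quick way to do that check, assuming RestoreSnapshotReqInput is a dataclass (not confirmed here), is to inspect its declared fields. The helper missing_fields and the ExampleRestoreInput class below are hypothetical; in the repository the real class would be passed in instead.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import List, Optional

REQUIRED_FIELDS = ("continuation_ids", "max_new_tokens")


def missing_fields(cls, required=REQUIRED_FIELDS):
    """Return the required field names that cls does not declare."""
    if not is_dataclass(cls):
        raise TypeError(f"{cls!r} is not a dataclass")
    present = {f.name for f in fields(cls)}
    return [name for name in required if name not in present]


@dataclass
class ExampleRestoreInput:
    """Hypothetical struct that lost max_new_tokens during a merge."""
    rid: str
    continuation_ids: Optional[List[int]] = None


missing = missing_fields(ExampleRestoreInput)
```

An empty list means both fields are declared; any name in the list has to be added back to the struct before the scheduler path can read it.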

Verification

pytest test/registered/radix_cache/test_mamba_stateful_inference.py -v
# Expected: 4/4 PASS on granite-4.0-h-tiny
# --enable-snapshot-persistence --mamba-scheduler-strategy no_buffer
