GH-560: Wire wgpu fallback into batch inference path #560

@noahgift

Five-Whys

  1. Why doesn't batch mode use wgpu? → init_batch_model() in batch.rs only tries OwnedQuantizedModelCuda. When CUDA parity fails, it falls back to CPU.
  2. Why no wgpu in batch init? → The BatchModel struct only has gpu: Option<OwnedQuantizedModelCuda> and cpu: Option<OwnedQuantizedModel>. No wgpu variant.
  3. Why no wgpu variant? → Batch was designed before wgpu inference existed.
  4. Why does this matter? → A 32B model in worker mode times out (316s per problem). Batch mode loads the model once, but without wgpu in the batch path, 32B eval is either CPU-only (slow) or worker-mode (timeouts).
  5. Fix: Add wgpu to init_batch_model — when CUDA parity fails, try wgpu before CPU.
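
The proposed fix could look roughly like the sketch below. It is only an illustration, not the actual batch.rs code: the three model types are empty stand-ins, OwnedQuantizedModelWgpu and the wgpu field are hypothetical names, and the closures stand in for "load the model on this backend and pass the parity check".

```rust
// Stand-in types so the sketch compiles on its own. The real
// OwnedQuantizedModel and OwnedQuantizedModelCuda live in the project;
// OwnedQuantizedModelWgpu is a hypothetical name for the wgpu model.
pub struct OwnedQuantizedModel;
pub struct OwnedQuantizedModelCuda;
pub struct OwnedQuantizedModelWgpu;

/// BatchModel with an added wgpu slot (the `wgpu` field is the assumed change).
#[derive(Default)]
pub struct BatchModel {
    pub gpu: Option<OwnedQuantizedModelCuda>,
    pub wgpu: Option<OwnedQuantizedModelWgpu>,
    pub cpu: Option<OwnedQuantizedModel>,
}

impl BatchModel {
    /// Name of the backend that was actually initialised.
    pub fn backend_name(&self) -> &'static str {
        if self.gpu.is_some() {
            "cuda"
        } else if self.wgpu.is_some() {
            "wgpu"
        } else {
            "cpu"
        }
    }

    /// True when either GPU backend (CUDA or wgpu) is in use.
    pub fn used_gpu(&self) -> bool {
        self.gpu.is_some() || self.wgpu.is_some()
    }
}

/// Try CUDA, then wgpu, then CPU. Each GPU closure returns None when the
/// backend is unavailable or fails its parity check (hypothetical helpers).
pub fn init_batch_model_sketch(
    try_cuda: impl FnOnce() -> Option<OwnedQuantizedModelCuda>,
    try_wgpu: impl FnOnce() -> Option<OwnedQuantizedModelWgpu>,
    load_cpu: impl FnOnce() -> OwnedQuantizedModel,
) -> BatchModel {
    let mut model = BatchModel::default();
    if let Some(cuda) = try_cuda() {
        model.gpu = Some(cuda);
    } else if let Some(wgpu) = try_wgpu() {
        model.wgpu = Some(wgpu);
    } else {
        // Last resort: the existing CPU path.
        model.cpu = Some(load_cpu());
    }
    model
}
```

Passing the backend attempts as closures keeps the fallback order testable without a GPU; the real init_batch_model would call the project's own loaders instead.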

Contract

gpu-multi-backend-parity-v1.yaml equation backend_priority:

select(backends) = first(b in [cuda, wgpu, cpu] where parity(b) >= 0.98)

Currently violated in batch path — batch skips wgpu entirely.
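
As a rough illustration of the backend_priority equation, the selection rule can be written as a small function. The Backend enum, the select_backend name, and the parity callback are assumptions made for this sketch; only the ordering [cuda, wgpu, cpu] and the 0.98 threshold come from the contract.

```rust
/// Backends in the priority order given by the contract.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cuda,
    Wgpu,
    Cpu,
}

/// Parity threshold from the backend_priority equation.
pub const PARITY_THRESHOLD: f64 = 0.98;

/// Returns the first backend in [cuda, wgpu, cpu] whose parity score meets
/// the threshold, or None if no backend qualifies.
/// `parity` is a hypothetical callback that scores a backend in [0.0, 1.0].
pub fn select_backend(parity: impl Fn(Backend) -> f64) -> Option<Backend> {
    [Backend::Cuda, Backend::Wgpu, Backend::Cpu]
        .iter()
        .copied()
        .find(|&b| parity(b) >= PARITY_THRESHOLD)
}
```

With this shape, a CUDA parity failure falls through to wgpu before CPU, which is the behaviour the batch path currently lacks.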

Acceptance Criteria

  • apr run model.apr --batch-jsonl prompts.jsonl --gpu uses wgpu when CUDA parity fails
  • Batch output shows "used_gpu": true with wgpu backend
  • 32B MBPP eval completes without timeouts
