feat(rosetta): add IBM Granite model-family contract (closes #1588) #1659

Merged
noahgift merged 17 commits into main from fix/1588-granite-rosetta-family on May 13, 2026

@noahgift

Summary

Closes #1588. Adds `contracts/model-families/granite.yaml` for IBM Granite 3.x dense models so that the apr-cookbook architecture-demos spec flips Granite from `status: blocked` to covered.

Why no engine change

Granite 3.x dense models follow the LLaMA-3 architecture (GQA + RoPE + SwiGLU + RMSNorm) with IBM's 49152-token vocab and `tie_word_embeddings: true`. The existing machinery already supports them:

  • `from_model_type("granite" | "granite3")` → `Architecture::Llama` (tensor_expectation.rs:142)
  • `kernel_explain/resolve.rs` aliases `granite` → `GraniteForCausalLM` (resolve.rs:23)
  • Llama tensor name mapper handles the names

So this PR is YAML-only.
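
The dispatch that makes this possible can be pictured with a minimal sketch. The real `from_model_type` lives in `tensor_expectation.rs`; everything below is a hypothetical stand-in. The enum has a single variant, the `Option` return type is an assumption, and the `"llama"` arm is assumed rather than taken from this PR.

```rust
/// Hypothetical stand-in for the engine's architecture enum.
/// Only the `Llama` variant is confirmed by the PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Architecture {
    Llama,
}

/// Maps a Hugging Face `model_type` string to an architecture.
/// Returning `Option` is an assumption made for this sketch.
pub fn from_model_type(model_type: &str) -> Option<Architecture> {
    match model_type {
        // Granite 3.x dense models reuse the LLaMA tensor layout,
        // so both Granite spellings resolve to the same variant.
        "llama" | "granite" | "granite3" => Some(Architecture::Llama),
        // Unknown model types are left unresolved.
        _ => None,
    }
}
```

With a mapping like this already in place, adding a family contract needs no new Rust code, only the YAML file that names the family and its sizes.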

Sizes covered

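| Variant | params | hidden | layers | heads | kv_heads | inter | rope_theta |
|---------|--------|--------|--------|-------|----------|-------|------------|
| 2b | 2.5B | 2048 | 40 | 32 | 8 | 8192 | 1e7 |
| 8b | 8B | 4096 | 40 | 32 | 8 | 12800 | 1e7 |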

Both reference HF `ibm-granite/granite-3.1-{2b,8b}-base` config.json.
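
As a quick sanity check on these sizes, the derived attention shapes can be computed from the table. This is a sketch, not code from the repository: the `GraniteSize` struct and its method names are hypothetical, and it assumes the usual LLaMA-style convention that `head_dim = hidden / heads` (a config could override this).

```rust
/// Shape parameters copied from the size table (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct GraniteSize {
    pub hidden: usize,
    pub heads: usize,
    pub kv_heads: usize,
}

pub const GRANITE_2B: GraniteSize = GraniteSize { hidden: 2048, heads: 32, kv_heads: 8 };
pub const GRANITE_8B: GraniteSize = GraniteSize { hidden: 4096, heads: 32, kv_heads: 8 };

impl GraniteSize {
    /// Per-head dimension, assuming head_dim = hidden / heads.
    pub fn head_dim(&self) -> usize {
        self.hidden / self.heads
    }

    /// Number of query heads sharing one KV head under GQA.
    pub fn gqa_group(&self) -> usize {
        self.heads / self.kv_heads
    }

    /// Output width of the K (or V) projection.
    pub fn kv_proj_dim(&self) -> usize {
        self.kv_heads * self.head_dim()
    }
}
```

Under these assumptions both sizes share a GQA group of 4 query heads per KV head, with head dimensions of 64 (2b) and 128 (8b).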

Out of scope

  • Granite MoE variants (`granite-3.0-3b-a800m-*`) use `GraniteMoeForCausalLM`, a distinct architecture.
  • Granite 4.0 / 3.3 minor variants — same structure, separate size additions if needed.

Test plan

  • `pv validate contracts/model-families/granite.yaml` — 0 errors
  • FALSIFY-PARITY-002 `test_every_model_family_yaml_has_architecture` passes
  • CI: workspace-test

🤖 Generated with Claude Code

noahgift and others added 11 commits May 13, 2026 09:33
…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)

§74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's
stage-bisection scaffold (CPU vs GPU per-stage statistics analysis).
The F32 GEMV PTX kernel was reading weights with TRANSPOSED layout
interpretation:

Bug: kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j),
     but actual ML weights are stored [output_dim=N, input_dim=K]
     row-major (A[i,j] at i*K+j per PyTorch/SafeTensors/GGUF convention
     and PMAT-333 F32 dequantization output).

Symptom: GPU read transposed weights → computed y = A^T @ x instead
         of y = A @ x → systematically wrong logits, essentially
         uncorrelated with the CPU reference (cos=-0.005190 vs CPU,
         top-10 divergences all sign-flipped,
         CPU mean=-2.42 vs GPU mean=0.013).
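
The `cos` figure above reads as a cosine similarity between CPU and GPU logit vectors. A minimal sketch of that metric follows; the function name is hypothetical and the gate's actual pass threshold is not stated here, so none is assumed.

```rust
/// Cosine similarity between two equal-length logit vectors.
/// Returns NaN if either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```

A value near 1 means the two backends agree in direction, a value near 0 means their outputs are unrelated, and a value near -1 means they point in opposite directions.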

Fix: rewrite the inner loop to iterate along the K dimension within
     row block_id:
       row_base = a_ptr + block_id * K * 4
       thread reads A[block_id, t], A[block_id, t+32], ...
     instead of:
       col_base = a_ptr + block_id * 4
       thread reads A[t, block_id], A[t+32, block_id], ...
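
The two indexing schemes can be compared on the CPU with a small reference sketch. This is not the PTX kernel: the function names are hypothetical, and the 32-lane striding and per-block reduction are collapsed into a plain loop over K.

```rust
/// Correct read: A is stored [N, K] row-major, so A[i, j] is at i * k + j.
/// Output row i is the dot product of weight row i with x.
pub fn gemv_row_major_nk(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|i| (0..k).map(|j| a[i * k + j] * x[j]).sum::<f32>())
        .collect()
}

/// Buggy read: the same buffer interpreted as [K, N] row-major,
/// so output i walks the buffer with stride n starting at offset i.
pub fn gemv_misread_kn(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|i| (0..k).map(|t| a[t * n + i] * x[t]).sum::<f32>())
        .collect()
}
```

For a 2×3 weight buffer `[1, 2, 3, 4, 5, 6]` and `x = [1, 1, 1]`, the correct read gives `[6, 15]` while the misread gives `[9, 12]`: every element of the buffer is still consumed, but it is combined into the wrong outputs.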

Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090,
default graphed path):

  PARITY-GATE: PASS (no error from forward_gpu_resident)
  Throughput @ 128-tok 5-iter decode: 124.6 tok/s
  AC-SHIP1-007 floor: 30 tok/s
  Headroom: 4.15× over floor
  TTFT: 8.39 ms
  p50 latency: 1016 ms

Before PR-E:
  PARITY-GATE FAILED cos=-0.005190
  Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73)
  GPU CANNOT serve this model

After PR-E:
  PARITY-GATE PASS, default path, NO workarounds
  124.6 tok/s, 4.15× over floor

Ship-% impact:
  MODEL-1 ship %: **99% → 100%**
  10 of 10 AC-SHIP1-* LIVE-DISCHARGED:
    SHIP-001 (§72)  SHIP-002 (§61)  SHIP-003 (§72)
    SHIP-004 (§72)  SHIP-005 (§71)  SHIP-006 (§61.8)
    SHIP-007 (this PR)  SHIP-008 (§61)  SHIP-009 (§72)
    SHIP-010 (§72)

  MODEL-2 ship %: unchanged at 57% (independent track).

Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649)
→ §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's
'3-5 PR / 3-5 day' estimate.

Auxiliary change: logits.rs adds APR_LM_HEAD_FORCE_QTYPE env-var
probe kept as a diagnostic tool (zero behavior change when unset).

Test plan:
- [x] cargo build --release -p apr-cli --bin apr --features cuda → clean
- [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true
- [x] apr parity → PARITY-GATE PASS
- [ ] CI tests (workspace-test on per-PR runner)

Refs:
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract)
- PR #1649 (PR-B GPU stage dump scaffold)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…07 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)

The env-var bisection probe added in PR-E (this branch) introduced a
`_ =>` catch-all inside a `match` expression that referenced
`WeightQuantType` in its arm values. The `falsify_007_no_catch_all_
in_dispatch_sites` contract test's 30-line walk-back heuristic flagged
this as a violation, even though the match was on `&str` (env var
value), not on `WeightQuantType`.
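
A line-based heuristic of this kind can be sketched as follows to show why it produces the false positive. This is a hypothetical reconstruction, not the contract test itself: the function name, the exact `_ =>` detection rule, and the return format are all assumptions.

```rust
/// Returns the 1-based line numbers of `_ =>` arms that have a mention of
/// `type_name` within the preceding `window` lines (hypothetical heuristic).
pub fn flag_catch_alls(src: &str, type_name: &str, window: usize) -> Vec<usize> {
    let lines: Vec<&str> = src.lines().collect();
    let mut flagged = Vec::new();
    for (idx, line) in lines.iter().enumerate() {
        // Only lines that open a catch-all arm are candidates.
        if !line.trim_start().starts_with("_ =>") {
            continue;
        }
        // Walk back up to `window` lines looking for the type name.
        let start = idx.saturating_sub(window);
        if lines[start..idx].iter().any(|l| l.contains(type_name)) {
            flagged.push(idx + 1);
        }
    }
    flagged
}
```

Because the check only looks at text, a `match` on a string whose arm values mention `WeightQuantType` is indistinguishable from a `match` on the enum itself, which is the situation this commit message describes.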

The probe was a bisection tool used to identify the bug location
during §74. Now that §75 has shipped the actual fix and the probe is
no longer needed, removing it cleans up the contract violation.

The remaining PR-E change is solely the F32 GEMV PTX kernel layout
fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs` — that's the
actual bug fix.

Test verified:
  cargo test -p aprender-serve --lib \
      quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites
  → 1 passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `contracts/model-families/granite.yaml` so apr-cookbook's
architecture-demos spec flips Granite from `status: blocked` to live.

Granite 3.x dense models follow the LLaMA-3 architecture (GQA + RoPE + SwiGLU
+ RMSNorm) with the IBM 49152-token vocab and tied embeddings. No engine
change needed — `from_model_type("granite" | "granite3")` already returns
`Architecture::Llama`, and `kernel_explain/resolve.rs` already aliases
`granite → GraniteForCausalLM`.

Size variants: 2b (granite-3.1-2b-base) and 8b (granite-3.1-8b-base).
MoE variants (granite-3.0-3b-a800m-*) use a separate
GraniteMoeForCausalLM architecture and are out of scope.

Verified:
- `pv validate contracts/model-families/granite.yaml` → 0 errors
- FALSIFY-PARITY-002 (`test_every_model_family_yaml_has_architecture`)
  passes — the family is recognized by `from_model_type` → `Self::Llama`.

References: granite-3.1-2b-base / granite-3.1-8b-base HF config.json.
@noahgift noahgift enabled auto-merge (squash) May 13, 2026 16:04
@noahgift noahgift merged commit fa2bbb4 into main May 13, 2026
10 checks passed
@noahgift noahgift deleted the fix/1588-granite-rosetta-family branch May 13, 2026 23:05
Linked issue: feat: add IBM Granite (GraniteForCausalLM) loader to aprender::rosetta