fix(ci): F-203 SIMD timing flake — main CI andon#875

Merged: noahgift merged 2 commits into main from fix/ci-f203-flaky-simd-timing on Apr 18, 2026
Conversation

@noahgift

Contributor

ANDON: main has been red since 96d7349 (PR #869 merge)

Failing: quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0

Root cause

Single-shot timing. The test measured exactly one 100-iteration run per path. On shared CI runners, one measurement is dominated by cache state, frequency scaling, and neighbor-process preemption — SIMD sometimes timed slower than scalar purely from environmental noise.

CI log snippet:

F203: Q4_0 Performance Falsification
  Scalar: 122.13127ms
  SIMD:   135.973634ms
  Speedup: 0.90x

Not a SIMD regression. The scalar and SIMD paths are unchanged.
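
As a worked check of the logged numbers, the speedup is the scalar time divided by the SIMD time, so a value below 1.0 means SIMD was measured as slower. The function name `speedup_ratio` is hypothetical and only illustrates the arithmetic; it is not taken from the test itself.

```rust
/// Speedup of the SIMD path over the scalar path, given both times in milliseconds.
/// Hypothetical helper: values above 1.0 mean SIMD was faster.
fn speedup_ratio(scalar_ms: f64, simd_ms: f64) -> f64 {
    scalar_ms / simd_ms
}

/// Format a ratio the way the CI log prints it, e.g. "0.90x".
fn format_speedup(ratio: f64) -> String {
    format!("{:.2}x", ratio)
}
```

With the values from the failing CI run, `speedup_ratio(122.13127, 135.973634)` is roughly 0.898, which prints as `0.90x` and fails the `speedup > 1.0` assertion.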

Fix

Warmup round + best-of-5, take the minimum of each path. The minimum is a lower-jitter estimator of the underlying hardware cost.

The falsification assertion is unchanged (speedup > 1.0). If SIMD's best-of-5 is still slower than scalar's best-of-5, that's a real regression — the Popperian intent of F-203 is preserved, not weakened.
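
A minimal sketch of the warmup plus best-of-5 pattern described above, using only the standard library. The helper names `best_of` and `measured_speedup` are hypothetical and are not taken from the aprender-serve test; the round and iteration counts (5 and 100) follow the PR description.

```rust
use std::time::{Duration, Instant};

/// Run one untimed warmup round, then `rounds` timed rounds of `iters` calls
/// each, and return the fastest round. Hypothetical helper for illustration.
fn best_of<F: FnMut()>(rounds: usize, iters: usize, mut f: F) -> Duration {
    // Warmup: fill caches and let the CPU clock settle before timing anything.
    for _ in 0..iters {
        f();
    }

    let mut best = Duration::MAX;
    for _ in 0..rounds {
        let start = Instant::now();
        for _ in 0..iters {
            f();
        }
        // Keep the minimum: noise from preemption or cold caches only ever
        // adds time, so the fastest round is the least disturbed one.
        best = best.min(start.elapsed());
    }
    best
}

/// Ratio of the best scalar round to the best SIMD round.
/// The caller would assert that the result is greater than 1.0.
fn measured_speedup<S: FnMut(), V: FnMut()>(scalar: S, simd: V) -> f64 {
    let scalar_min = best_of(5, 100, scalar);
    let simd_min = best_of(5, 100, simd);
    scalar_min.as_secs_f64() / simd_min.as_secs_f64()
}
```

In a real benchmark the closures would wrap their inputs and outputs in `std::hint::black_box` so the optimizer cannot remove the measured work.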

Verification (4090 Yoga runner, debug build)

F203: Q4_0 Performance Falsification (best-of-5)
  Scalar (min): 47.93ms
  SIMD   (min): 46.58ms
  Speedup: 1.03x

Incidental change

cargo fmt -p aprender-serve corrected a pre-existing trailing blank line in crates/aprender-serve/src/contract_gate.rs. The change is included rather than reverted, so the tree is now fmt-clean.

Test plan

  • cargo test -p aprender-serve --lib test_f203_simd_faster_than_scalar_q4_0 -- --nocapture passes stably
  • CI workspace-test green
  • CI ci / gate green

🤖 Generated with Claude Code

…t flake

**Andon**: main has been red since 96d7349 (PR #869 merge). The failing
test is `quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0`:
single-shot timing on a 256×256 Q4_0 matvec got Scalar=122ms, SIMD=136ms
(speedup 0.90×) — not a regression, pure OS/CPU jitter.

**Root cause**: the test measured exactly one 100-iteration run of each
path. On shared CI runners, a single run is dominated by cache state,
frequency scaling, and neighbor-process preemption. SIMD timing was
sometimes slower than scalar purely from environmental noise.

**Fix**: warmup round + best-of-5 rounds, take the minimum of each. The
minimum is a lower-jitter estimator of the underlying hardware cost. If
SIMD's best measurement is still slower than scalar's best, that's a
real regression worth failing CI — the Popperian falsification property
of F-203 is preserved, not weakened.

**Verification** (4090 Yoga runner, debug build):
  F203: Q4_0 Performance Falsification (best-of-5)
    Scalar (min): 47.93ms
    SIMD   (min): 46.58ms
    Speedup: 1.03x

Threshold `speedup > 1.0` unchanged. Taking the minimum over five rounds
makes the result much less sensitive to runner noise.

Also picks up a pre-existing trailing-blank-line fmt drift in
`crates/aprender-serve/src/contract_gate.rs` that `cargo fmt -p
aprender-serve` corrected as a collateral effect.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@noahgift noahgift enabled auto-merge (squash) April 18, 2026 04:34
@noahgift noahgift merged commit 83bbf49 into main Apr 18, 2026
10 checks passed
@noahgift noahgift deleted the fix/ci-f203-flaky-simd-timing branch April 18, 2026 05:12
noahgift added a commit that referenced this pull request May 13, 2026
…t flake (#875)
