fix(ci): F-203 SIMD timing flake — main CI andon#875
Merged
Conversation
…t flake **Andon**: main has been red since 96d7349 (PR #869 merge). The failing test is `quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0`: single-shot timing on a 256×256 Q4_0 matvec got Scalar=122ms, SIMD=136ms (speedup 0.90×) — not a regression, pure OS/CPU jitter. **Root cause**: the test measured exactly one 100-iteration run of each path. On shared CI runners, a single run is dominated by cache state, frequency scaling, and neighbor-process preemption. SIMD timing was sometimes slower than scalar purely from environmental noise. **Fix**: warmup round + best-of-5 rounds, take the minimum of each. The minimum is a lower-jitter estimator of the underlying hardware cost. If SIMD's best measurement is still slower than scalar's best, that's a real regression worth failing CI — the Popperian falsification property of F-203 is preserved, not weakened. **Verification** (4090 Yoga runner, debug build): F203: Q4_0 Performance Falsification (best-of-5) Scalar (min): 47.93ms SIMD (min): 46.58ms Speedup: 1.03x Threshold `speedup > 1.0` unchanged. Test is now deterministic within measurement precision. Also picks up a pre-existing trailing-blank-line fmt drift in `crates/aprender-serve/src/contract_gate.rs` that `cargo fmt -p aprender-serve` corrected as a collateral effect. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
5 tasks
noahgift
added a commit
that referenced
this pull request
May 13, 2026
…t flake (#875) **Andon**: main has been red since 96d7349 (PR #869 merge). The failing test is `quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0`: single-shot timing on a 256×256 Q4_0 matvec got Scalar=122ms, SIMD=136ms (speedup 0.90×) — not a regression, pure OS/CPU jitter. **Root cause**: the test measured exactly one 100-iteration run of each path. On shared CI runners, a single run is dominated by cache state, frequency scaling, and neighbor-process preemption. SIMD timing was sometimes slower than scalar purely from environmental noise. **Fix**: warmup round + best-of-5 rounds, take the minimum of each. The minimum is a lower-jitter estimator of the underlying hardware cost. If SIMD's best measurement is still slower than scalar's best, that's a real regression worth failing CI — the Popperian falsification property of F-203 is preserved, not weakened. **Verification** (4090 Yoga runner, debug build): F203: Q4_0 Performance Falsification (best-of-5) Scalar (min): 47.93ms SIMD (min): 46.58ms Speedup: 1.03x Threshold `speedup > 1.0` unchanged. Test is now deterministic within measurement precision. Also picks up a pre-existing trailing-blank-line fmt drift in `crates/aprender-serve/src/contract_gate.rs` that `cargo fmt -p aprender-serve` corrected as a collateral effect. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
ANDON: main has been red since 96d7349 (PR #869 merge)
Failing:
quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0Root cause
Single-shot timing. The test measured exactly one 100-iteration run per path. On shared CI runners, one measurement is dominated by cache state, frequency scaling, and neighbor-process preemption — SIMD sometimes timed slower than scalar purely from environmental noise.
CI log snippet:
Not a SIMD regression. The scalar and SIMD paths are unchanged.
Fix
Warmup round + best-of-5, take the minimum of each path. The minimum is a lower-jitter estimator of the underlying hardware cost.
The falsification assertion is unchanged (
speedup > 1.0). If SIMD's best-of-5 is still slower than scalar's best-of-5, that's a real regression — the Popperian intent of F-203 is preserved, not weakened.Verification (4090 Yoga runner, debug build)
Incidental change
cargo fmt -p aprender-servecorrected a pre-existing trailing-blank-line incrates/aprender-serve/src/contract_gate.rs. Included rather than reverted — the tree is now fmt-clean.Test plan
cargo test -p aprender-serve --lib test_f203_simd_faster_than_scalar_q4_0 -- --nocapturepasses stablyworkspace-testgreenci / gategreen🤖 Generated with Claude Code