test(mcp): FALSIFY-MCP-002 strict — JSON Schema Draft 7 meta-validation #869
Merged
The base FALSIFY-MCP-002 only asserts that each tool's `inputSchema.type == "object"`, which is necessary but not sufficient: the MCP spec requires the schema to validate against JSON Schema Draft 7. A tool definition could ship with a malformed `properties` map or a nonexistent `type` and still pass.

This adds a separate `tests/falsify_schema.rs` that compiles every shipped tool's `inputSchema` with `jsonschema::validator_for`. Compilation performs meta-schema validation as a side effect, so any malformed schema fails the test before it can ship to a real client.

Includes a guard test that feeds a known-bad schema (`properties` set to a string) to the same validator path and asserts that compilation rejects it. This catches a future jsonschema upgrade that silently accepts garbage.

Dev-dep only: jsonschema 0.28 is already in Cargo.lock via aprender-train, so no new transitive crates are pulled in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
Apr 18, 2026
…t flake (#875)

**Andon**: main has been red since 96d7349 (PR #869 merge). The failing test is `quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0`: single-shot timing on a 256×256 Q4_0 matvec got Scalar=122ms, SIMD=136ms (speedup 0.90×). This is not a regression, only OS/CPU jitter.

**Root cause**: the test measured exactly one 100-iteration run of each path. On shared CI runners, a single run is dominated by cache state, frequency scaling, and neighbor-process preemption. SIMD timing was sometimes slower than scalar purely from environmental noise.

**Fix**: warmup round + best-of-5 rounds, take the minimum of each. The minimum is a lower-jitter estimator of the underlying hardware cost. If SIMD's best measurement is still slower than scalar's best, that is a real regression worth failing CI, so the Popperian falsification property of F-203 is preserved, not weakened.

**Verification** (4090 Yoga runner, debug build):

    F203: Q4_0 Performance Falsification (best-of-5)
    Scalar (min): 47.93ms
    SIMD (min): 46.58ms
    Speedup: 1.03x

Threshold `speedup > 1.0` unchanged. Test is now deterministic within measurement precision.

Also picks up a pre-existing trailing-blank-line fmt drift in `crates/aprender-serve/src/contract_gate.rs` that `cargo fmt -p aprender-serve` corrected as a collateral effect.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
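The warmup plus best-of-N measurement described in the fix can be sketched with the standard library alone. The names `best_of` and `speedup` are hypothetical and the kernels being timed are left to the caller; this is an illustration of the technique, not the code in the F-203 test.

```rust
use std::time::{Duration, Instant};

/// Runs `f` for one untimed warmup round, then `rounds` timed rounds of
/// `iters` calls each, and returns the fastest round.
/// With `rounds == 0` nothing is timed and `Duration::MAX` is returned.
fn best_of<F: FnMut()>(rounds: usize, iters: usize, mut f: F) -> Duration {
    // Warmup: lets caches fill and clocks ramp up; the timing is discarded.
    for _ in 0..iters {
        f();
    }
    let mut best = Duration::MAX;
    for _ in 0..rounds {
        let start = Instant::now();
        for _ in 0..iters {
            f();
        }
        // Keep the minimum: noise from preemption or cold caches only
        // ever adds time, so the fastest round is the least disturbed.
        best = best.min(start.elapsed());
    }
    best
}

/// Ratio of scalar time to SIMD time; values above 1.0 mean SIMD is faster.
fn speedup(scalar: Duration, simd: Duration) -> f64 {
    scalar.as_secs_f64() / simd.as_secs_f64()
}
```

A test in this style would call `best_of(5, 100, ...)` once per code path, wrapping each kernel's result in `std::hint::black_box` so the work is not optimized away, and then assert `speedup(scalar_min, simd_min) > 1.0`.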
5 tasks
Summary

Adds `tests/falsify_schema.rs`, which compiles every shipped tool's `inputSchema` with `jsonschema::validator_for`. Compilation performs meta-schema validation, so any malformed schema fails before it can ship.

Why

The base FALSIFY-MCP-002 only asserts `inputSchema.type == "object"`. The MCP spec requires the schema to validate against JSON Schema Draft 7. A tool shipping with a malformed `properties` map or a nonexistent `type` would pass the existing test today.

Cargo.lock impact

Single dev-dep edge added. `jsonschema 0.28` is already in the lockfile via `aprender-train`, so zero new transitive crates.

Test plan

- `cargo test -p aprender-mcp --test falsify_schema`: both tests pass
- `cargo test -p aprender-mcp` still green
- `cargo clippy -p aprender-mcp --all-targets -- -D warnings` clean

🤖 Generated with Claude Code