test(mcp): FALSIFY-MCP-002 strict — JSON Schema Draft 7 meta-validation by noahgift · Pull Request #869 · paiml/aprender

noahgift · 2026-04-18T03:55:36Z

Summary

Adds tests/falsify_schema.rs that compiles every shipped tool's inputSchema with jsonschema::validator_for. Compilation performs meta-schema validation, so any malformed schema fails before it can ship.
Includes a guard test feeding a known-bad schema to the same path — protects against a future jsonschema upgrade that silently accepts garbage.

Why

The base FALSIFY-MCP-002 only asserts inputSchema.type == "object". The MCP spec requires the schema to validate against JSON Schema Draft 7. A tool shipping with a malformed properties map or a nonexistent type would pass the existing test today.

Cargo.lock impact

Single dev-dep edge added. jsonschema 0.28 is already in the lockfile via aprender-train, so zero new transitive crates.

Test plan

cargo test -p aprender-mcp --test falsify_schema — both tests pass
Full cargo test -p aprender-mcp still green
cargo clippy -p aprender-mcp --all-targets -- -D warnings clean

🤖 Generated with Claude Code

The base FALSIFY-MCP-002 only asserts that each tool's `inputSchema.type == "object"`, which is necessary but not sufficient: the MCP spec requires the schema to validate against JSON Schema Draft 7. A tool definition could ship with a malformed `properties` map or a nonexistent `type` and still pass.

This adds a separate `tests/falsify_schema.rs` that compiles every shipped tool's `inputSchema` with `jsonschema::validator_for`. Compilation performs meta-schema validation as a side effect, so any malformed schema fails the test before it can ship to a real client.

Includes a guard test that feeds a known-bad schema (`properties` set to a string) to the same validator path, asserting that compilation rejects it. This catches a future jsonschema upgrade that silently accepts garbage.

Dev-dep only: jsonschema 0.28 is already in Cargo.lock via aprender-train, so no new transitive crates are pulled in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t flake

**Andon**: main has been red since 96d7349 (PR #869 merge). The failing test is `quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0`: single-shot timing on a 256×256 Q4_0 matvec got Scalar=122ms, SIMD=136ms (speedup 0.90×). This is not a regression, just OS/CPU jitter.

**Root cause**: the test measured exactly one 100-iteration run of each path. On shared CI runners, a single run is dominated by cache state, frequency scaling, and neighbor-process preemption. SIMD timing was sometimes slower than scalar purely from environmental noise.

**Fix**: warmup round + best-of-5 rounds, take the minimum of each. The minimum is a lower-jitter estimator of the underlying hardware cost. If SIMD's best measurement is still slower than scalar's best, that's a real regression worth failing CI, so the Popperian falsification property of F-203 is preserved, not weakened.

**Verification** (4090 Yoga runner, debug build):

F203: Q4_0 Performance Falsification (best-of-5)
Scalar (min): 47.93ms
SIMD (min): 46.58ms
Speedup: 1.03x

Threshold `speedup > 1.0` unchanged. Test is now deterministic within measurement precision.

Also picks up a pre-existing trailing-blank-line fmt drift in `crates/aprender-serve/src/contract_gate.rs` that `cargo fmt -p aprender-serve` corrected as a collateral effect.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
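
The warmup plus best-of-N scheme described in that commit message can be sketched with the standard library alone. This is an illustration under assumptions, not the code from #875: `best_of` and `speedup` are hypothetical names, and the real test's kernels and iteration counts are not shown in this page.

```rust
use std::time::{Duration, Instant};

/// Runs `work` for one untimed warmup round, then times `rounds` rounds of
/// `iters` calls each and returns the fastest round.
/// With `rounds == 0` this returns `Duration::MAX`.
fn best_of<F: FnMut()>(rounds: usize, iters: usize, mut work: F) -> Duration {
    // Warmup: fill caches and let the CPU clock ramp up; timing is discarded.
    for _ in 0..iters {
        work();
    }

    let mut best = Duration::MAX;
    for _ in 0..rounds {
        let start = Instant::now();
        for _ in 0..iters {
            work();
        }
        // Keep the minimum: noise from preemption or frequency scaling can
        // only make a round slower, never faster than the hardware allows.
        best = best.min(start.elapsed());
    }
    best
}

/// Ratio of baseline time to candidate time; values above 1.0 mean the
/// candidate (for example the SIMD path) was faster.
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}
```

A test built on this would call `best_of(5, 100, ...)` once per code path and assert `speedup(scalar_min, simd_min) > 1.0`, which keeps the original threshold while comparing the least-disturbed measurement of each path.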


noahgift enabled auto-merge (squash) April 18, 2026 03:55

noahgift merged commit 96d7349 into main Apr 18, 2026
11 checks passed

noahgift deleted the feat/apr-mcp-schema-validation branch April 18, 2026 04:07

noahgift mentioned this pull request Apr 18, 2026

fix(ci): F-203 SIMD timing flake — main CI andon #875

Merged

3 tasks

noahgift mentioned this pull request Apr 19, 2026

release: aprender v0.31.0 — consolidated CHANGELOG (MCP M1–M3 + parity epic + SHIP-TWO-001 teacher) #899

Merged

5 tasks

noahgift merged 1 commit into main from feat/apr-mcp-schema-validation
