test(mcp): FALSIFY-MCP-002 strict — JSON Schema Draft 7 meta-validation#869

Merged
noahgift merged 1 commit into main from feat/apr-mcp-schema-validation on Apr 18, 2026
@noahgift

Summary

  • Adds tests/falsify_schema.rs that compiles every shipped tool's inputSchema with jsonschema::validator_for. Compilation performs meta-schema validation, so any malformed schema fails before it can ship.
  • Includes a guard test feeding a known-bad schema to the same path — protects against a future jsonschema upgrade that silently accepts garbage.

Why

The base FALSIFY-MCP-002 only asserts inputSchema.type == "object". The MCP spec requires the schema itself to be valid JSON Schema Draft 7, so a tool shipping with a malformed properties map or a nonexistent type value would still pass the existing test today.
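
To make the gap concrete, here is a minimal std-only sketch. It is not the code in this PR: the `Json` enum is a hand-rolled stand-in for a real JSON value, and `strict_check` is a toy that covers only two Draft 7 rules (`properties` must be an object, and a string `type` must name a known type). The real test delegates all of this to the jsonschema crate. All names below are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hand-rolled stand-in for a JSON value (assumption: strings and objects only).
#[derive(Debug)]
enum Json {
    Str(String),
    Obj(BTreeMap<String, Json>),
}

/// The seven primitive type names defined by JSON Schema Draft 7.
const DRAFT7_TYPES: [&str; 7] =
    ["array", "boolean", "integer", "null", "number", "object", "string"];

/// The weak check: only the top-level `type` is inspected.
fn base_check(schema: &Json) -> bool {
    match schema {
        Json::Obj(map) => matches!(map.get("type"), Some(Json::Str(t)) if t.as_str() == "object"),
        _ => false,
    }
}

/// Toy structural check. Draft 7 also allows boolean schemas and an array
/// for `type`; this sketch omits both because the toy enum cannot express them.
fn strict_check(schema: &Json) -> Result<(), String> {
    let map = match schema {
        Json::Obj(map) => map,
        _ => return Err("schema must be an object".to_string()),
    };
    match map.get("type") {
        None => {}
        Some(Json::Str(t)) if DRAFT7_TYPES.contains(&t.as_str()) => {}
        Some(Json::Str(t)) => return Err(format!("unknown type `{t}`")),
        Some(_) => return Err("`type` must be a string".to_string()),
    }
    match map.get("properties") {
        None => Ok(()),
        Some(Json::Obj(props)) => {
            // Every property value is itself a schema, so recurse into it.
            for (name, sub) in props {
                strict_check(sub).map_err(|e| format!("properties.{name}: {e}"))?;
            }
            Ok(())
        }
        Some(_) => Err("`properties` must be an object".to_string()),
    }
}

fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Obj(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

fn s(v: &str) -> Json {
    Json::Str(v.to_string())
}

fn main() {
    let good = obj(vec![
        ("type", s("object")),
        ("properties", obj(vec![("path", obj(vec![("type", s("string"))]))])),
    ]);
    // Same shape as the guard case: `properties` set to a string.
    let bad_props = obj(vec![("type", s("object")), ("properties", s("oops"))]);
    // A misspelled nested type that the weak check never looks at.
    let bad_type = obj(vec![
        ("type", s("object")),
        ("properties", obj(vec![("path", obj(vec![("type", s("strng"))]))])),
    ]);
    for (name, schema) in [("good", &good), ("bad_props", &bad_props), ("bad_type", &bad_type)] {
        println!("{name}: base={} strict={:?}", base_check(schema), strict_check(schema));
    }
}
```

Both malformed schemas satisfy `base_check`, and only the structural pass rejects them. The `bad_props` case plays the role of the guard test: if a check like this ever started accepting it, the guard would fail.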

Cargo.lock impact

Single dev-dep edge added. jsonschema 0.28 is already in the lockfile via aprender-train, so zero new transitive crates.

Test plan

  • cargo test -p aprender-mcp --test falsify_schema — both tests pass
  • Full cargo test -p aprender-mcp still green
  • cargo clippy -p aprender-mcp --all-targets -- -D warnings clean

🤖 Generated with Claude Code

The base FALSIFY-MCP-002 only asserts that each tool's `inputSchema.type ==
"object"`, which is necessary but not sufficient: the MCP spec requires the
schema to validate against JSON Schema Draft 7. A tool definition could ship
with a malformed `properties` map or a nonexistent `type` and still pass.

This adds a separate `tests/falsify_schema.rs` that compiles every shipped
tool's `inputSchema` with `jsonschema::validator_for`. Compilation performs
meta-schema validation as a side effect, so any malformed schema fails the
test before it can ship to a real client.

Includes a guard test that feeds a known-bad schema (`properties` set to a
string) to the same validator path, asserting that compilation rejects it.
This catches a future jsonschema upgrade that silently accepts garbage.

Dev-dep only: jsonschema 0.28 is already in Cargo.lock via aprender-train,
so no new transitive crates are pulled in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 18, 2026 03:55
@noahgift noahgift merged commit 96d7349 into main Apr 18, 2026
11 checks passed
@noahgift noahgift deleted the feat/apr-mcp-schema-validation branch April 18, 2026 04:07
noahgift added a commit that referenced this pull request Apr 18, 2026
…t flake (#875)

**Andon**: main has been red since 96d7349 (PR #869 merge). The failing
test is `quantize::tests::tests_25::test_f203_simd_faster_than_scalar_q4_0`:
single-shot timing on a 256×256 Q4_0 matvec got Scalar=122ms, SIMD=136ms
(speedup 0.90×) — not a regression, pure OS/CPU jitter.

**Root cause**: the test measured exactly one 100-iteration run of each
path. On shared CI runners, a single run is dominated by cache state,
frequency scaling, and neighbor-process preemption. SIMD timing was
sometimes slower than scalar purely from environmental noise.

**Fix**: warmup round + best-of-5 rounds, take the minimum of each. The
minimum is a lower-jitter estimator of the underlying hardware cost. If
SIMD's best measurement is still slower than scalar's best, that's a
real regression worth failing CI — the Popperian falsification property
of F-203 is preserved, not weakened.
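
The warmup plus best-of-N pattern described above can be sketched as follows. This is an illustrative std-only version, not the test's actual code: `best_of`, `dot_loop`, and `dot_iter` are hypothetical names, and the two dot-product functions merely stand in for the scalar and SIMD Q4_0 kernels.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` once untimed as a warmup, then `rounds` timed rounds,
/// and returns the fastest round.
fn best_of<F: FnMut()>(rounds: usize, mut work: F) -> Duration {
    assert!(rounds > 0, "need at least one timed round");
    work(); // warmup: fill caches and let the CPU clock ramp up
    (0..rounds)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .min()
        .expect("rounds > 0 was asserted above")
}

/// Hypothetical stand-in for the scalar path.
fn dot_loop(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    let mut acc = 0.0;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    acc
}

/// Hypothetical stand-in for the optimized path.
fn dot_iter(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = vec![1.0_f32; 256 * 256];
    let b = vec![0.5_f32; 256 * 256];

    // Each round is 100 iterations, mirroring the shape described in the commit.
    let t_loop = best_of(5, || {
        for _ in 0..100 {
            black_box(dot_loop(black_box(&a), black_box(&b)));
        }
    });
    let t_iter = best_of(5, || {
        for _ in 0..100 {
            black_box(dot_iter(black_box(&a), black_box(&b)));
        }
    });

    let speedup = t_loop.as_secs_f64() / t_iter.as_secs_f64();
    println!("loop (min): {t_loop:?}");
    println!("iter (min): {t_iter:?}");
    println!("speedup: {speedup:.2}x");
}
```

Taking the minimum rather than the mean is deliberate: preemption and cache misses can only add time to a round, so the fastest round is the one least contaminated by noise.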

**Verification** (4090 Yoga runner, debug build):
  F203: Q4_0 Performance Falsification (best-of-5)
    Scalar (min): 47.93ms
    SIMD   (min): 46.58ms
    Speedup: 1.03x

Threshold `speedup > 1.0` unchanged. The test is now stable against
single-run jitter, although a 1.03x margin leaves little headroom.
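
As a worked check of the numbers quoted in this commit message, the sketch below recomputes both speedups and applies the `speedup > 1.0` gate. The function names are hypothetical; only the timings and the threshold come from the text above.

```rust
/// Speedup of SIMD over scalar: values above 1.0 mean SIMD was faster.
fn speedup(scalar_ms: f64, simd_ms: f64) -> f64 {
    scalar_ms / simd_ms
}

/// The gate as described in the commit: a strict `speedup > 1.0`.
fn passes_gate(scalar_ms: f64, simd_ms: f64) -> bool {
    speedup(scalar_ms, simd_ms) > 1.0
}

fn main() {
    // Single-shot timings from the failing CI run.
    println!("single-shot: {:.2}x pass={}", speedup(122.0, 136.0), passes_gate(122.0, 136.0));
    // Best-of-5 minimums from the verification run.
    println!("best-of-5:   {:.2}x pass={}", speedup(47.93, 46.58), passes_gate(47.93, 46.58));
}
```

The best-of-5 ratio is 47.93 / 46.58 ≈ 1.029, which rounds to the reported 1.03x, and the single-shot ratio is 122 / 136 ≈ 0.897, which rounds to the reported 0.90×.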

Also picks up a pre-existing trailing-blank-line fmt drift in
`crates/aprender-serve/src/contract_gate.rs` that `cargo fmt -p
aprender-serve` corrected as a collateral effect.

🤖 Generated with [Claude Code](https://claude.com/claude-code)