feat(tools): byte-level schema canonicalize for prefix-cache stability #2499

Closed
HUQIANTAO wants to merge 1 commit into Hmbown:main from HUQIANTAO:feat/schema-canonicalize

Conversation

@HUQIANTAO

Summary

When MCP servers return tool schemas, the field order within each schema object and the order of entries in required / dependentRequired arrays can vary across reconnections. This causes the serialized tool catalog bytes to change even when the logical schema is unchanged, busting DeepSeek's KV prefix cache.

This PR adds schema_canonicalize::canonicalize_schema, which recursively:

  • Sorts every required array alphabetically
  • Sorts every dependentRequired sub-array alphabetically
  • Rebuilds object keys in alphabetical order
  • Recurses into all nested objects and arrays
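
The four steps above can be sketched roughly as follows. This is not the PR's code: `Json` is a hypothetical std-only stand-in for `serde_json::Value` with `preserve_order` enabled (objects keep insertion order), and `canonicalize` and `sort_string_array` are illustrative names.

```rust
// Hypothetical stand-in for serde_json::Value with preserve_order:
// an object is an insertion-ordered list of (key, value) pairs.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(i64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Sort an array alphabetically, but only if every element is a string.
fn sort_string_array(items: &mut Vec<Json>) {
    if items.iter().all(|v| matches!(v, Json::Str(_))) {
        items.sort_by(|a, b| match (a, b) {
            (Json::Str(x), Json::Str(y)) => x.cmp(y),
            _ => std::cmp::Ordering::Equal,
        });
    }
}

/// Recursively canonicalize a schema in place (assumed behavior, per the list above).
fn canonicalize(value: &mut Json) {
    match value {
        Json::Object(entries) => {
            for (key, child) in entries.iter_mut() {
                match (key.as_str(), &mut *child) {
                    // Step 1: sort every `required` array.
                    ("required", Json::Array(items)) => sort_string_array(items),
                    // Step 2: sort every sub-array of `dependentRequired`.
                    ("dependentRequired", Json::Object(deps)) => {
                        for (_, dep) in deps.iter_mut() {
                            if let Json::Array(items) = dep {
                                sort_string_array(items);
                            }
                        }
                    }
                    _ => {}
                }
                // Step 4: recurse into nested objects and arrays.
                canonicalize(child);
            }
            // Step 3: rebuild the object with keys in alphabetical order.
            entries.sort_unstable_by(|a, b| a.0.cmp(&b.0));
        }
        // Other arrays (enum, oneOf, ...) keep their order; only their items are visited.
        Json::Array(items) => {
            for item in items.iter_mut() {
                canonicalize(item);
            }
        }
        _ => {}
    }
}
```

Because equality on the `Vec`-backed object is order-sensitive, two schemas that differ only in field order compare unequal before `canonicalize` and equal after it. The sketch matches on key names alone, so a property that happens to be named `dependentRequired` would also be treated as the keyword; whether the real implementation guards against that is not stated in the PR.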

The canonicalize step runs after schema_sanitize in build_api_tools, so each tool's input_schema is first cleaned, then byte-stabilized. The existing OnceLock api_cache pins the result for the lifetime of the process, so the tool catalog bytes are identical across reads; because canonicalization is deterministic, they are also identical across process restarts.
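
A rough sketch of that ordering, under stated assumptions: `Registry`, `Schema`, `sanitize`, and `canonicalize` below are hypothetical stand-ins (a schema is reduced to a flat list of string pairs), and only `std::sync::OnceLock` and the sanitize-then-canonicalize order come from the description above.

```rust
use std::sync::OnceLock;

// Hypothetical simplification: a schema is a flat, insertion-ordered list of pairs.
type Schema = Vec<(String, String)>;

// Stand-in for schema_sanitize: here it just drops entries with empty values.
fn sanitize(mut schema: Schema) -> Schema {
    schema.retain(|(_, v)| !v.is_empty());
    schema
}

// Stand-in for canonicalize_schema: here it only sorts keys.
fn canonicalize(mut schema: Schema) -> Schema {
    schema.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    schema
}

struct Registry {
    schemas: Vec<Schema>,
    // Filled on first use, then reused for every later read in this process.
    api_cache: OnceLock<String>,
}

impl Registry {
    fn new(schemas: Vec<Schema>) -> Self {
        Self { schemas, api_cache: OnceLock::new() }
    }

    // Sanitize first, canonicalize second, then serialize and pin the bytes.
    fn build_api_tools(&self) -> &str {
        self.api_cache.get_or_init(|| {
            self.schemas
                .iter()
                .cloned()
                .map(|s| canonicalize(sanitize(s)))
                .map(|s| {
                    s.iter()
                        .map(|(k, v)| format!("{k}={v}"))
                        .collect::<Vec<_>>()
                        .join(",")
                })
                .collect::<Vec<_>>()
                .join(";")
        })
    }
}
```

Two `Registry` values built from the same entries in a different order play the role of two reconnections or two process runs: the `OnceLock` keeps reads stable within one instance, while the sort is what makes the two instances agree.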

Motivation

This is the highest-value, lowest-risk optimization identified in the cache audit. With the preserve_order feature enabled, serde_json::Value::Object is backed by an IndexMap, so it keeps the insertion order received from the MCP server, and that order can differ between reconnections. Canonicalizing after sanitizing guarantees that the same logical schema always produces the same bytes, preventing unnecessary prefix-cache busts.

Changes

  • New file: crates/tui/src/tools/schema_canonicalize.rs (199 lines including 8 unit tests)
  • Modified: crates/tui/src/tools/registry.rs - add canonicalize_schema call after sanitize in build_api_tools
  • Modified: crates/tui/src/tools/mod.rs - add module declaration

Testing

8 unit tests covering:

  1. sorts_required_array - basic required array sorting
  2. equivalent_ordering_matches - different field orders produce identical bytes
  3. sorts_dependent_required - dependentRequired sub-array sorting
  4. recursive_into_properties - nested schema canonicalization
  5. preserves_non_required_array_order - non-required arrays keep semantic order
  6. handles_empty_schema - empty object edge case
  7. handles_deeply_nested - deeply nested schema recursion
  8. key_order_is_alphabetical_after_canonicalize - object key ordering

All existing tests continue to pass. No new dependencies added (uses existing serde_json).

Risk Assessment

Extremely low risk:

  • Zero new dependencies
  • Only modifies the tool schema serialization path (already cached by OnceLock)
  • No API changes
  • No behavior changes for providers that don't use preserve_order
  • Fail-safe: if canonicalize somehow breaks a schema, schema_sanitize already runs first

@greptile-apps greptile-apps Bot left a comment

HUQIANTAO has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new schema_canonicalize module to recursively canonicalize JSON schemas, ensuring deterministic serialization and prefix-cache stability by sorting object keys and elements in required and dependentRequired arrays. This canonicalization is applied to tool input schemas in the registry. Feedback suggests optimizing the sorting of object keys by using sort_unstable_by instead of sort_by to avoid unnecessary allocations.

// drain(), so we swap to a temporary and rebuild.
let old = std::mem::take(map);
let mut entries: Vec<(String, Value)> = old.into_iter().collect();
entries.sort_by(|a, b| a.0.cmp(&b.0));

medium

Since the keys of a JSON object are guaranteed to be unique, the stability of the sorting algorithm is not required. Using sort_unstable_by instead of sort_by avoids allocating temporary helper memory and is generally faster.

Suggested change
- entries.sort_by(|a, b| a.0.cmp(&b.0));
+ entries.sort_unstable_by(|a, b| a.0.cmp(&b.0));
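
To illustrate the reviewer's point, here is a minimal sketch showing that, when keys are unique, the stable and unstable sorts produce the same order. The helper names `sort_stable` and `sort_unstable` are hypothetical, and `u32` is a placeholder for the schema value type.

```rust
// Stable sort: preserves the relative order of equal keys and may allocate scratch space.
fn sort_stable(mut entries: Vec<(String, u32)>) -> Vec<(String, u32)> {
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    entries
}

// Unstable sort: in place, no allocation; equal keys may be reordered,
// which cannot matter when every key is unique.
fn sort_unstable(mut entries: Vec<(String, u32)>) -> Vec<(String, u32)> {
    entries.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    entries
}
```

The two functions can only disagree on inputs with duplicate keys, which JSON object maps do not contain.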

@HUQIANTAO HUQIANTAO force-pushed the feat/schema-canonicalize branch from 7cee9cd to 95c80c9 Compare June 1, 2026 12:57

@Hmbown

Hmbown commented Jun 1, 2026

Harvested into codex/v0.8.50-triage. This was a good low-risk prefix-cache stability slice, and I pulled it in for the release branch.

I fixed the formatting failure locally with cargo fmt --all and amended the cherry-pick while preserving authorship. Focused schema canonicalization tests and codewhale-tui clippy pass on the triage branch.

…ility

When MCP servers return tool schemas, the field order within each schema
object and the order of entries in required / dependentRequired arrays
can vary across reconnections. This causes the serialized tool catalog
bytes to change even when the logical schema is unchanged, busting
DeepSeek's KV prefix cache.

Add schema_canonicalize::canonicalize_schema which recursively:
- Sorts every required array alphabetically
- Sorts every dependentRequired sub-array alphabetically
- Rebuilds object keys in alphabetical order
- Recurses into all nested objects and arrays

The canonicalize step runs after schema_sanitize in build_api_tools,
so each tool's input_schema is first cleaned then byte-stabilized.
The existing OnceLock api_cache pins the result, ensuring the tool
catalog bytes are identical across reads and across process restarts.

8 unit tests cover: required sorting, dependentRequired sorting,
equivalent-ordering byte match, recursive nesting, empty schemas,
deeply nested schemas, non-required array preservation, and key
ordering.
@HUQIANTAO HUQIANTAO force-pushed the feat/schema-canonicalize branch from 95c80c9 to 5067d8b Compare June 1, 2026 13:14

@HUQIANTAO

Closing: this slice was harvested upstream (per the maintainer comments), so the work is in main and there is no need to keep the PR open. Thanks for the review!

@HUQIANTAO HUQIANTAO closed this Jun 3, 2026