schema: express conditional-required constraints for grok-style tool callers by jarvis-stark-ops · Pull Request #1 · 1Team-Engineering/hermes-agent

jarvis-stark-ops · 2026-06-06T13:47:53Z

Summary

kanban_complete and patch advertise weaker required than the handler actually enforces
Anthropic models follow the prose hints; xAI grok-4.3 follows JSON Schema literally and omits the fields, then loops on the python validator
Patch expresses the real constraints via JSON Schema 2020-12 anyOf / oneOf
Handlers unchanged — CLI legacy paths still work

Why this matters for 1Team-Engineering

The autonomous JARVIS profile (commit ab2da27 on 1Team-Engineering/hermes-jarvis) runs grok-4.3 in production and was locked out of kanban_complete calls. Banner profile hit the same on patch. This is the minimum schema correction that unblocks grok without changing handler behavior.

Test plan

python3 -c "from jsonschema import validate; ..." against the patched schemas
kanban_complete: 5/5 expected validity outcomes
patch: 6/6 expected validity outcomes
End-to-end dispatch on a real grok task (will validate via JARVIS resume after merge)

Upstream

This fork lives at 1Team-Engineering for production use; an equivalent PR upstream to NousResearch/hermes-agent can follow once we've validated in production for a few days.

🤖 Generated with Claude Code

…callers Two tool schemas advertised `required: []` (or just `["mode"]`) but enforced additional requirements in the Python handler. Anthropic-tuned models like Claude read the prose "REQUIRED PARAMETERS: …" hints and emit the fields. xAI grok-4.3 reads the JSON Schema literally and omits both/all of them, then loops on the Python validator's `tool_error`. This blocked autonomous orchestration profiles (e.g. JARVIS in the 1Team-Engineering hermes-jarvis fleet) running on grok from completing or patching anything. Fix: express the real constraints in the schema using JSON Schema 2020-12's `anyOf` / `oneOf`. Both branches use only `required` so the constraint is honored by every conformant validator. - kanban_complete: anyOf summary or result (matches handler line 543) - patch: oneOf mode=replace (path+old_string+new_string) | mode=patch (patch) (matches handler in _handle_patch) Schema validation tested with jsonschema 4.25.1 (Draft 2020-12): - kanban_complete: 5/5 cases match expected validity - patch: 6/6 cases match expected validity Handlers unchanged — they still tolerate both shapes for CLI legacy callers that bypass the schema layer (e.g. `hermes kanban complete --result "..."`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dict, reject respawn Independent review of the first Part 6 commit found three BROKEN issues plus several P1s. This commit addresses every one of them. ## BROKEN #1 — spawned review was stuck in todo (deadlock) `create_task` overrides `initial_status` whenever parents are present and any parent is not done. The umbrella stays `running` until the review approves, but the review can never be claimed because parent isn't done. Classic deadlock. Fix: spawn the integrative review as a PEER task, not a child of the umbrella. The relationship is tracked via the `archive_blocked_pending_integrative_review` event payload (`review_id`) and via title-prefix + tenant lookup. New review correctly lands in `ready` and tchalla can claim it immediately. ## BROKEN #2 — substring verdict matcher was exploitable The old check used `"verdict: approve" in result.lower()` which matches inside prose like "After consideration my verdict: approve would be wrong because...". Reviewers (or hostile patches) could satisfy the gate without ever issuing a canonical verdict. Fix: new `_v6_7_parse_verdict` uses a strict line-anchored regex `^\s*verdict:\s*(approve|reject)\b` with `re.MULTILINE`. First match wins, so `verdict: reject` before `verdict: approve` wins. Prose mentions mid-sentence don't anchor. ## BROKEN #3 — reject path had no re-spawn flow After a reject, `_v6_7_should_spawn_integrative_review` returned False (existing review existed) and the umbrella was permanently stuck. Operators had to manually delete the rejected review row. Fix: `_should_spawn` now considers a done-with-reject review as spawning-ready (after the orchestrator remediates). The next archive call spawns round 2 with title suffix `:r2` (and `:r3`, etc.). The event payload includes `supersedes` and `supersedes_verdict` so operators can audit the round transitions. ## Other findings addressed - BROKEN: `blocked` was in `terminal_statuses` — a blocked Friday means INCOMPLETE work. Removed `blocked` from the set; the gate now only treats `done` / `archived` as terminal for non-review children. - BROKEN: `_v6_7_should_spawn_integrative_review` had no `has_non_review_child` check — review-only chains pathologically spawned. Added the check. - BROKEN: Dead `IntegrativeReviewSpawned` dataclass was declared but never used. Deleted. - WEAK: Integrative-review children were counted as "non-review children" in the previous title-match. The new loop explicitly skips integrative-review children by title prefix. - WEAK: docstring for `archive_task` was missing. Added. ## Tests Rewrote the test file. 27 new tests (up from 14): - TestVerdictParser (8): line-anchored regex correctness, case- insensitive, prose-mention rejection, first-match-wins, empty - TestSpawnTriggerConditions (8): canonical chain → peer review spawned with `ready` status (the critical assertion that catches the deadlock); non-goal_mode archives; orphan archives; review- only does NOT spawn (corrected from prior test); no-review chain doesn't spawn; in-flight build doesn't spawn; blocked build child doesn't qualify as terminal - TestStateMachine (5): no double-spawn on in-flight, approve unblocks, reject triggers respawn on next archive (the critical fix), event emitted on pending, event includes `supersedes` on respawn - TestSpawnedReviewBody (4): scope items present, umbrella id in body, strict verdict format documented, workspace inheritance - TestVerdictBypassClosed (2): prose-approve does NOT unblock, canonical approve does unblock Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sResearch#62, NousResearch#64) Adds three pre-write-txn gates in `complete_task` mirroring the existing `_verify_created_cards` / `HallucinatedCardsError` pattern: - `verify_runtime_floor` (closes hermes-jarvis#64) — per-role floor on completed_at - started_at. Build 5min, review 90s, orchestration 0. - `verify_workspace_diff` (closes hermes-jarvis#62) — non-review workers on dir/worktree workspaces must produce a non-empty git diff against the tracking base. - `verify_no_stray_artifacts` (closes hermes-jarvis#28) — rejects *evidence*, commit-hash*, triage/*, tmp-*, and untracked no-extension/no-shebang files (the agent-dashboard PR #1 "all prior block evidence files" failure mode). Opt-outs via metadata (x_fast_justified / x_no_code / x_stray_ok) require ≥20-char string reasons and emit completion_opt_out_used audit events with verbatim reason. Truthy bools or short strings rejected with InvalidOptOutError. Context: hermes-jarvis#61 (bootstrap-paradox case study). 41 tests pass; 258 wider regression — zero failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two new completion gates that fire alongside the Part 1/2 gates. Closes hermes-jarvis#63 (PR-existence verification) Closes hermes-jarvis#32 (doc-drift check) Context: hermes-jarvis#61 (bootstrap-paradox case study) ## NousResearch#63 — verify_pr_urls_exist When a verdict result or summary contains a GitHub PR URL pattern, the dispatcher runs `gh pr view <url> --json number` to verify each URL resolves. Phantom URLs (404 with "Not Found" / "Could not resolve" / "no pull request" in stderr) reject the completion. Indeterminate cases (gh missing, network error, unauthenticated) fall open — workers can still complete in offline / broken-gh envs without being trapped. Catches the 2026-06-09 Tchalla case (hermes-jarvis#61): release-gate reviewer blocked with "cannot run gh pr diff 42" on PR NousResearch#42 that didn't exist. With this gate, his completion would have surfaced the phantom URL specifically, prompting him to surface the real cause. Opt-out: `metadata.x_phantom_pr_ok` with ≥20-char string reason. ## NousResearch#32 — verify_doc_drift For tasks whose tenant slug encodes a version (`marvel-swarm-vN-N-test`), the gate scans `README.md` / `README` in the workspace for older `vX.Y` mentions outside a history-style heading (`## History`, `## Older versions`, `## Previous versions`, `## Archive`). Stale mentions reject the completion. CHANGELOG.md is intentionally skipped — older versions are expected there by definition. Catches the 2026-06-09 agent-dashboard PR #1 case (hermes-jarvis#61): README still said "v6.2 Marvel swarm test target" while the chain was v6.6. No reviewer flagged it. Opt-out: `metadata.x_doc_drift_ok` with ≥20-char string reason. ## Tests 19 new tests added to `test_kanban_completion_gates.py`: - TestPRExistence (8) — no PR URL skipped, real passes, phantom rejects, mixed real+phantom flags only phantom, indeterminate falls open, summary scanned, dedup, opt-out - TestDocDrift (10) — non-versioned tenant skips, no README skips, stale README rejects, current README passes, History section excused, CHANGELOG file excused, higher version not stale, scratch skipped, opt-out 83 passed in test_kanban_completion_gates.py (up from 64). Zero regressions on adjacent paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tics, kill dead loop, rename helper Independent review of v6.8 Part 3 flagged three BROKEN issues + a handful of polish items. All addressed. ## #1: LIKE pattern false-positive on summary_preview The substring match ``payload LIKE '%RuntimeFloorViolation%'`` would false-positive on a worker's summary_preview that mentioned the class name in prose (e.g. "Acknowledged the prior RuntimeFloorViolation; now fixing..."). Counter would inflate incorrectly. Fix: anchor the LIKE to the JSON ``"kind":`` field exactly: ``payload LIKE '%"kind": "RuntimeFloorViolation"%'``. Applied to both _v6_7_count_prior_floor_rejections and the _v6_7_heartbeat_floor_status lookup. ## #2: started_at reclaim semantics undocumented verify_runtime_floor reads tasks.started_at which is set ONCE on first claim (COALESCE in claim_task) and NOT updated on reclaim. So a second-attempt worker may pass the floor by elapsed lifetime even if its actual attempt was fast. Floor was designed against first-attempt fabrication; reclaim usually means the chain has been at it for a while already. Fix: added a "Reclaim semantics" paragraph to the docstring of verify_runtime_floor noting the anchor + the design rationale. Future-fix would switch to task_runs.started_at for the active run if this becomes a real problem. ## #3: dead code + unnecessary deferred import in _v6_7_heartbeat_floor_status Old code had a no-op for-loop iterating violations to "find" data that was immediately overwritten from the tasks row, plus a local ``from hermes_cli.kanban_completion_gates import ROLE_RUNTIME_FLOORS_SECONDS`` despite the module already being imported at top. Fix: - Rewrote helper to do a single existence-check on the event (changed SELECT payload → SELECT 1 since we don't parse it anymore). - Removed the dead for-loop. - Hoisted ROLE_RUNTIME_FLOORS_SECONDS to the top-level import block. - Renamed function to _v6_7_heartbeat_floor_status (leading underscore for v6.7-internal convention) and kept the old name as an alias so existing callers keep working. ## Other polish - Escalation message: changed "REJECTED FOR THE 2-th TIME" to "REJECTED #2" — same machine-readable count, no grammar awkwardness. - Added 4 self-review gap-fill tests: - summary_preview with class name doesn't inflate counter - multi-violation event counts once (not twice) - count on unknown task returns zero - escalated message clamps negative seconds_remaining to "Wait 0s" 123/123 in test_kanban_completion_gates.py pass. 219/219 across full v6.7+v6.8 + adjacent regression set, zero failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jarvis-stark-ops merged commit 8ec396d into main Jun 6, 2026

This was referenced Jun 6, 2026

kanban_complete + patch tool schemas advertise weaker required than handler enforces #2

Open

v6.7 Tranche 1: kanban_complete verification gates (#28, #62, #64) #11

Merged

v6.7 Part 5: PR-existence + doc-drift gates (#63, #32) #14

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schema: express conditional-required constraints for grok-style tool callers#1

schema: express conditional-required constraints for grok-style tool callers#1
jarvis-stark-ops merged 1 commit into
mainfrom
grok-tool-call-tolerance

jarvis-stark-ops commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jarvis-stark-ops commented Jun 6, 2026

Summary

Why this matters for 1Team-Engineering

Test plan

Upstream

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant