schema: express conditional-required constraints for grok-style tool callers#1
Merged
Merged
Conversation
…callers
Two tool schemas advertised `required: []` (or just `["mode"]`) but
enforced additional requirements in the Python handler. Anthropic-tuned
models like Claude read the prose "REQUIRED PARAMETERS: …" hints and
emit the fields. xAI grok-4.3 reads the JSON Schema literally and omits
both/all of them, then loops on the Python validator's `tool_error`.
This blocked autonomous orchestration profiles (e.g. JARVIS in the
1Team-Engineering hermes-jarvis fleet) running on grok from completing
or patching anything.
Fix: express the real constraints in the schema using JSON Schema
2020-12's `anyOf` / `oneOf`. Both branches use only `required` so the
constraint is honored by every conformant validator.
- kanban_complete: anyOf summary or result (matches handler line 543)
- patch: oneOf mode=replace (path+old_string+new_string)
| mode=patch (patch)
(matches handler in _handle_patch)
Schema validation tested with jsonschema 4.25.1 (Draft 2020-12):
- kanban_complete: 5/5 cases match expected validity
- patch: 6/6 cases match expected validity
Handlers unchanged — they still tolerate both shapes for CLI legacy
callers that bypass the schema layer (e.g. `hermes kanban complete
--result "..."`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops
pushed a commit
that referenced
this pull request
Jun 10, 2026
…dict, reject respawn Independent review of the first Part 6 commit found three BROKEN issues plus several P1s. This commit addresses every one of them. ## BROKEN #1 — spawned review was stuck in todo (deadlock) `create_task` overrides `initial_status` whenever parents are present and any parent is not done. The umbrella stays `running` until the review approves, but the review can never be claimed because parent isn't done. Classic deadlock. Fix: spawn the integrative review as a PEER task, not a child of the umbrella. The relationship is tracked via the `archive_blocked_pending_integrative_review` event payload (`review_id`) and via title-prefix + tenant lookup. New review correctly lands in `ready` and tchalla can claim it immediately. ## BROKEN #2 — substring verdict matcher was exploitable The old check used `"verdict: approve" in result.lower()` which matches inside prose like "After consideration my verdict: approve would be wrong because...". Reviewers (or hostile patches) could satisfy the gate without ever issuing a canonical verdict. Fix: new `_v6_7_parse_verdict` uses a strict line-anchored regex `^\s*verdict:\s*(approve|reject)\b` with `re.MULTILINE`. First match wins, so `verdict: reject` before `verdict: approve` wins. Prose mentions mid-sentence don't anchor. ## BROKEN #3 — reject path had no re-spawn flow After a reject, `_v6_7_should_spawn_integrative_review` returned False (existing review existed) and the umbrella was permanently stuck. Operators had to manually delete the rejected review row. Fix: `_should_spawn` now considers a done-with-reject review as spawning-ready (after the orchestrator remediates). The next archive call spawns round 2 with title suffix `:r2` (and `:r3`, etc.). The event payload includes `supersedes` and `supersedes_verdict` so operators can audit the round transitions. ## Other findings addressed - BROKEN: `blocked` was in `terminal_statuses` — a blocked Friday means INCOMPLETE work. Removed `blocked` from the set; the gate now only treats `done` / `archived` as terminal for non-review children. - BROKEN: `_v6_7_should_spawn_integrative_review` had no `has_non_review_child` check — review-only chains pathologically spawned. Added the check. - BROKEN: Dead `IntegrativeReviewSpawned` dataclass was declared but never used. Deleted. - WEAK: Integrative-review children were counted as "non-review children" in the previous title-match. The new loop explicitly skips integrative-review children by title prefix. - WEAK: docstring for `archive_task` was missing. Added. ## Tests Rewrote the test file. 27 new tests (up from 14): - TestVerdictParser (8): line-anchored regex correctness, case- insensitive, prose-mention rejection, first-match-wins, empty - TestSpawnTriggerConditions (8): canonical chain → peer review spawned with `ready` status (the critical assertion that catches the deadlock); non-goal_mode archives; orphan archives; review- only does NOT spawn (corrected from prior test); no-review chain doesn't spawn; in-flight build doesn't spawn; blocked build child doesn't qualify as terminal - TestStateMachine (5): no double-spawn on in-flight, approve unblocks, reject triggers respawn on next archive (the critical fix), event emitted on pending, event includes `supersedes` on respawn - TestSpawnedReviewBody (4): scope items present, umbrella id in body, strict verdict format documented, workspace inheritance - TestVerdictBypassClosed (2): prose-approve does NOT unblock, canonical approve does unblock Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops
added a commit
that referenced
this pull request
Jun 10, 2026
…sResearch#62, NousResearch#64) Adds three pre-write-txn gates in `complete_task` mirroring the existing `_verify_created_cards` / `HallucinatedCardsError` pattern: - `verify_runtime_floor` (closes hermes-jarvis#64) — per-role floor on completed_at - started_at. Build 5min, review 90s, orchestration 0. - `verify_workspace_diff` (closes hermes-jarvis#62) — non-review workers on dir/worktree workspaces must produce a non-empty git diff against the tracking base. - `verify_no_stray_artifacts` (closes hermes-jarvis#28) — rejects *evidence*, commit-hash*, triage/*, tmp-*, and untracked no-extension/no-shebang files (the agent-dashboard PR #1 "all prior block evidence files" failure mode). Opt-outs via metadata (x_fast_justified / x_no_code / x_stray_ok) require ≥20-char string reasons and emit completion_opt_out_used audit events with verbatim reason. Truthy bools or short strings rejected with InvalidOptOutError. Context: hermes-jarvis#61 (bootstrap-paradox case study). 41 tests pass; 258 wider regression — zero failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops
pushed a commit
that referenced
this pull request
Jun 10, 2026
Adds two new completion gates that fire alongside the Part 1/2 gates. Closes hermes-jarvis#63 (PR-existence verification) Closes hermes-jarvis#32 (doc-drift check) Context: hermes-jarvis#61 (bootstrap-paradox case study) ## NousResearch#63 — verify_pr_urls_exist When a verdict result or summary contains a GitHub PR URL pattern, the dispatcher runs `gh pr view <url> --json number` to verify each URL resolves. Phantom URLs (404 with "Not Found" / "Could not resolve" / "no pull request" in stderr) reject the completion. Indeterminate cases (gh missing, network error, unauthenticated) fall open — workers can still complete in offline / broken-gh envs without being trapped. Catches the 2026-06-09 Tchalla case (hermes-jarvis#61): release-gate reviewer blocked with "cannot run gh pr diff 42" on PR NousResearch#42 that didn't exist. With this gate, his completion would have surfaced the phantom URL specifically, prompting him to surface the real cause. Opt-out: `metadata.x_phantom_pr_ok` with ≥20-char string reason. ## NousResearch#32 — verify_doc_drift For tasks whose tenant slug encodes a version (`marvel-swarm-vN-N-test`), the gate scans `README.md` / `README` in the workspace for older `vX.Y` mentions outside a history-style heading (`## History`, `## Older versions`, `## Previous versions`, `## Archive`). Stale mentions reject the completion. CHANGELOG.md is intentionally skipped — older versions are expected there by definition. Catches the 2026-06-09 agent-dashboard PR #1 case (hermes-jarvis#61): README still said "v6.2 Marvel swarm test target" while the chain was v6.6. No reviewer flagged it. Opt-out: `metadata.x_doc_drift_ok` with ≥20-char string reason. ## Tests 19 new tests added to `test_kanban_completion_gates.py`: - TestPRExistence (8) — no PR URL skipped, real passes, phantom rejects, mixed real+phantom flags only phantom, indeterminate falls open, summary scanned, dedup, opt-out - TestDocDrift (10) — non-versioned tenant skips, no README skips, stale README rejects, current README passes, History section excused, CHANGELOG file excused, higher version not stale, scratch skipped, opt-out 83 passed in test_kanban_completion_gates.py (up from 64). Zero regressions on adjacent paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops
pushed a commit
that referenced
this pull request
Jun 11, 2026
…tics, kill dead loop, rename helper Independent review of v6.8 Part 3 flagged three BROKEN issues + a handful of polish items. All addressed. ## #1: LIKE pattern false-positive on summary_preview The substring match ``payload LIKE '%RuntimeFloorViolation%'`` would false-positive on a worker's summary_preview that mentioned the class name in prose (e.g. "Acknowledged the prior RuntimeFloorViolation; now fixing..."). Counter would inflate incorrectly. Fix: anchor the LIKE to the JSON ``"kind":`` field exactly: ``payload LIKE '%"kind": "RuntimeFloorViolation"%'``. Applied to both _v6_7_count_prior_floor_rejections and the _v6_7_heartbeat_floor_status lookup. ## #2: started_at reclaim semantics undocumented verify_runtime_floor reads tasks.started_at which is set ONCE on first claim (COALESCE in claim_task) and NOT updated on reclaim. So a second-attempt worker may pass the floor by elapsed lifetime even if its actual attempt was fast. Floor was designed against first-attempt fabrication; reclaim usually means the chain has been at it for a while already. Fix: added a "Reclaim semantics" paragraph to the docstring of verify_runtime_floor noting the anchor + the design rationale. Future-fix would switch to task_runs.started_at for the active run if this becomes a real problem. ## #3: dead code + unnecessary deferred import in _v6_7_heartbeat_floor_status Old code had a no-op for-loop iterating violations to "find" data that was immediately overwritten from the tasks row, plus a local ``from hermes_cli.kanban_completion_gates import ROLE_RUNTIME_FLOORS_SECONDS`` despite the module already being imported at top. Fix: - Rewrote helper to do a single existence-check on the event (changed SELECT payload → SELECT 1 since we don't parse it anymore). - Removed the dead for-loop. - Hoisted ROLE_RUNTIME_FLOORS_SECONDS to the top-level import block. - Renamed function to _v6_7_heartbeat_floor_status (leading underscore for v6.7-internal convention) and kept the old name as an alias so existing callers keep working. ## Other polish - Escalation message: changed "REJECTED FOR THE 2-th TIME" to "REJECTED #2" — same machine-readable count, no grammar awkwardness. - Added 4 self-review gap-fill tests: - summary_preview with class name doesn't inflate counter - multi-violation event counts once (not twice) - count on unknown task returns zero - escalated message clamps negative seconds_remaining to "Wait 0s" 123/123 in test_kanban_completion_gates.py pass. 219/219 across full v6.7+v6.8 + adjacent regression set, zero failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
kanban_completeandpatchadvertise weakerrequiredthan the handler actually enforcesanyOf/oneOfWhy this matters for 1Team-Engineering
The autonomous JARVIS profile (commit ab2da27 on 1Team-Engineering/hermes-jarvis) runs grok-4.3 in production and was locked out of
kanban_completecalls. Banner profile hit the same onpatch. This is the minimum schema correction that unblocks grok without changing handler behavior.Test plan
python3 -c "from jsonschema import validate; ..."against the patched schemasUpstream
This fork lives at 1Team-Engineering for production use; an equivalent PR upstream to NousResearch/hermes-agent can follow once we've validated in production for a few days.
🤖 Generated with Claude Code