Skip to content

schema: express conditional-required constraints for grok-style tool callers#1

Merged
jarvis-stark-ops merged 1 commit into
mainfrom
grok-tool-call-tolerance
Jun 6, 2026
Merged

schema: express conditional-required constraints for grok-style tool callers#1
jarvis-stark-ops merged 1 commit into
mainfrom
grok-tool-call-tolerance

Conversation

@jarvis-stark-ops

Copy link
Copy Markdown
Collaborator

Summary

  • kanban_complete and patch advertise weaker required than the handler actually enforces
  • Anthropic models follow the prose hints; xAI grok-4.3 follows JSON Schema literally and omits the fields, then loops on the python validator
  • Patch expresses the real constraints via JSON Schema 2020-12 anyOf / oneOf
  • Handlers unchanged — CLI legacy paths still work

Why this matters for 1Team-Engineering

The autonomous JARVIS profile (commit ab2da27 on 1Team-Engineering/hermes-jarvis) runs grok-4.3 in production and was locked out of kanban_complete calls. Banner profile hit the same on patch. This is the minimum schema correction that unblocks grok without changing handler behavior.

Test plan

  • python3 -c "from jsonschema import validate; ..." against the patched schemas
  • kanban_complete: 5/5 expected validity outcomes
  • patch: 6/6 expected validity outcomes
  • End-to-end dispatch on a real grok task (will validate via JARVIS resume after merge)

Upstream

This fork lives at 1Team-Engineering for production use; an equivalent PR upstream to NousResearch/hermes-agent can follow once we've validated in production for a few days.

🤖 Generated with Claude Code

…callers

Two tool schemas advertised `required: []` (or just `["mode"]`) but
enforced additional requirements in the Python handler. Anthropic-tuned
models like Claude read the prose "REQUIRED PARAMETERS: …" hints and
emit the fields. xAI grok-4.3 reads the JSON Schema literally and omits
both/all of them, then loops on the Python validator's `tool_error`.

This blocked autonomous orchestration profiles (e.g. JARVIS in the
1Team-Engineering hermes-jarvis fleet) running on grok from completing
or patching anything.

Fix: express the real constraints in the schema using JSON Schema
2020-12's `anyOf` / `oneOf`. Both branches use only `required` so the
constraint is honored by every conformant validator.

- kanban_complete: anyOf summary or result (matches handler line 543)
- patch:           oneOf mode=replace (path+old_string+new_string)
                        | mode=patch  (patch)
                   (matches handler in _handle_patch)

Schema validation tested with jsonschema 4.25.1 (Draft 2020-12):
- kanban_complete: 5/5 cases match expected validity
- patch:           6/6 cases match expected validity

Handlers unchanged — they still tolerate both shapes for CLI legacy
callers that bypass the schema layer (e.g. `hermes kanban complete
--result "..."`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jarvis-stark-ops jarvis-stark-ops merged commit 8ec396d into main Jun 6, 2026
jarvis-stark-ops pushed a commit that referenced this pull request Jun 10, 2026
…dict, reject respawn

Independent review of the first Part 6 commit found three BROKEN
issues plus several P1s. This commit addresses every one of them.

## BROKEN #1 — spawned review was stuck in todo (deadlock)

`create_task` overrides `initial_status` whenever parents are present
and any parent is not done. The umbrella stays `running` until the
review approves, but the review can never be claimed because parent
isn't done. Classic deadlock.

Fix: spawn the integrative review as a PEER task, not a child of the
umbrella. The relationship is tracked via the
`archive_blocked_pending_integrative_review` event payload
(`review_id`) and via title-prefix + tenant lookup. New review
correctly lands in `ready` and tchalla can claim it immediately.

## BROKEN #2 — substring verdict matcher was exploitable

The old check used `"verdict: approve" in result.lower()` which
matches inside prose like "After consideration my verdict: approve
would be wrong because...". Reviewers (or hostile patches) could
satisfy the gate without ever issuing a canonical verdict.

Fix: new `_v6_7_parse_verdict` uses a strict line-anchored regex
`^\s*verdict:\s*(approve|reject)\b` with `re.MULTILINE`. First match
wins, so `verdict: reject` before `verdict: approve` wins. Prose
mentions mid-sentence don't anchor.

## BROKEN #3 — reject path had no re-spawn flow

After a reject, `_v6_7_should_spawn_integrative_review` returned
False (existing review existed) and the umbrella was permanently
stuck. Operators had to manually delete the rejected review row.

Fix: `_should_spawn` now considers a done-with-reject review as
spawning-ready (after the orchestrator remediates). The next archive
call spawns round 2 with title suffix `:r2` (and `:r3`, etc.). The
event payload includes `supersedes` and `supersedes_verdict` so
operators can audit the round transitions.

## Other findings addressed

- BROKEN: `blocked` was in `terminal_statuses` — a blocked Friday
  means INCOMPLETE work. Removed `blocked` from the set; the gate
  now only treats `done` / `archived` as terminal for non-review
  children.
- BROKEN: `_v6_7_should_spawn_integrative_review` had no
  `has_non_review_child` check — review-only chains pathologically
  spawned. Added the check.
- BROKEN: Dead `IntegrativeReviewSpawned` dataclass was declared but
  never used. Deleted.
- WEAK: Integrative-review children were counted as "non-review
  children" in the previous title-match. The new loop explicitly
  skips integrative-review children by title prefix.
- WEAK: docstring for `archive_task` was missing. Added.

## Tests

Rewrote the test file. 27 new tests (up from 14):
- TestVerdictParser (8): line-anchored regex correctness, case-
  insensitive, prose-mention rejection, first-match-wins, empty
- TestSpawnTriggerConditions (8): canonical chain → peer review
  spawned with `ready` status (the critical assertion that catches
  the deadlock); non-goal_mode archives; orphan archives; review-
  only does NOT spawn (corrected from prior test); no-review chain
  doesn't spawn; in-flight build doesn't spawn; blocked build child
  doesn't qualify as terminal
- TestStateMachine (5): no double-spawn on in-flight, approve
  unblocks, reject triggers respawn on next archive (the critical
  fix), event emitted on pending, event includes `supersedes` on
  respawn
- TestSpawnedReviewBody (4): scope items present, umbrella id
  in body, strict verdict format documented, workspace inheritance
- TestVerdictBypassClosed (2): prose-approve does NOT unblock,
  canonical approve does unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops added a commit that referenced this pull request Jun 10, 2026
…sResearch#62, NousResearch#64)

Adds three pre-write-txn gates in `complete_task` mirroring the
existing `_verify_created_cards` / `HallucinatedCardsError` pattern:

- `verify_runtime_floor` (closes hermes-jarvis#64) — per-role floor
  on completed_at - started_at. Build 5min, review 90s, orchestration 0.
- `verify_workspace_diff` (closes hermes-jarvis#62) — non-review
  workers on dir/worktree workspaces must produce a non-empty
  git diff against the tracking base.
- `verify_no_stray_artifacts` (closes hermes-jarvis#28) — rejects
  *evidence*, commit-hash*, triage/*, tmp-*, and untracked
  no-extension/no-shebang files (the agent-dashboard PR #1
  "all prior block evidence files" failure mode).

Opt-outs via metadata (x_fast_justified / x_no_code / x_stray_ok)
require ≥20-char string reasons and emit completion_opt_out_used
audit events with verbatim reason. Truthy bools or short strings
rejected with InvalidOptOutError.

Context: hermes-jarvis#61 (bootstrap-paradox case study).

41 tests pass; 258 wider regression — zero failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops pushed a commit that referenced this pull request Jun 10, 2026
Adds two new completion gates that fire alongside the Part 1/2 gates.

Closes hermes-jarvis#63 (PR-existence verification)
Closes hermes-jarvis#32 (doc-drift check)
Context: hermes-jarvis#61 (bootstrap-paradox case study)

## NousResearch#63 — verify_pr_urls_exist

When a verdict result or summary contains a GitHub PR URL pattern,
the dispatcher runs `gh pr view <url> --json number` to verify each
URL resolves. Phantom URLs (404 with "Not Found" / "Could not
resolve" / "no pull request" in stderr) reject the completion.

Indeterminate cases (gh missing, network error, unauthenticated) fall
open — workers can still complete in offline / broken-gh envs without
being trapped.

Catches the 2026-06-09 Tchalla case (hermes-jarvis#61): release-gate
reviewer blocked with "cannot run gh pr diff 42" on PR NousResearch#42 that
didn't exist. With this gate, his completion would have surfaced the
phantom URL specifically, prompting him to surface the real cause.

Opt-out: `metadata.x_phantom_pr_ok` with ≥20-char string reason.

## NousResearch#32 — verify_doc_drift

For tasks whose tenant slug encodes a version
(`marvel-swarm-vN-N-test`), the gate scans `README.md` / `README` in
the workspace for older `vX.Y` mentions outside a history-style
heading (`## History`, `## Older versions`, `## Previous versions`,
`## Archive`). Stale mentions reject the completion.

CHANGELOG.md is intentionally skipped — older versions are expected
there by definition.

Catches the 2026-06-09 agent-dashboard PR #1 case (hermes-jarvis#61):
README still said "v6.2 Marvel swarm test target" while the chain
was v6.6. No reviewer flagged it.

Opt-out: `metadata.x_doc_drift_ok` with ≥20-char string reason.

## Tests

19 new tests added to `test_kanban_completion_gates.py`:
- TestPRExistence (8) — no PR URL skipped, real passes, phantom
  rejects, mixed real+phantom flags only phantom, indeterminate falls
  open, summary scanned, dedup, opt-out
- TestDocDrift (10) — non-versioned tenant skips, no README skips,
  stale README rejects, current README passes, History section
  excused, CHANGELOG file excused, higher version not stale, scratch
  skipped, opt-out

83 passed in test_kanban_completion_gates.py (up from 64). Zero
regressions on adjacent paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarvis-stark-ops pushed a commit that referenced this pull request Jun 11, 2026
…tics, kill dead loop, rename helper

Independent review of v6.8 Part 3 flagged three BROKEN issues + a
handful of polish items. All addressed.

## #1: LIKE pattern false-positive on summary_preview

The substring match ``payload LIKE '%RuntimeFloorViolation%'`` would
false-positive on a worker's summary_preview that mentioned the
class name in prose (e.g. "Acknowledged the prior
RuntimeFloorViolation; now fixing..."). Counter would inflate
incorrectly.

Fix: anchor the LIKE to the JSON ``"kind":`` field exactly:
``payload LIKE '%"kind": "RuntimeFloorViolation"%'``. Applied to
both _v6_7_count_prior_floor_rejections and the
_v6_7_heartbeat_floor_status lookup.

## #2: started_at reclaim semantics undocumented

verify_runtime_floor reads tasks.started_at which is set ONCE on
first claim (COALESCE in claim_task) and NOT updated on reclaim.
So a second-attempt worker may pass the floor by elapsed lifetime
even if its actual attempt was fast. Floor was designed against
first-attempt fabrication; reclaim usually means the chain has
been at it for a while already.

Fix: added a "Reclaim semantics" paragraph to the docstring of
verify_runtime_floor noting the anchor + the design rationale.
Future-fix would switch to task_runs.started_at for the active
run if this becomes a real problem.

## #3: dead code + unnecessary deferred import in
_v6_7_heartbeat_floor_status

Old code had a no-op for-loop iterating violations to "find" data
that was immediately overwritten from the tasks row, plus a
local ``from hermes_cli.kanban_completion_gates import
ROLE_RUNTIME_FLOORS_SECONDS`` despite the module already being
imported at top.

Fix:
- Rewrote helper to do a single existence-check on the event
  (changed SELECT payload → SELECT 1 since we don't parse it
  anymore).
- Removed the dead for-loop.
- Hoisted ROLE_RUNTIME_FLOORS_SECONDS to the top-level import
  block.
- Renamed function to _v6_7_heartbeat_floor_status (leading
  underscore for v6.7-internal convention) and kept the old name
  as an alias so existing callers keep working.

## Other polish

- Escalation message: changed "REJECTED FOR THE 2-th TIME" to
  "REJECTED #2" — same machine-readable count, no grammar
  awkwardness.
- Added 4 self-review gap-fill tests:
  - summary_preview with class name doesn't inflate counter
  - multi-violation event counts once (not twice)
  - count on unknown task returns zero
  - escalated message clamps negative seconds_remaining to "Wait 0s"

123/123 in test_kanban_completion_gates.py pass. 219/219 across
full v6.7+v6.8 + adjacent regression set, zero failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant