Skip to content

fix(ci): honor exact-head proof verdicts#83688

Merged
Takhoffman merged 1 commit into
mainfrom
fix/exact-head-proof-verdicts
May 18, 2026
Merged

fix(ci): honor exact-head proof verdicts#83688
Takhoffman merged 1 commit into
mainfrom
fix/exact-head-proof-verdicts

Conversation

@Takhoffman

Copy link
Copy Markdown
Contributor

Summary

Describe the problem and fix in 2-5 bullets:

  • Problem: Barnacle can remove proof: sufficient after ClawSweeper accepts real-behavior proof because Barnacle only trusts PR-body proof or proof: override.
  • Why it matters: ClawSweeper automerge can get stuck when the label is stripped and the required Real behavior proof check does not consume ClawSweeper's exact-head verdict.
  • What changed: The shared proof policy now recognizes ClawSweeper clawsweeper-verdict:pass markers only when they match the PR number and current head SHA; Barnacle preserves proof: sufficient for that exact-head verdict, and the required proof check accepts it.
  • What did NOT change (scope boundary): This does not weaken body-based proof validation, maintainer override behavior, or stale proof removal when the ClawSweeper verdict is missing or for a different head.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

External contributors must show after-fix evidence from a real OpenClaw setup. Unit tests, mocks, lint, typechecks, snapshots, and CI are supplemental only. Screenshots are encouraged even for CLI, console, text, or log changes; terminal screenshots and copied live output count. Be mindful of private information like IP addresses, API keys, phone numbers, non-public endpoints, or other private details when providing evidence.

  • Behavior or issue addressed: Barnacle proof-label synchronization and the required proof check now honor an exact-head ClawSweeper pass verdict.
  • Real environment tested: Local OpenClaw checkout on macOS with Node/pnpm repo test wrappers.
  • Exact steps or command run after this patch: pnpm test test/scripts/real-behavior-proof-policy.test.ts test/scripts/barnacle-auto-response.test.ts; pnpm exec oxfmt --check --threads=1 .github/workflows/real-behavior-proof.yml scripts/github/real-behavior-proof-policy.mjs scripts/github/real-behavior-proof-check.mjs scripts/github/barnacle-auto-response.mjs test/scripts/real-behavior-proof-policy.test.ts test/scripts/barnacle-auto-response.test.ts; git diff --check origin/main...HEAD.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output): Focused tests passed: 2 Vitest shards, 2 test files, 50 tests. Format check and diff whitespace check passed.
  • Observed result after fix: Tests prove exact-head ClawSweeper verdict markers are accepted only for the matching PR/head and Barnacle preserves proof: sufficient on synchronize when the marker is valid.
  • What was not tested: A live GitHub Actions run against a PR comment marker before opening this PR.
  • Before evidence (optional but encouraged): PR fix(models): respect auth order in picker label #83581 timeline showed Barnacle removing proof: sufficient after ClawSweeper applied it, while Real behavior proof remained failing.

Root Cause (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: Barnacle and the Real behavior proof workflow used the structured PR body proof policy but did not read ClawSweeper's durable exact-head verdict marker.
  • Missing detection / guardrail: No test covered the cross-bot contract where ClawSweeper accepts proof in a marker and Barnacle later reevaluates labels on synchronize.
  • Contributing context (if known): Barnacle intentionally removes stale proof: sufficient on PR updates, but it had no way to distinguish stale labels from an exact-head ClawSweeper pass.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: test/scripts/real-behavior-proof-policy.test.ts; test/scripts/barnacle-auto-response.test.ts
  • Scenario the test should lock in: Exact-head ClawSweeper pass markers count as proof only for the matching PR/head, and Barnacle preserves proof: sufficient on synchronize when that marker exists.
  • Why this is the smallest reliable guardrail: The bug was in local proof-policy and Barnacle label synchronization logic, so unit/tooling tests cover the contract without needing a live PR.
  • Existing test that already covers this (if any): None before this PR.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Automerge/proof automation can now proceed from an exact-head ClawSweeper proof verdict instead of racing over proof: sufficient label churn.

Diagram (if applicable)

Before:
ClawSweeper pass marker -> proof: sufficient -> Barnacle synchronize check -> label removed
PR body lacks structured proof -> Real behavior proof check fails

After:
ClawSweeper pass marker for current head -> Barnacle preserves proof: sufficient
ClawSweeper pass marker for current head -> Real behavior proof check passes

Security Impact (required)

  • New permissions/capabilities? (Yes/No) Yes
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) Yes
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) Yes
  • If any Yes, explain risk + mitigation: The proof workflow reads PR issue comments using the existing app token so it can find ClawSweeper's marker. The marker is accepted only when the PR number and exact 40-character head SHA match, limiting stale or copied-marker risk.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Local repo Node/pnpm test wrappers
  • Model/provider: N/A
  • Integration/channel (if any): GitHub Actions/Barnacle/ClawSweeper automation
  • Relevant config (redacted): N/A

Steps

  1. Add a ClawSweeper pass verdict marker for a matching PR number and head SHA.
  2. Run the proof policy and Barnacle auto-response tests.
  3. Confirm mismatched head SHAs are rejected and matching markers preserve proof: sufficient.

Expected

  • Matching exact-head ClawSweeper pass markers count as sufficient proof.
  • Mismatched markers do not count.
  • Barnacle preserves proof: sufficient only when the exact-head marker exists.

Actual

  • Focused tests passed with those assertions.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: Focused proof-policy tests and Barnacle label-sync tests passed locally.
  • Edge cases checked: Marker SHA mismatch does not count as proof.
  • What you did not verify: Live GitHub Actions execution before opening this PR.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: A copied ClawSweeper marker could otherwise be mistaken for proof.
    • Mitigation: The parser requires both the PR number and exact current head SHA to match.

@openclaw-barnacle openclaw-barnacle Bot added scripts Repository scripts size: S maintainer Maintainer-authored PR labels May 18, 2026
@clawsweeper

clawsweeper Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds exact-head ClawSweeper proof-verdict marker handling to the real-behavior-proof check, Barnacle proof-label synchronization, and related tests.

Reproducibility: yes. by source inspection and related PR timeline. Current main does not read ClawSweeper verdict comments in the required proof check, and the related PR at #83581 shows an exact-head pass marker while the Real behavior proof check remained failing.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: Not merge-ready because proof is mock-only and the patch weakens the proof-gate trust boundary unless verdict comments are authenticated.

Rank-up moves:

  • Accept only ClawSweeper app-authored verdict comments and add a forged-comment regression test.
  • Use a token with issue-comment read permission, or retry with GITHUB_TOKEN, when fetching proof comments.
  • Add live workflow output or redacted terminal/log proof showing the exact-head marker path after the fix.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

PR egg
🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature, rarity, or ASCII portrait is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

Real behavior proof
Needs real behavior proof before merge: Mock-only: the PR body reports focused tests, format, and diff checks, but no live Actions output or redacted terminal/log proof for the new comment-marker path; the contributor should add real proof and update the PR body or ask for @clawsweeper re-review.

Risk before merge
Why this matters: - As written, an external PR commenter could satisfy the required proof check by posting a hidden marker with the current PR number and head SHA.

  • The exact-head path may fail in Actions because the script prefers an app token requested only with permission-members: read before using the job token that has issues: read.
  • The PR body provides tests and format checks only, not live workflow output or redacted runtime evidence that the new GitHub comment-marker path works.

Maintainer options:

  1. Authenticate verdict markers before merge (recommended)
    Require the accepted marker comment to come from the ClawSweeper app or bot identity, add a forged-contributor-comment rejection test, and fix the issue-comment token path.
  2. Pause if comment markers are not the durable contract
    If maintainers do not want issue comments to be a trusted proof source, pause or close this approach and choose another proof handoff mechanism.

Next step before merge
Protected maintainer handling and missing real behavior proof are required, and the patch needs security-sensitive proof-marker trust fixes before merge.

Security
Needs attention: Needs attention: the new proof-verdict parser trusts unauthenticated issue-comment bodies as ClawSweeper pass markers.

Review findings

  • [P1] Authenticate ClawSweeper verdict comments — scripts/github/real-behavior-proof-policy.mjs:252-259
  • [P2] Use an issue-readable token for comment fetching — scripts/github/real-behavior-proof-check.mjs:54
Review details

Best possible solution:

Keep the exact-head verdict idea, but only trust markers from the ClawSweeper GitHub App or bot identity, use a token that can read issue comments with a safe fallback, and add forged-comment plus token-path regression coverage before merge.

Do we have a high-confidence way to reproduce the issue?

Yes by source inspection and related PR timeline. Current main does not read ClawSweeper verdict comments in the required proof check, and the related PR at #83581 shows an exact-head pass marker while the Real behavior proof check remained failing.

Is this the best way to solve the issue?

No. The direction is maintainable, but this implementation must authenticate the marker source and use a token with issue-comment read permission before it is safe to merge.

Label justifications:

  • P1: The PR changes required proof/automerge gating and currently contains a proof-gate trust-boundary bug that should block merge.
  • merge-risk: 🚨 security-boundary: Merging this PR could let untrusted PR commenters spoof a ClawSweeper verdict marker and bypass required proof.
  • merge-risk: 🚨 automation: Merging this PR could keep proof automation failing because the new comment-read path may use a token without issue-comment permission.

Full review comments:

  • [P1] Authenticate ClawSweeper verdict comments — scripts/github/real-behavior-proof-policy.mjs:252-259
    hasClawSweeperExactHeadProof scans every issue comment body and accepts a matching PR number plus head SHA, but it never verifies that the comment was posted by the ClawSweeper GitHub App or bot. An external contributor can post that hidden marker themselves and make the required Real behavior proof check pass without accepted proof, so the marker should be accepted only from a trusted ClawSweeper comment and covered with a forged-comment regression test.
    Confidence: 0.96
  • [P2] Use an issue-readable token for comment fetching — scripts/github/real-behavior-proof-check.mjs:54
    The new comment fetch prefers appToken whenever it exists, but the workflow creates that app token with only permission-members: read; the added job-level issues: read applies to GITHUB_TOKEN, which this code will skip. In the intended exact-head fallback path, a members-only app token can make the check throw on the comments API instead of reading the marker, so request permission-issues: read for the app token or retry with GITHUB_TOKEN on permission failure.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.91

Security concerns:

  • [high] Contributor comments can forge proof verdicts — scripts/github/real-behavior-proof-policy.mjs:252
    The parser accepts any comment containing a matching clawsweeper-verdict:pass marker, so a PR commenter can bypass the real-behavior-proof gate by posting the hidden marker with the current head SHA.
    Confidence: 0.96

What I checked:

  • Current-main proof check does not read verdict comments: On current main, the required proof check exits after PR-body proof or override evaluation and has no issue-comment marker path, matching the reported missing contract. (scripts/github/real-behavior-proof-check.mjs:48, 38f11a084475)
  • Current-main Barnacle removes sufficient proof on synchronize: Current Barnacle logic removes proof: sufficient on edited/synchronize events even when the PR body proof passed; blame ties this behavior to the existing automation implementation. (scripts/github/barnacle-auto-response.mjs:787, 67f8683ca30a)
  • PR parser trusts all comment bodies: The PR head scans every comment body for clawsweeper-verdict:pass and accepts a matching PR number plus SHA without checking the comment author or GitHub App identity. (scripts/github/real-behavior-proof-policy.mjs:252, 1ad980ae7421)
  • PR check prefers app token for comment reads: The PR head uses appToken || process.env.GITHUB_TOKEN before fetching issue comments, while the workflow's app token step requests only permission-members: read. (scripts/github/real-behavior-proof-check.mjs:54, 1ad980ae7421)
  • Workflow permission context: The PR adds job-level issues: read, but the app-token creation steps still request only permission-members: read, so the script's token preference matters. (.github/workflows/real-behavior-proof.yml:21, 1ad980ae7421)
  • GitHub App token contract: The actions/create-github-app-token documentation says permission-<permission> inputs explicitly grant selected permissions; this supports treating the members-only app token as insufficient for issue-comment reads unless issues permission is also requested.

Likely related people:

  • steipete: Git blame attributes the current real-behavior-proof policy, workflow skeleton, and Barnacle proof-label removal logic to Peter Steinberger's automation commit. (role: introduced current proof automation surface; confidence: high; commits: 67f8683ca30a; files: scripts/github/real-behavior-proof-policy.mjs, scripts/github/real-behavior-proof-check.mjs, scripts/github/barnacle-auto-response.mjs)
  • Dallin Romney: Recent history shows Dallin Romney added the maintainer-team app-token path in the same real-behavior-proof workflow, which is adjacent to this PR's comment-fetch token behavior. (role: recent area contributor; confidence: medium; commits: cf194419c315; files: .github/workflows/real-behavior-proof.yml)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 38f11a084475.

@Takhoffman Takhoffman merged commit e4fba78 into main May 18, 2026
106 of 111 checks passed
@clawsweeper clawsweeper Bot added the rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. label May 18, 2026
@Takhoffman Takhoffman deleted the fix/exact-head-proof-verdicts branch May 18, 2026 16:39
@clawsweeper clawsweeper Bot added status: 🛠️ actively grinding The PR author has acted after the latest ClawSweeper review and work remains. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. labels May 18, 2026
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 24, 2026
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 24, 2026
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 24, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
galiniliev pushed a commit to galiniliev/openclaw that referenced this pull request May 25, 2026
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 26, 2026
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 26, 2026
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 26, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maintainer Maintainer-authored PR merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. scripts Repository scripts size: S status: 🛠️ actively grinding The PR author has acted after the latest ClawSweeper review and work remains.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant