Skip to content

feat(a2a): A2A Protocol — Mega-branch consolidation#43

Open
dgarson wants to merge 12 commits intodgarson/forkfrom
a2a-protocol
Open

feat(a2a): A2A Protocol — Mega-branch consolidation#43
dgarson wants to merge 12 commits intodgarson/forkfrom
a2a-protocol

Conversation

@dgarson
Copy link
Owner

@dgarson dgarson commented Feb 21, 2026

A2A Protocol — Mega-Branch PR

This is the consolidation PR for the entire A2A Protocol feature. It aggregates all sub-PRs once they've been reviewed and merged into the a2a-protocol branch.

Sub-PRs (targeting a2a-protocol):

Process

  1. Tim reviews each sub-PR (feat(a2a): Workstream A — Schema, types, and validator #37-41), leaving a comment before starting
  2. Issues are resolved, then sub-PRs merge into a2a-protocol
  3. This mega-PR stays open for David's final review
  4. Once David approves, this merges into dgarson/fork

What's Included

  • Schema & Validation: Zod-based A2A envelope schema with strict validation
  • Message Router: Pub/sub routing with rate limiting and circuit breaker patterns
  • Agent SDK: High-level send/receive helpers for agents
  • Audit Logging: Append-only audit log with querying and rotation
  • Integration Tests: End-to-end tests covering the full message lifecycle

@dgarson
Copy link
Owner Author

dgarson commented Feb 22, 2026

Run Codex sweep first, then request final architecture review.

dgarson added a commit that referenced this pull request Feb 22, 2026
WebhookManager (Wes): endpoint grid, delivery log table, expandable rows, add modal
ConversationHistory (Quinn): master/detail session archive, message replay, export

Sprint total: 44 views
@dgarson
Copy link
Owner Author

dgarson commented Feb 23, 2026

Architecture + Code Review (full pass on PR #43)

I did a full read of the A2A implementation + tests and also reviewed the non-A2A deltas included in this PR.

Overall verdict

Request changes before merge.

The A2A core is directionally strong (types/schema/validator/router/audit/tests are all present and reasonably cohesive), but there are several merge blockers and a few reliability gaps that should be fixed first.


✅ What looks good

  • Clean split of protocol layers: schema/types/validator/router/audit/SDK.
  • Router pipeline is easy to follow and test.
  • Circuit breaker includes both pair-rate and correlation-depth protection.
  • Integration tests cover realistic multi-agent flows (task lifecycle, review loop, rate limit, circuit breaker, audit integrity).

❌ Blockers

1) Scope contamination in a mega-branch that is labeled as A2A consolidation

This PR includes 21 non-A2A files in addition to A2A changes (session auto-labeling, sessions_send timeout config, config schema changes, retry-policy/export fixes, etc.).

Examples:

  • src/sessions/session-auto-label.ts
  • src/agents/tools/sessions-send-tool.ts
  • src/config/*
  • src/infra/retry-policy.ts

For a consolidation PR intended to represent A2A workstreams (#37#41), this significantly increases blast radius and makes regression triage harder.

Required fix: split non-A2A changes into separate PR(s), or rebase this branch to include only A2A scope.


2) Router has an unsafe no-validator path and inconsistent audit behavior

In src/gateway/a2a/router.ts:

  • route() allows validate to be omitted, then casts input as A2AMessage and dereferences message.from.agentId directly (lines 135-141). Invalid input can throw before controlled error handling/audit.
  • On validation_failed, router returns early without audit logging (lines 123-133), while all other non-success outcomes are audited.

This creates observability blind spots and can produce uncaught runtime failures when validation is disabled.

Required fix:

  • Make validation mandatory in production path (or add hard runtime guards before dereference).
  • Audit validation failures too (at least metadata + error summary).

3) SDK silently drops messages when send function is not set

In src/gateway/a2a/sdk.ts, send() returns the message even if sendFn is null (lines 88-93).

That is dangerous: callers can interpret this as successful send while nothing was delivered.

Required fix: throw explicit error when sendFn is unset (similar to the context guard pattern already used for getContext()).


⚠️ Important reliability gaps (should fix before merge if possible)

4) Circuit breaker pair-key collision risk

pairKey() uses "${sorted[0]}:${sorted[1]}" (src/gateway/a2a/circuit-breaker.ts). If agent IDs can include :, distinct pairs can collide.

Fix: use a collision-safe key encoding (e.g., JSON stringify tuple of sorted IDs, or a length-prefixed separator).

5) Envelope validation is looser than comments/spec imply

In src/gateway/a2a/validator.ts, timestamp is only checked as non-empty string (lines 126-139), despite error text claiming ISO 8601. Also top-level unknown envelope properties are not rejected.

Fix: enforce timestamp format and consider envelope additionalProperties: false behavior.

6) Audit query pagination bounds and corruption tolerance

In src/gateway/a2a/audit-query.ts, limit/offset are not clamped to non-negative (lines 117-118), and in audit.ts, one malformed JSONL line causes readLogFile() to throw entire file read (line 141).

Fix:

  • clamp limit >= 0, offset >= 0
  • tolerate malformed lines (skip+count/report) to keep query usable under partial log corruption.

Test coverage notes (gaps)

Current tests are solid on happy paths and key controls, but missing cases for:

  • validator-omitted router path with malformed input
  • validation-failed audit emission
  • SDK behavior when send function not registered
  • circuit-breaker pair-key collision inputs
  • negative pagination inputs in audit query
  • malformed JSONL line tolerance

Merge readiness

Given the blockers above, especially scope contamination + unsafe/no-audit behaviors, I do not recommend merging this as-is.

If you want, I can provide a concrete split plan (A2A-only PR cut + follow-up PRs for session-label/timeout/config changes) so this can be landed safely and quickly.

@dgarson
Copy link
Owner Author

dgarson commented Feb 23, 2026

Run Codex sweep first, then request final architecture review.

dgarson and others added 10 commits February 24, 2026 14:47
…it breaker

Implements the A2A message router, rate limiter, and circuit breaker.

Files:
- src/gateway/a2a/types.ts — Shared types (from Workstream A, will be consolidated)
- src/gateway/a2a/rate-limiter.ts — Per-agent rate limiting with configurable windows
- src/gateway/a2a/circuit-breaker.ts — Loop detection via correlation depth + pair message rates
- src/gateway/a2a/router.ts — Full routing pipeline: validate → self-send check → rate limit → circuit breaker → deliver → audit
- test/a2a/router.test.ts — 30 tests, all passing

Features:
- Pluggable delivery, audit, and validation functions
- Per-agent rate limiting (configurable max/window)
- Circuit breaker with correlation depth and pair message flood detection
- Order-independent pair tracking (A→B and B→A count together)
- Configurable cooldown for tripped circuits
- Metrics counters for monitoring
- Graceful audit failure handling (doesn't break routing)

Ref: /Users/openclaw/.openclaw/workspace/_shared/specs/a2a-communication-protocol.md
Implements the full type system and validation layer for the Agent-to-Agent
Communication Protocol (openclaw.a2a.v1).

Files:
- src/gateway/a2a/types.ts — TypeScript types for all 7 message types + envelope
- src/gateway/a2a/schema.ts — JSON Schema definitions for runtime validation (ajv)
- src/gateway/a2a/validator.ts — validateA2AMessage() with descriptive errors
- src/gateway/a2a/index.ts — Barrel export
- test/a2a/validator.test.ts — 82 tests, 100% pass

Includes semantic validation:
- task_response requires reason when action is declined/failed/blocked
- review_response requires unresolvedConcerns when verdict is changes_requested

Ref: /Users/openclaw/.openclaw/workspace/_shared/specs/a2a-communication-protocol.md
Implements the audit logging and query subsystem for the Agent-to-Agent
Communication Protocol.

Files:
- src/gateway/a2a/audit-types.ts — Types for audit entries, queries, results
- src/gateway/a2a/audit.ts — JSONL logger with daily + size-based rotation
- src/gateway/a2a/audit-query.ts — Query interface with filtering and pagination
- test/a2a/audit.test.ts — 31 tests, 100% pass

Features:
- Daily log rotation (a2a-YYYY-MM-DD.jsonl)
- Size-based rotation at 50MB
- Concurrent write safety via per-file locking
- Efficient date-range-based file scanning for queries
- Filter by agentId, type, date range, correlationId, priority
- Pagination with limit/offset

Ref: /Users/openclaw/.openclaw/workspace/_shared/specs/a2a-communication-protocol.md
listLogFiles was using naive lexicographic sort, which placed
a2a-YYYY-MM-DD.1.jsonl before a2a-YYYY-MM-DD.jsonl (because '.' < 'j').
This is wrong: the base file fills up first, then overflow .1, .2 etc.

Now parses the date and rotation index from filenames and sorts
base file before its overflow files within the same day.

Reviewed-by: Xavier (CTO)
End-to-end test suite covering all major multi-agent communication flows.
12 integration tests, all passing. 182 total tests across all workstreams.

Ref: /Users/openclaw/.openclaw/workspace/_shared/specs/a2a-communication-protocol.md
* feat: add acp-handoff skill — automatic completion-to-review handoff

Phase 1 implementation of the Automatic Completion Protocol (ACP).
When agents complete work, this skill automates:
- PR creation via gh CLI
- Reviewer auto-selection based on tier pipeline + squad affinity
- SLA tracking with escalation (P0: 15min, P1: 30min, P2: 2h, P3: 8h)
- SQLite-backed handoff state tracking
- Companion scripts for status checking, escalation, and completion

Scripts:
- acp-handoff.sh: Main handoff workflow
- acp-status.sh: Query pending/overdue reviews
- acp-escalate.sh: Cron-driven escalation checks
- acp-complete.sh: Mark reviews as done

Requested by David to fix the gap where agents say 'ready for review'
but no PR is opened and no reviewer is assigned.

* fix: handle fork workflow in acp-handoff.sh

Detect fork vs upstream remotes and use FORK_OWNER:branch format
for cross-fork PRs. Also handle gh pr create output format
(returns URL string, not JSON).

* fix: CRITICAL — hardcode PR target to origin, block upstream PRs

NEVER open PRs against openclaw/openclaw. Always target dgarson/clawdbot.
Added explicit safety check that blocks any attempt to target upstream.
Also updated WORK_PROTOCOL.md with new Section 6: PR Target Rule.

* fix: harden ACP script edge cases (empty sqlite JSON + PR parsing)
Comprehensive code review covering all 5 workstreams (schema/validator,
router, SDK, audit, integration tests) plus ACP handoff skill.

Identifies 5 critical issues (envelope allows unknown fields, timestamps
not validated as ISO 8601, SDK sends without validation, router validation
optional/fails-open, audit types duplicate source-of-truth), 6 moderate
issues (singleton SDK, no half-open circuit breaker state, unbounded
correlation tracking, fixed-window rate limiting, unbounded log rotation
loop, in-memory audit queries), and 6 minor/style issues.

Prioritized recommendations included.

https://claude.ai/code/session_01NydDWH98mCn3823zG5VRe8

Co-authored-by: Claude <noreply@anthropic.com>
dgarson added a commit that referenced this pull request Mar 3, 2026
…, SDK silent drop

- Router: Add structural guard before unsafe dereference when validator not provided
- Router: Add audit logging for validation_failed (was missing, creating blind spots)
- SDK: Throw explicit error when sendFn is unset (was silently returning message)

These are the 3 critical blockers from David's architecture review.
dgarson added a commit that referenced this pull request Mar 3, 2026
…, SDK

- Router: Add structural guard before unsafe dereference when validator not provided
- Router: Add audit logging for validation_failed (was missing, creating blind spots)
- SDK: Throw explicit error when sendFn is unset (was silently returning message)

These are the 3 critical blockers from David's architecture review.
Merged by Xavier to unblock A2A mega-branch progression.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant