Skip to content

refactor(workflows): consolidate duplicated safeSendMessage into executor-shared#1496

Merged
Wirasm merged 2 commits into
coleam00:devfrom
ericsoriano:refactor/consolidate-safe-send-message
May 18, 2026
Merged

refactor(workflows): consolidate duplicated safeSendMessage into executor-shared#1496
Wirasm merged 2 commits into
coleam00:devfrom
ericsoriano:refactor/consolidate-safe-send-message

Conversation

@ericsoriano

@ericsoriano ericsoriano commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: executor.ts and dag-executor.ts each define a near-identical copy of safeSendMessage and SendMessageContext. The UnknownErrorTracker guard was declared in executor.ts but is dormant — no callsite in either file instantiates a tracker today.
  • Why it matters: code duplication in a hot path (every workflow's platform-message dispatch). Future work to actually enable the abort guard in DAG execution becomes simpler once the helper has a single home.
  • What changed: Moved SendMessageContext, UnknownErrorTracker, and safeSendMessage into executor-shared.ts as exports. Both executor.ts and dag-executor.ts now import from there. No behavior change — neither file passes a tracker to safeSendMessage, so the guard remains dormant infrastructure (same as upstream/dev today).
  • What did NOT change (scope boundary): No new UnknownErrorTracker instantiations. No callsite signatures changed. No tests changed; existing suite covers all callsites. Activating the guard in DAG execution is appropriate follow-up work — see "Note for maintainer" below.

UX Journey

This is a refactor below the user-facing layer; runtime behavior is identical. The UX journey diagram below is what would happen if a tracker were eventually wired up — not what happens today.

Today (this PR)

DAG workflow runs ──▶ executor-shared.safeSendMessage(platform, conv, msg)
                       │ (no tracker passed — same as before this PR)
                       ├─ FATAL error  → throw (abort)        ✅ unchanged
                       ├─ TRANSIENT    → log, return false, continue   ✅ unchanged
                       └─ UNKNOWN      → log, return false, continue   ✅ unchanged

Future (if a follow-up wires up a tracker)

DAG workflow runs ──▶ executor-shared.safeSendMessage(platform, conv, msg, ctx, meta, tracker)
                       │
                       ├─ FATAL error  → throw (abort)
                       ├─ TRANSIENT    → log, reset tracker.count, continue
                       └─ UNKNOWN      → log, increment tracker.count;
                                          after UNKNOWN_ERROR_THRESHOLD consecutive UNKNOWNs:
                                              throw "3 consecutive unrecognized errors"

Architecture Diagram

Before

executor.ts                           executor-shared.ts                  dag-executor.ts
─────                                 ─────                               ─────
interface SendMessageContext          classifyError                       interface SendMessageContext  (DUP)
interface UnknownErrorTracker         detectCompletionSignal              async safeSendMessage         (DUP)
async safeSendMessage                 stripCompletionTags
                                      isInlineScript
                                      formatSubprocessFailure

After

executor.ts                           executor-shared.ts                  dag-executor.ts
─────                                 ─────                               ─────
imports ────────────▶                 [+] export interface SendMessageContext   ◀──────── imports
                                      [+] export interface UnknownErrorTracker
                                      [+] export async safeSendMessage
                                      classifyError (existing)
                                      … (other existing exports)

Connection inventory:

From To Status Notes
executor.ts executor-shared.ts modified adds imports for the three symbols; deletes local copies; inlines the small logSendError helper
dag-executor.ts executor-shared.ts modified adds imports for safeSendMessage + type SendMessageContext; deletes local copies
executor-shared.ts (consumers) modified newly exports the three symbols

Label Snapshot

  • Risk: risk: low
  • Size: size: M
  • Scope: workflows
  • Module: workflows:executor-shared

Change Metadata

  • Change type: refactor
  • Primary scope: workflows

Linked Issue

Validation Evidence (required)

bun run validate
  • check:bundled passes
  • type-check passes (all packages)
  • lint passes (0 errors, 0 warnings)
  • format:check passes
  • ✅ Test suite passes (per-package isolated)

Evidence: bun run validate exit code 0 against this branch (rebased onto upstream/dev at 2945f2ec, then re-validated after addressing review).

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — all callsites of safeSendMessage retain identical signatures and behavior. The function still accepts an optional UnknownErrorTracker (no callsite passes one today, in either file, on either upstream/dev or this branch).
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

The duplication was surfaced by running archon-architect against this codebase as the architectural-review test for PR #1479. The workflow's analysis identified safeSendMessage duplication as the most acute correctness drift in the executor pair. The resulting commit was filed as a verification PR on my fork (now closed) before being lifted to this upstream contribution and updated based on review.

  • Verified scenarios:
    • All callsites in both files compile + type-check correctly after the import-only changes.
    • Existing test suite passes — covers safeSendMessage happy path, FATAL throw, and TRANSIENT log+continue.
    • Rebase on upstream/dev produced one trivial import-block conflict in dag-executor.ts (combining a recently-added formatSubprocessFailure import with the new safeSendMessage / type SendMessageContext imports), resolved by keeping all four. No semantic conflict.
  • Edge cases checked:
    • The UnknownErrorTracker is and was dormant on both sides — no callsite instantiates one. The guard's threshold check (UNKNOWN_ERROR_THRESHOLD = 3) is unchanged from the pre-existing executor.ts logic.
    • Per review feedback: the tracker now correctly resets on any non-UNKNOWN outcome (not only on success), so a UNKNOWN → TRANSIENT → UNKNOWN → UNKNOWN sequence no longer trips the threshold incorrectly.
  • What was not verified:
    • Live simulation of the UNKNOWN_ERROR_THRESHOLD path. Code path verified statically; no callsite instantiates a tracker, so the path is unreachable in production today.

Side Effects / Blast Radius (required)

  • Affected subsystems: every DAG workflow's platform-send path (Slack, Telegram, web, etc.) and every legacy executor send. DAG executor hosts all modern Archon workflows.
  • Behavioral effects: none. Both files dispatch through safeSendMessage exactly as before; only the symbol's source file has changed.
  • Guardrails: existing logging ('platform_message_send_failed') and the new structured 'platform.critical_message_send_failed' event continue to surface every send failure for observability.

Rollback Plan (required)

  • Fast rollback command: git revert <merge-sha> — single commit, three files, no schema or external interface changes outside this scope.
  • Feature flags or config toggles: None — change is internal type/function relocation.
  • Observable failure symptoms: TypeScript build errors on executor.ts or dag-executor.ts (missing imports), or duplicated symbols in executor-shared.ts. CI's existing type-check step would catch all of these.

Risks and Mitigations

  • Risk: in-flight upstream refactors of executor.ts / dag-executor.ts produce additional import-block conflicts before merge.
    • Mitigation: this PR's diff is import-block + delete-only; conflicts will be mechanical to resolve. Recent maintainer activity in this area is concentrated on dag-executor.ts business logic, not the import block; risk is low.

Updates from review

CodeRabbit raised three items; addressed in commit amend:

  1. Reset tracker on non-UNKNOWN (executor-shared.ts:599): pre-existing bug inherited from executor.ts. Fixed — tracker now resets on any non-UNKNOWN error type, not only on success.
  2. Structured log event name (executor.ts:64): renamed 'Critical message send failed''platform.critical_message_send_failed' to match the repo's domain.action_state Pino convention.
  3. 📝 DAG path doesn't enable the UNKNOWN abort guard: see "Note for maintainer" below — kept this PR scoped to deduplication; happy to wire up the tracker if maintainers prefer.

Note for maintainer

I deliberately kept this PR scoped to extraction so the refactor is review-friendly and easy to revert. The UnknownErrorTracker guard was already dormant on upstream/dev before this PR (no callsite in executor.ts instantiated one); this PR preserves that state in the consolidated location.

If you'd prefer this PR also wire up the UnknownErrorTracker in dag-executor.ts (and possibly executor.ts's callsites) to actually deliver the abort behavior in production, happy to add that on top — just let me know which scope you'd prefer. Otherwise I'd suggest activation as a follow-up PR so the design choices (per-conversation tracker? per-DAG-run? where does it live?) get their own discussion.

Summary by CodeRabbit

  • Improvements

    • Centralized and hardened message-sending with richer failure diagnostics and consistent handling across workflows
    • Automatic protection against repeated unknown send errors via a configurable threshold
    • Enhanced logging for delivery failures including platform, error classification, message and contextual metadata
  • Tests

    • Added test coverage for success, transient, unknown, and fatal send failure behaviors and threshold handling

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Centralizes platform message-sending safety by moving safeSendMessage, SendMessageContext, and UnknownErrorTracker into packages/workflows/src/executor-shared.ts, updating dag-executor.ts and executor.ts to import and use the shared helper, and adding tests for safeSendMessage.

Changes

Shared message-safety centralization

Layer / File(s) Summary
Data Shape
packages/workflows/src/executor-shared.ts
Adds exported SendMessageContext (workflowId?, nodeName?) and exported UnknownErrorTracker (count: number).
Core Implementation
packages/workflows/src/executor-shared.ts
Adds exported safeSendMessage(platform, conversationId, message, context?, metadata?, unknownErrorTracker?) implementing platform.sendMessage wrapping, error classification (FATAL/TRANSIENT/UNKNOWN), structured logging, unknown-error counting with UNKNOWN_ERROR_THRESHOLD, tracker reset on non-UNKNOWN, rethrow on FATAL, and boolean success/false suppression policy.
Integration / Wiring
packages/workflows/src/dag-executor.ts, packages/workflows/src/executor.ts
Removes local SendMessageContext and local safeSendMessage implementations; import safeSendMessage and type SendMessageContext from executor-shared. executor.ts updates one critical-send failure path to inline getLog().error(...) using event label platform.critical_message_send_failed.
Tests
packages/workflows/src/executor-shared.test.ts
Adds describe('safeSendMessage') suite: verifies successful send resets tracker; TRANSIENT returns false without throwing; FATAL rethrows with prefixed message; UNKNOWN increments tracker and throws when threshold reached; TRANSIENT resets tracker; behavior when tracker omitted (UNKNOWN never throws).
Manifest
package.json
Touched in diff listing (package metadata/test wiring exposure).

Sequence Diagram(s)

sequenceDiagram
  participant Executor
  participant DagExecutor
  participant SafeSend as executor-shared.safeSendMessage
  participant Platform
  participant Logger

  Executor->>SafeSend: safeSendMessage(platform, convId, msg, context, meta, tracker)
  DagExecutor->>SafeSend: safeSendMessage(platform, convId, msg, context, meta, tracker)
  SafeSend->>Platform: platform.sendMessage(convId, msg, meta)
  alt success
    Platform-->>SafeSend: ack
    SafeSend-->>Executor: true
    SafeSend-->>DagExecutor: true
  else failure
    Platform-->>SafeSend: throws error
    SafeSend->>SafeSend: classifyError -> (FATAL / TRANSIENT / UNKNOWN)
    SafeSend->>Logger: log failure diagnostics (platformType, errorType, length, context, stack)
    alt FATAL
      SafeSend-->>Executor: throw Error (rethrow)
      SafeSend-->>DagExecutor: throw Error (rethrow)
    else UNKNOWN below-threshold
      SafeSend-->>Executor: false
      SafeSend-->>DagExecutor: false
    else UNKNOWN hit-threshold
      SafeSend-->>Executor: throw Error
      SafeSend-->>DagExecutor: throw Error
    else TRANSIENT
      SafeSend-->>Executor: false
      SafeSend-->>DagExecutor: false
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐇 I nibbled at the tangled thread,
Brought senders home to one warm bed.
Errors counted, logs in tune,
Messages hum beneath the moon.
Carrot-code, neat and fed.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: consolidating a duplicated safeSendMessage function and SendMessageContext interface from executor.ts and dag-executor.ts into executor-shared.ts.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with all required sections: Summary (problem/why/what changed/scope), UX Journey, Architecture Diagrams, Labels, Change Metadata, Linked Issues, Validation Evidence, Security Impact, Compatibility, Human Verification, Side Effects, Rollback Plan, Risks/Mitigations, and updates from review.
Linked Issues check ✅ Passed The PR consolidates infrastructure (safeSendMessage and UnknownErrorTracker) referenced in #1425 but does not implement the fixes for the two silent-failure bugs. The PR correctly scopes itself to deduplication only, with the author explicitly noting that wiring up the tracker is follow-up work. The PR maintains the status quo: no callsites currently instantiate the tracker.
Out of Scope Changes check ✅ Passed All changes are in-scope refactoring: moving SendMessageContext, UnknownErrorTracker, and safeSendMessage from executor.ts and dag-executor.ts to executor-shared.ts, with corresponding import updates. The PR deliberately excludes activating the UnknownErrorTracker guard in dag-executor and addresses only symbol consolidation and tracker reset-on-non-UNKNOWN bug fix.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/workflows/src/executor.ts (1)

53-65: Use a structured event name here.

'Critical message send failed' breaks the repo’s domain.action_state log naming pattern, which makes this path harder to query alongside the rest of the workflow logs.

As per coding guidelines, "Structured logging with Pino: use event naming format {domain}.{action}_{state}."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/executor.ts` around lines 53 - 65, Replace the
freeform log message string in the getLog().error call inside executor.ts with a
structured event name following the {domain}.{action}_{state} pattern (for
example, "workflow.message_send_failed"), keeping the existing metadata object
(err, conversationId, messageLength, errorType, platform.getPlatformType(),
context, attempt, maxRetries) so the human-readable message is preserved in the
payload if needed; locate the getLog().error invocation in the executor (the
block referencing err, conversationId, message.length, errorType,
platform.getPlatformType(), attempt, maxRetries) and update the second argument
to the structured event name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/workflows/src/dag-executor.ts`:
- Around line 77-79: The DAG execution path is still not enabling the UNKNOWN
abort guard because dag-executor.ts never creates or threads an
UnknownErrorTracker into its send sites; importing SendMessageContext and
safeSendMessage alone doesn't change behavior. Fix by instantiating an
UnknownErrorTracker (or accept one from the caller) in the relevant DAG flow in
dag-executor.ts and pass it into every safeSendMessage call (and into the
SendMessageContext if your shared API expects it), ensuring the UNKNOWN abort
guard is activated during DAG execution; update function signatures or call
sites (e.g., the functions that call safeSendMessage in this file) to accept and
forward the tracker where needed.

In `@packages/workflows/src/executor-shared.ts`:
- Around line 567-599: The UNKNOWN error counter (unknownErrorTracker.count) is
only reset on success so non-UNKNOWN failures (e.g., TRANSIENT or FATAL) do not
break "consecutive" streaks; update the catch block in the function handling
platform sends (use symbols classifyError, errorType, unknownErrorTracker,
UNKNOWN_ERROR_THRESHOLD) to reset unknownErrorTracker.count = 0 whenever
errorType !== 'UNKNOWN' (ensure this happens before threshold logic and before
throwing for FATAL) so only actual consecutive UNKNOWN errors increment and
trigger the threshold.

---

Nitpick comments:
In `@packages/workflows/src/executor.ts`:
- Around line 53-65: Replace the freeform log message string in the
getLog().error call inside executor.ts with a structured event name following
the {domain}.{action}_{state} pattern (for example,
"workflow.message_send_failed"), keeping the existing metadata object (err,
conversationId, messageLength, errorType, platform.getPlatformType(), context,
attempt, maxRetries) so the human-readable message is preserved in the payload
if needed; locate the getLog().error invocation in the executor (the block
referencing err, conversationId, message.length, errorType,
platform.getPlatformType(), attempt, maxRetries) and update the second argument
to the structured event name.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 426e215a-8059-4a91-98b2-bc947f4ffcaf

📥 Commits

Reviewing files that changed from the base of the PR and between 2945f2e and 7e3bcd6.

📒 Files selected for processing (3)
  • packages/workflows/src/dag-executor.ts
  • packages/workflows/src/executor-shared.ts
  • packages/workflows/src/executor.ts

Comment on lines +77 to 79
safeSendMessage,
type SendMessageContext,
} from './executor-shared';

@coderabbitai coderabbitai Bot Apr 29, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The DAG path still never enables the UNKNOWN abort guard.

Importing the shared helper alone does not change behavior here: this file never creates or passes an UnknownErrorTracker, so every safeSendMessage(...) call still suppresses UNKNOWN failures indefinitely. If the goal of this refactor is to stop silent looping in DAG execution, the tracker needs to be threaded through these send sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/dag-executor.ts` around lines 77 - 79, The DAG
execution path is still not enabling the UNKNOWN abort guard because
dag-executor.ts never creates or threads an UnknownErrorTracker into its send
sites; importing SendMessageContext and safeSendMessage alone doesn't change
behavior. Fix by instantiating an UnknownErrorTracker (or accept one from the
caller) in the relevant DAG flow in dag-executor.ts and pass it into every
safeSendMessage call (and into the SendMessageContext if your shared API expects
it), ensuring the UNKNOWN abort guard is activated during DAG execution; update
function signatures or call sites (e.g., the functions that call safeSendMessage
in this file) to accept and forward the tracker where needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right that importing the shared helper alone doesn't activate the guard, and my original commit message overclaimed. The PR description and commit message have been updated to honestly describe this as deduplication-only; the UnknownErrorTracker remains dormant infrastructure on both sides (same as upstream/dev today, where executor.ts declares the interface but never instantiates one either).

Note for maintainer (cc'd in the PR description): I deliberately scoped this PR to extraction so the refactor is review-friendly and easy to revert. If you'd prefer this PR also wire up the tracker in dag-executor.ts (and possibly executor.ts's callsites) to actually deliver the abort behavior in production, happy to add that on top — just let me know which scope you'd prefer. Otherwise I'd suggest activation as a follow-up PR so the design choices (per-conversation tracker? per-DAG-run? where does it live?) get their own discussion.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericsoriano, thanks for the clear explanation and for correcting the PR description — that kind of transparency makes review much easier.

The deduplication-only scope is a perfectly reasonable call. The design questions you raised (tracker lifetime: per-conversation vs. per-DAG-run, ownership, where it lives) are genuinely non-trivial, and bundling those decisions into the same PR as the extraction would make both harder to review or revert independently.

The remaining observation — that executor.ts also never actually instantiates an UnknownErrorTracker despite defining the interface — reinforces that activation is a distinct piece of work that warrants its own PR and discussion.

Would you like me to open a follow-up GitHub issue to track the activation work (wiring UnknownErrorTracker into both executor.ts and dag-executor.ts call sites, with a design note on tracker lifetime/ownership)? That way the decision doesn't get lost after this PR merges.

(^_^)


✏️ Learnings added
Learnt from: ericsoriano
Repo: coleam00/Archon PR: 1496
File: packages/workflows/src/dag-executor.ts:77-79
Timestamp: 2026-04-29T18:07:33.180Z
Learning: In `coleam00/Archon` `packages/workflows/src/dag-executor.ts` and `packages/workflows/src/executor.ts`: `UnknownErrorTracker` is intentionally dormant/never instantiated on both sides as of PR `#1496`. PR `#1496` was scoped to extraction/deduplication only (moving `safeSendMessage`, `SendMessageContext`, `UnknownErrorTracker` into `executor-shared.ts`). Activating the abort guard (actually wiring `UnknownErrorTracker` into call sites) is deliberately deferred to a follow-up PR; open design questions include tracker lifetime (per-conversation vs. per-DAG-run) and ownership location.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: atlas-architect
Repo: coleam00/Archon PR: 0
File: :0-0
Timestamp: 2026-04-28T21:00:22.937Z
Learning: In `packages/workflows/src/dag-executor.test.ts` (Bun test suite): always wrap `expect(...)` assertions and `spy.mockRestore()` in a `try/finally` block so the spy is restored even when the assertion throws. A leaked `execFileAsync` spy bleeds into downstream unrelated tests and produces misleading outputs (e.g. `'Use: ok'`). Pattern: `try { expect(spy).toHaveBeenCalledWith(...) } finally { spy.mockRestore() }`.

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-28T11:00:25.790Z
Learning: Applies to **/*.{ts,tsx} : Use `execFileAsync` (not `exec`) when calling git or external commands directly from TypeScript; handle expected failures gracefully (missing directories during cleanup); surface git errors to users for actionable issues (conflicts, uncommitted changes)

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-28T11:00:25.790Z
Learning: Applies to packages/workflows/src/**/*.ts : archon/workflows: workflow engine (loader, router, executor, DAG, logger, bundled defaults); depends only on archon/git + archon/paths + archon/providers/types + hono/zod-openapi + zod; DB/AI/config injected via `WorkflowDeps`

Learnt from: Wirasm
Repo: coleam00/Archon PR: 0
File: :0-0
Timestamp: 2026-04-28T10:42:58.632Z
Learning: In the `coleam00/Archon` repo, the `archon/providers` import boundary documented in `CLAUDE.md` is aspirational, not enforced. As of PR `#1463`, `archon/workflows` already directly imports registry functions (`getProviderCapabilities`, `getRegisteredProviders`, `isRegisteredProvider`) from `archon/providers` (not `/types`) in `packages/workflows/src/dag-executor.ts`, `packages/workflows/src/executor.ts`, and `packages/workflows/src/validator.ts`. Registry queries from `archon/providers` in workflow-package files are acceptable consistent with this established precedent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open a follow-up GitHub issue to track the activation work (wiring UnknownErrorTracker into both executor.ts and dag-executor.ts

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

gh issue create \
  --repo coleam00/Archon \
  --assignee ericsoriano \
  --title "feat(workflows): activate UnknownErrorTracker abort guard in executor.ts and dag-executor.ts" \
  --body "## Summary

This issue tracks the follow-up activation work deferred from `#1496`.

PR `#1496` extracted \`safeSendMessage\`, \`SendMessageContext\`, and \`UnknownErrorTracker\` into \`executor-shared.ts\` (deduplication only). As of that PR, \`UnknownErrorTracker\` remains dormant/never instantiated on **both** sides — \`executor.ts\` and \`dag-executor.ts\` — meaning the consecutive-UNKNOWN abort guard is available as infrastructure but not active in production.

## Work required

- Wire \`UnknownErrorTracker\` into the \`safeSendMessage\` call sites in \`packages/workflows/src/executor.ts\`
- Wire \`UnknownErrorTracker\` into the \`safeSendMessage\` call sites in \`packages/workflows/src/dag-executor.ts\`

## Open design questions to resolve before implementing

1. **Tracker lifetime**: should the tracker be scoped per-conversation or per-DAG-run?
2. **Ownership location**: where should the tracker be instantiated and held — at the top-level executor, per-node, or passed in via \`WorkflowDeps\`?
3. **Behavior on abort**: what should happen when the threshold is hit in DAG execution — fail the current node, fail the entire DAG run, or surface to the user for a retry decision?

## References

- PR: https://github.com/coleam00/Archon/pull/1496
- Comment thread: https://github.com/coleam00/Archon/pull/1496#discussion_r3163035187
- Requested by: `@ericsoriano`
"

Length of output: 77


🏁 Script executed:

gh issue create \
  --repo coleam00/Archon \
  --title "feat(workflows): activate UnknownErrorTracker abort guard in executor.ts and dag-executor.ts" \
  --body "## Summary

This issue tracks the follow-up activation work deferred from `#1496`.

PR `#1496` extracted \`safeSendMessage\`, \`SendMessageContext\`, and \`UnknownErrorTracker\` into \`executor-shared.ts\` (deduplication only). As of that PR, \`UnknownErrorTracker\` remains dormant/never instantiated on **both** sides — \`executor.ts\` and \`dag-executor.ts\` — meaning the consecutive-UNKNOWN abort guard is available as infrastructure but not active in production.

## Work required

- Wire \`UnknownErrorTracker\` into the \`safeSendMessage\` call sites in \`packages/workflows/src/executor.ts\`
- Wire \`UnknownErrorTracker\` into the \`safeSendMessage\` call sites in \`packages/workflows/src/dag-executor.ts\`

## Open design questions to resolve before implementing

1. **Tracker lifetime**: should the tracker be scoped per-conversation or per-DAG-run?
2. **Ownership location**: where should the tracker be instantiated and held — at the top-level executor, per-node, or passed in via \`WorkflowDeps\`?
3. **Behavior on abort**: what should happen when the threshold is hit in DAG execution — fail the current node, fail the entire DAG run, or surface to the user for a retry decision?

## References

- PR: https://github.com/coleam00/Archon/pull/1496
- Comment thread: https://github.com/coleam00/Archon/pull/1496#discussion_r3163035187
- Requested by: `@ericsoriano`
"

Length of output: 47


@ericsoriano, done! The follow-up issue has been opened here: #1512

It captures:

  • The two files that need wiring (executor.ts and dag-executor.ts)
  • The three open design questions (tracker lifetime, ownership location, abort behavior)
  • Backlinks to PR #1496 and this comment thread

Note: I wasn't able to auto-assign it due to integration permissions — you may want to assign it to yourself directly on GitHub.


🧠 Learnings used
Learnt from: ericsoriano
Repo: coleam00/Archon PR: 1496
File: packages/workflows/src/dag-executor.ts:77-79
Timestamp: 2026-04-29T18:07:33.180Z
Learning: In `coleam00/Archon` `packages/workflows/src/dag-executor.ts` and `packages/workflows/src/executor.ts`: `UnknownErrorTracker` is intentionally dormant/never instantiated on both sides as of PR `#1496`. PR `#1496` was scoped to extraction/deduplication only (moving `safeSendMessage`, `SendMessageContext`, `UnknownErrorTracker` into `executor-shared.ts`). Activating the abort guard (actually wiring `UnknownErrorTracker` into call sites) is deliberately deferred to a follow-up PR; open design questions include tracker lifetime (per-conversation vs. per-DAG-run) and ownership location.

Learnt from: atlas-architect
Repo: coleam00/Archon PR: 0
File: :0-0
Timestamp: 2026-04-28T21:00:22.937Z
Learning: In `packages/workflows/src/dag-executor.test.ts` (Bun test suite): always wrap `expect(...)` assertions and `spy.mockRestore()` in a `try/finally` block so the spy is restored even when the assertion throws. A leaked `execFileAsync` spy bleeds into downstream unrelated tests and produces misleading outputs (e.g. `'Use: ok'`). Pattern: `try { expect(spy).toHaveBeenCalledWith(...) } finally { spy.mockRestore() }`.

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-28T11:00:25.790Z
Learning: Applies to **/*.{ts,tsx} : Use `execFileAsync` (not `exec`) when calling git or external commands directly from TypeScript; handle expected failures gracefully (missing directories during cleanup); surface git errors to users for actionable issues (conflicts, uncommitted changes)

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-28T11:00:25.790Z
Learning: Applies to packages/workflows/src/**/*.ts : archon/workflows: workflow engine (loader, router, executor, DAG, logger, bundled defaults); depends only on archon/git + archon/paths + archon/providers/types + hono/zod-openapi + zod; DB/AI/config injected via `WorkflowDeps`

Learnt from: Wirasm
Repo: coleam00/Archon PR: 0
File: :0-0
Timestamp: 2026-04-28T10:42:58.632Z
Learning: In the `coleam00/Archon` repo, the `archon/providers` import boundary documented in `CLAUDE.md` is aspirational, not enforced. As of PR `#1463`, `archon/workflows` already directly imports registry functions (`getProviderCapabilities`, `getRegisteredProviders`, `isRegisteredProvider`) from `archon/providers` (not `/types`) in `packages/workflows/src/dag-executor.ts`, `packages/workflows/src/executor.ts`, and `packages/workflows/src/validator.ts`. Registry queries from `archon/providers` in workflow-package files are acceptable consistent with this established precedent.

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-28T11:00:25.790Z
Learning: When a process cannot reliably distinguish 'actively running elsewhere' from 'orphaned by a crash' (started by a different process or input source), it must not autonomously mark work as failed/cancelled/abandoned based on a timer or staleness guess; surface the ambiguous state to the user and provide a one-click action (No Autonomous Lifecycle Mutation Across Process Boundaries)

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-28T11:00:25.790Z
Learning: Applies to packages/providers/src/**/*.ts : archon/providers: AI agent providers (Claude, Codex, Pi community) — owns SDK deps; `IAgentProvider` interface and `sendQuery()` contract; `archon/providers/types` is the contract subpath (zero SDK deps) that archon/workflows imports; providers receive raw `nodeConfig` + `assistantConfig` and translate to SDK-specific options internally

Comment thread packages/workflows/src/executor-shared.ts
ericsoriano added a commit to ericsoriano/Archon that referenced this pull request Apr 29, 2026
executor.ts and dag-executor.ts each defined a near-identical copy of
safeSendMessage and SendMessageContext. Move both into executor-shared.ts
as exports and import from there. The UnknownErrorTracker guard is also
relocated and exported, but remains dormant infrastructure: no callsite
in either file instantiates a tracker today, so this PR delivers pure
deduplication with no behavioral change. Activating the guard in DAG
execution is appropriate follow-up work.

Also addresses two issues surfaced during code review (coleam00#1496):
- Reset UnknownErrorTracker.count on any non-UNKNOWN error (was only
  resetting on success), so a sequence like UNKNOWN -> TRANSIENT ->
  UNKNOWN -> UNKNOWN no longer trips the threshold incorrectly.
- Use the structured 'platform.critical_message_send_failed' event name
  in the inlined log call instead of the prose 'Critical message send
  failed' string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ericsoriano ericsoriano force-pushed the refactor/consolidate-safe-send-message branch from 7e3bcd6 to fec4987 Compare April 29, 2026 18:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/workflows/src/executor-shared.ts (1)

573-584: Align shared log event name with project logging convention.

Use {domain}.{action}_{state} format for this event key to stay consistent with logging standards.

♻️ Suggested diff
-      'platform_message_send_failed'
+      'platform.message_send_failed'

As per coding guidelines, structured logging with Pino must use event naming format {domain}.{action}_{state}.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/executor-shared.ts` around lines 573 - 584, The log
event key 'platform_message_send_failed' does not follow the
{domain}.{action}_{state} convention; update the getLog().error call (the call
site using getLog(), platform.getPlatformType(), err, conversationId,
message.length, etc.) to use a dotted domain format such as
'platform.message_send_failed' (or the correct domain token for your project)
and update any other matching occurrences to the same convention; run tests/lint
to ensure consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/workflows/src/executor-shared.ts`:
- Around line 573-584: The log event key 'platform_message_send_failed' does not
follow the {domain}.{action}_{state} convention; update the getLog().error call
(the call site using getLog(), platform.getPlatformType(), err, conversationId,
message.length, etc.) to use a dotted domain format such as
'platform.message_send_failed' (or the correct domain token for your project)
and update any other matching occurrences to the same convention; run tests/lint
to ensure consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd0e40cb-c111-4a4b-b306-bcd5d971863e

📥 Commits

Reviewing files that changed from the base of the PR and between 7e3bcd6 and fec4987.

📒 Files selected for processing (3)
  • packages/workflows/src/dag-executor.ts
  • packages/workflows/src/executor-shared.ts
  • packages/workflows/src/executor.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/workflows/src/dag-executor.ts

@Wirasm

Wirasm commented May 1, 2026

Copy link
Copy Markdown
Collaborator

Review Summary

Verdict: minor-fixes-needed

This PR deduplicates safeSendMessage, SendMessageContext, and UnknownErrorTracker into executor-shared.ts. The refactor is clean and type-safe — no bugs, no breaking changes, and error handling is solid throughout. One new behavioral feature (consecutive-UNKNOWN tracking) needs a dedicated unit test. A few LOW polish items are also noted.

Blocking issues

  • (none — no CRITICAL findings)

Suggested fixes

  • executor-shared.test.ts: safeSendMessage lacks direct unit tests. The consecutive-UNKNOWN tracking logic and parameter reorder (metadata before unknownErrorTracker) aren't exercised by the existing DAG tests (which mock platform.sendMessage to always succeed). Add a unit test that directly imports and calls safeSendMessage, mocking platform.sendMessage to throw classified errors. Cover:
    • UNKNOWN error increments tracker, returns false; third consecutive UNKNOWN throws
    • TRANSIENT returns false without throwing
    • FATAL rethrows
    • Success resets tracker to 0
    • Call sequence [UNKNOWN, TRANSIENT, UNKNOWN] — tracker resets on TRANSIENT, final UNKNOWN counts as "1" (no throw)
    • Calls without unknownErrorTracker (the DAG executor path) still work correctly

Minor / nice-to-have

  • executor-shared.ts:563 — Trim the safeSendMessage docstring. The multi-paragraph comment is mostly redundant with the type signature and function name. Keep the "fatal errors rethrown" invariant (it's the non-obvious part), drop the rest:

    /** Sends a message; returns false on transient/unknown failures. FATAL errors are always rethrown. */
  • executor-shared.ts: Remove the decorative section header // ─── Platform Message Sending ──── and the redundant /** Context for platform message sending */ comment on SendMessageContext — the names already convey this.

  • executor-shared.ts:539safeSendMessage's unknownErrorTracker param is never passed by dag-executor.ts (dead code in the dag path). Add a comment on the signature: /** unknownErrorTracker: used only by executor.ts — dag-executor.ts does not use consecutive-UNKNOWN tracking */, or remove it from the dag-side signature if you'd prefer the divergence explicit.

  • executor.ts:32–33 vs executor-shared.ts:539SendMessageContext field naming: executor.ts uses stepName, the shared definition uses nodeName. Optional: rename to nodeName in executor.ts for consistency, or add a comment noting per-caller naming intent.

  • Monitoring: The log event name changed from dag_node_message_send_failed (dag-executor) to platform_message_send_failed (shared). If you have log queries or dashboards filtering on the old name, update them accordingly.

Compliments

  • The reset-tracker comment in executor-shared.ts (// Reset tracker on any non-UNKNOWN outcome — only *consecutive* UNKNOWN errors should trip the threshold) captures a non-obvious semantic rule clearly.
  • Error handling throughout is correct — FATAL errors rethrow, TRANSIENT/UNKNOWN return false, and every failure path logs with full context before returning. No silent swallows.
  • The PR scope is tight: 3 files, one subsystem, explicitly bounded. The template documentation with Before/After diagrams and architecture table is thorough.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

ericsoriano and others added 2 commits May 4, 2026 10:09
executor.ts and dag-executor.ts each defined a near-identical copy of
safeSendMessage and SendMessageContext. Move both into executor-shared.ts
as exports and import from there. The UnknownErrorTracker guard is also
relocated and exported, but remains dormant infrastructure: no callsite
in either file instantiates a tracker today, so this PR delivers pure
deduplication with no behavioral change. Activating the guard in DAG
execution is appropriate follow-up work.

Also addresses two issues surfaced during code review (coleam00#1496):
- Reset UnknownErrorTracker.count on any non-UNKNOWN error (was only
  resetting on success), so a sequence like UNKNOWN -> TRANSIENT ->
  UNKNOWN -> UNKNOWN no longer trips the threshold incorrectly.
- Use the structured 'platform.critical_message_send_failed' event name
  in the inlined log call instead of the prose 'Critical message send
  failed' string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the six cases requested in the PR review:
- Success resets UnknownErrorTracker to 0 and returns true
- TRANSIENT error returns false without throwing
- FATAL error rethrows with 'Platform authentication/permission error'
- UNKNOWN error increments tracker and returns false below threshold
- Three consecutive UNKNOWN errors trip the threshold and throw
- TRANSIENT resets tracker so a subsequent UNKNOWN does not trip threshold
- Calls without unknownErrorTracker (DAG executor path) never throw

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ericsoriano ericsoriano force-pushed the refactor/consolidate-safe-send-message branch from e549687 to f1873a1 Compare May 4, 2026 17:11

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/workflows/src/executor.ts (1)

45-69: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep critical sends on the shared escalation policy.

This retry path now bypasses safeSendMessage, so it suppresses cases the shared helper escalates: FATAL only breaks the loop here, and consecutive UNKNOWN failures never trip the shared abort threshold. Because several callers continue after a false return, platform auth/config issues can degrade into silent notification loss instead of surfacing immediately. Please route this through safeSendMessage, or explicitly document why downgrading those failures is safe for critical notifications.

As per coding guidelines, "Document fallback behavior with a comment when a fallback is intentional and safe; otherwise throw".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/executor.ts` around lines 45 - 69, The retry loop
currently calls platform.sendMessage directly and bypasses the shared escalation
logic in safeSendMessage, which suppresses escalation for FATAL/UNKNOWN errors;
update the loop to call the shared helper (safeSendMessage) instead of
platform.sendMessage so the common abort/escalation behavior is preserved (or if
you intentionally want different behavior, add a clear comment documenting why
downgrading is safe). Concretely, replace the
platform.sendMessage(conversationId, message, metadata) call inside the for-loop
with safeSendMessage(...) (passing the same conversationId, message, metadata
and context), remove or adapt local classifyError-based suppression, and ensure
FATAL errors are propagated or handled by the shared helper rather than silently
breaking the loop.
🧹 Nitpick comments (1)
packages/workflows/src/executor.ts (1)

64-64: Update log consumers for the renamed failure event.

This label change will break any dashboards or alerts still querying the previous send-failure event name. Make sure monitoring includes platform.critical_message_send_failed before this lands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/executor.ts` at line 64, The renamed failure event
label has changed to 'platform.critical_message_send_failed'; update any
monitoring, dashboards, alerts and code that query or emit the old event name to
use 'platform.critical_message_send_failed' instead. Search for occurrences of
the old label in logging/telemetry consumers and in your emitter/handler
functions (the code paths that call the log/emit functions in executor.ts) and
replace them with 'platform.critical_message_send_failed', then run smoke tests
on dashboards/alerts to confirm they pick up events under the new name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@packages/workflows/src/executor.ts`:
- Around line 45-69: The retry loop currently calls platform.sendMessage
directly and bypasses the shared escalation logic in safeSendMessage, which
suppresses escalation for FATAL/UNKNOWN errors; update the loop to call the
shared helper (safeSendMessage) instead of platform.sendMessage so the common
abort/escalation behavior is preserved (or if you intentionally want different
behavior, add a clear comment documenting why downgrading is safe). Concretely,
replace the platform.sendMessage(conversationId, message, metadata) call inside
the for-loop with safeSendMessage(...) (passing the same conversationId,
message, metadata and context), remove or adapt local classifyError-based
suppression, and ensure FATAL errors are propagated or handled by the shared
helper rather than silently breaking the loop.

---

Nitpick comments:
In `@packages/workflows/src/executor.ts`:
- Line 64: The renamed failure event label has changed to
'platform.critical_message_send_failed'; update any monitoring, dashboards,
alerts and code that query or emit the old event name to use
'platform.critical_message_send_failed' instead. Search for occurrences of the
old label in logging/telemetry consumers and in your emitter/handler functions
(the code paths that call the log/emit functions in executor.ts) and replace
them with 'platform.critical_message_send_failed', then run smoke tests on
dashboards/alerts to confirm they pick up events under the new name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0395a0e-3689-4bbd-af51-7b9caf6cb38f

📥 Commits

Reviewing files that changed from the base of the PR and between e549687 and f1873a1.

📒 Files selected for processing (4)
  • packages/workflows/src/dag-executor.ts
  • packages/workflows/src/executor-shared.test.ts
  • packages/workflows/src/executor-shared.ts
  • packages/workflows/src/executor.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/workflows/src/executor-shared.test.ts
  • packages/workflows/src/executor-shared.ts

@ericsoriano

Copy link
Copy Markdown
Contributor Author

Tests added for safeSendMessage

Addresses the test gap flagged in the review. Added a describe('safeSendMessage') block in executor-shared.test.ts covering the 6 cases requested:

Test case What it verifies
Success path Returns true, resets UnknownErrorTracker.count to 0
TRANSIENT error Returns false without throwing
FATAL error Rethrows with 'Platform authentication/permission error'
UNKNOWN below threshold Increments tracker, returns false
3 consecutive UNKNOWNs Throws '3 consecutive unrecognized errors'
[UNKNOWN→TRANSIENT→UNKNOWN] sequence Count resets on TRANSIENT — no throw on final UNKNOWN
No tracker (DAG executor path) UNKNOWN errors never throw regardless of call count

Validation

```
bun run validate → exit 0
```

Check Result
`type-check` (all 10 packages)
`lint` (0 warnings)
`format:check`
`test` (all packages)

@Wirasm Wirasm merged commit 45bc5e5 into coleam00:dev May 18, 2026
4 checks passed
cropse pushed a commit to cropse/Archon that referenced this pull request May 19, 2026
…utor-shared (coleam00#1496)

* refactor(workflows): consolidate safeSendMessage into executor-shared

executor.ts and dag-executor.ts each defined a near-identical copy of
safeSendMessage and SendMessageContext. Move both into executor-shared.ts
as exports and import from there. The UnknownErrorTracker guard is also
relocated and exported, but remains dormant infrastructure: no callsite
in either file instantiates a tracker today, so this PR delivers pure
deduplication with no behavioral change. Activating the guard in DAG
execution is appropriate follow-up work.

Also addresses two issues surfaced during code review (coleam00#1496):
- Reset UnknownErrorTracker.count on any non-UNKNOWN error (was only
  resetting on success), so a sequence like UNKNOWN -> TRANSIENT ->
  UNKNOWN -> UNKNOWN no longer trips the threshold incorrectly.
- Use the structured 'platform.critical_message_send_failed' event name
  in the inlined log call instead of the prose 'Critical message send
  failed' string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): add unit tests for safeSendMessage in executor-shared

Covers the six cases requested in the PR review:
- Success resets UnknownErrorTracker to 0 and returns true
- TRANSIENT error returns false without throwing
- FATAL error rethrows with 'Platform authentication/permission error'
- UNKNOWN error increments tracker and returns false below threshold
- Three consecutive UNKNOWN errors trip the threshold and throw
- TRANSIENT resets tracker so a subsequent UNKNOWN does not trip threshold
- Calls without unknownErrorTracker (DAG executor path) never throw

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Wirasm Wirasm mentioned this pull request May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(workflow): two silent-failure paths in DAG executor — Codex turn.failed exits stream, Claude success-with-stop-sequence thrown as failure

2 participants