Skip to content

research: AgentEngine → TaskEngine real-time status sync strategy #323

@Aureliolo

Description

@Aureliolo

⚠️ NOT DECIDED — Requires Deep Research

This issue tracks a critical architectural decision that has not been made yet. It must be fully researched and reviewed before any implementation begins. The choice here significantly affects observability, correctness, and long-term system reliability.


Problem

AgentEngine._report_to_task_engine currently calls transition_task(COMPLETED | FAILED | CANCELLED) directly on the TaskEngine. This works for the final status, but the task's TaskEngine state is frozen at ASSIGNED or IN_PROGRESS for the entire duration of the agent's execution — meaning:

  • Other agents querying list_tasks() see stale state (no in-flight visibility)
  • The MessageBus receives no intermediate events during execution (no IN_REVIEW, no AWAITING_APPROVAL)
  • If a crash happens mid-execution, the task stays in IN_PROGRESS indefinitely with no recovery hint
  • There is no version tracking thread across the lifecycle, so a later catch-up write has no safe anchor point

The deeper issue: Task itself has no version field — version lives only in TaskEngine._versions (an in-memory dict). So any catch-up loop must use TaskMutationResult.version returned by each submit() call.


Current Behavior

[AgentEngine]                     [TaskEngine]          [MessageBus]
start execution          ────►   IN_PROGRESS            event: IN_PROGRESS
...running for 5 minutes...      IN_PROGRESS            (nothing)
...review gate...                 IN_PROGRESS            (nothing)
finish                   ────►   COMPLETED              event: COMPLETED

Desired Behavior (example — not decided)

[AgentEngine]                     [TaskEngine]          [MessageBus]
start execution          ────►   IN_PROGRESS            event: IN_PROGRESS
...working...                     IN_PROGRESS            (internal only)
enter review gate        ────►   IN_REVIEW              event: IN_REVIEW
review complete          ────►   IN_PROGRESS            event: IN_PROGRESS
finish                   ────►   COMPLETED              event: COMPLETED

Options Under Consideration

Option A — Incremental sync (real-time transitions)

AgentEngine calls transition_task() at each meaningful lifecycle stage:

  1. IN_PROGRESS at execution start
  2. IN_REVIEW when entering code review gate
  3. AWAITING_APPROVAL when blocked on human approval
  4. Final status (COMPLETED / FAILED / CANCELLED) at end

Pros:

  • Real-time visibility into every stage
  • MessageBus events fire at each transition — subscribers react immediately
  • Crash recovery: last known state is the actual last stage reached
  • Aligns with event-sourcing / audit trail patterns

Cons:

  • AgentEngine must know the full task state machine (tight coupling to TaskStatus FSM)
  • Every lifecycle point needs an explicit transition_task() call — easy to forget new states
  • If a transition fails (invalid state machine path), execution may be partially logged

Option B — Staged catch-up with optimistic locking

AgentEngine tracks transitions locally, then replays them through TaskEngine after the fact using the version number from TaskMutationResult:

local_journal = [IN_PROGRESS, IN_REVIEW, COMPLETED]
for step in local_journal:
    result = await task_engine.submit(TransitionTaskMutation(..., expected_version=last_version))
    last_version = result.version

Pros:

  • AgentEngine stays decoupled from TaskEngine during hot execution path
  • Version locking prevents replay conflicts if another writer touched the task

Cons:

  • No real-time visibility — state is only correct after catch-up replay
  • If catch-up replay fails mid-way, TaskEngine ends up in a partial state
  • Task has no version field — must track version from TaskMutationResult manually
  • Adds significant complexity (journal + replay machinery) with no real-time benefit

Option C — Hybrid: incremental sync + correctness safety net

Combine A and B:

  • Use incremental sync (Option A) for real-time transitions
  • On final completion, use version-anchored verification to confirm TaskEngine state matches expectation
  • If verification fails (e.g., stale version, missed intermediate transition), emit a compensating mutation

Pros:

  • Real-time observability AND correctness guarantee
  • Crash recovery works at the last confirmed transition
  • Best of both worlds

Cons:

  • More implementation surface area
  • Need to design the compensation logic carefully to avoid infinite correction loops

Option D — Heartbeat / polling model

AgentEngine publishes progress to a lightweight side-channel (e.g., an in-memory progress dict or MessageBus topic), while TaskEngine state is only updated at start and end.

Pros:

  • Minimal coupling between execution and task state machine
  • Fast — no queue round-trips during execution

Cons:

  • TaskEngine state is still stale (no intermediate transitions)
  • Progress data is ephemeral — no persistent audit trail
  • Observers must subscribe to a different channel than task state changes

Option E — State machine relaxation

Relax the Task.with_transition() FSM to allow ASSIGNED → COMPLETED or IN_PROGRESS → COMPLETED in one hop.

Pros:

  • Simplest implementation — existing _report_to_task_engine just works
  • No additional transition_task() calls needed in AgentEngine

Cons:

  • Loses all real-time observability (same as current)
  • Audit trail only shows start/end, not what happened in between
  • Does not satisfy the "smart connected system" requirement

Open Research Questions

  1. What intermediate states does AgentEngine actually traverse? Enumerate all lifecycle points where a transition_task() call would be semantically meaningful. Does every agent execution follow the same state path, or does it vary by execution loop (ReactLoop, PlanExecuteLoop, etc.)?

  2. Version tracking design: Task has no version field. Should version be added to Task for transparency, or should it remain internal to TaskEngine._versions? If internal, how does AgentEngine track the last confirmed version across multiple transitions?

  3. Failure semantics: If an intermediate transition_task() call fails (e.g., FSM rejects the transition), should the agent abort, retry, or continue and log a warning? What is the correct recovery path?

  4. Concurrency: If multiple sub-agents are writing back to the same task concurrently (parallel execution), how does optimistic locking interact with incremental sync? Can the version sequence become non-linear?

  5. FSM completeness: Does the current Task.with_transition() state machine support all needed intermediate transitions (IN_PROGRESS → IN_REVIEW → IN_PROGRESS → COMPLETED)? Are any new edges needed?

  6. Subscriber impact: What downstream subscribers (dashboard, other agents, HR module) currently consume TaskStateChanged events? What events do they need to react correctly to partial execution stages?

  7. Performance: How many transition_task() calls does a typical agent execution generate under Option A? Is there a risk of queue saturation for long-running agents?


Relevant Files

  • src/ai_company/engine/agent_engine.py_report_to_task_engine(), execution lifecycle
  • src/ai_company/engine/task_engine.pytransition_task(), _versions, mutation loop
  • src/ai_company/engine/task_engine_models.pyTransitionTaskMutation, TaskMutationResult.version
  • src/ai_company/core/task.pyTask.with_transition(), FSM edges
  • src/ai_company/core/enums.pyTaskStatus enum values
  • src/ai_company/engine/react_loop.py — execution loop lifecycle
  • src/ai_company/engine/plan_execute_loop.py — plan-and-execute lifecycle

Decision Criteria

The chosen approach must satisfy:

  • Real-time visibility into in-flight task state (other agents can observe current stage)
  • MessageBus events at each meaningful lifecycle transition
  • Crash-safe: last known state is recoverable, not forever IN_PROGRESS
  • Correctness: no version conflicts under concurrent sub-agent execution
  • Auditability: full transition history reconstructible from events
  • Simplicity: minimal coupling between AgentEngine and TaskEngine internals

Metadata

Metadata

Assignees

No one assigned

    Labels

    prio:highImportant, should be prioritizedscope:large3+ days of workspec:task-workflowDESIGN_SPEC Section 6 - Task & Workflow Enginetype:researchEvaluate options, make tech decisions

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions