-
Notifications
You must be signed in to change notification settings - Fork 0
research: AgentEngine → TaskEngine real-time status sync strategy #323
Description
⚠️ NOT DECIDED — Requires Deep Research
This issue tracks a critical architectural decision that has not been made yet. It must be fully researched and reviewed before any implementation begins. The choice here significantly affects observability, correctness, and long-term system reliability.
Problem
AgentEngine._report_to_task_engine currently calls transition_task(COMPLETED | FAILED | CANCELLED) directly on the TaskEngine. This works for the final status, but the task's TaskEngine state is frozen at ASSIGNED or IN_PROGRESS for the entire duration of the agent's execution — meaning:
- Other agents querying
list_tasks()see stale state (no in-flight visibility) - The MessageBus receives no intermediate events during execution (no
IN_REVIEW, noAWAITING_APPROVAL) - If a crash happens mid-execution, the task stays in
IN_PROGRESSindefinitely with no recovery hint - There is no version tracking thread across the lifecycle, so a later catch-up write has no safe anchor point
The deeper issue: Task itself has no version field — version lives only in TaskEngine._versions (an in-memory dict). So any catch-up loop must use TaskMutationResult.version returned by each submit() call.
Current Behavior
[AgentEngine] [TaskEngine] [MessageBus]
start execution ────► IN_PROGRESS event: IN_PROGRESS
...running for 5 minutes... IN_PROGRESS (nothing)
...review gate... IN_PROGRESS (nothing)
finish ────► COMPLETED event: COMPLETED
Desired Behavior (example — not decided)
[AgentEngine] [TaskEngine] [MessageBus]
start execution ────► IN_PROGRESS event: IN_PROGRESS
...working... IN_PROGRESS (internal only)
enter review gate ────► IN_REVIEW event: IN_REVIEW
review complete ────► IN_PROGRESS event: IN_PROGRESS
finish ────► COMPLETED event: COMPLETED
Options Under Consideration
Option A — Incremental sync (real-time transitions)
AgentEngine calls transition_task() at each meaningful lifecycle stage:
IN_PROGRESSat execution startIN_REVIEWwhen entering code review gateAWAITING_APPROVALwhen blocked on human approval- Final status (
COMPLETED/FAILED/CANCELLED) at end
Pros:
- Real-time visibility into every stage
- MessageBus events fire at each transition — subscribers react immediately
- Crash recovery: last known state is the actual last stage reached
- Aligns with event-sourcing / audit trail patterns
Cons:
AgentEnginemust know the full task state machine (tight coupling toTaskStatusFSM)- Every lifecycle point needs an explicit
transition_task()call — easy to forget new states - If a transition fails (invalid state machine path), execution may be partially logged
Option B — Staged catch-up with optimistic locking
AgentEngine tracks transitions locally, then replays them through TaskEngine after the fact using the version number from TaskMutationResult:
local_journal = [IN_PROGRESS, IN_REVIEW, COMPLETED]
for step in local_journal:
result = await task_engine.submit(TransitionTaskMutation(..., expected_version=last_version))
last_version = result.versionPros:
AgentEnginestays decoupled fromTaskEngineduring hot execution path- Version locking prevents replay conflicts if another writer touched the task
Cons:
- No real-time visibility — state is only correct after catch-up replay
- If catch-up replay fails mid-way, TaskEngine ends up in a partial state
Taskhas noversionfield — must track version fromTaskMutationResultmanually- Adds significant complexity (journal + replay machinery) with no real-time benefit
Option C — Hybrid: incremental sync + correctness safety net
Combine A and B:
- Use incremental sync (Option A) for real-time transitions
- On final completion, use version-anchored verification to confirm TaskEngine state matches expectation
- If verification fails (e.g., stale version, missed intermediate transition), emit a compensating mutation
Pros:
- Real-time observability AND correctness guarantee
- Crash recovery works at the last confirmed transition
- Best of both worlds
Cons:
- More implementation surface area
- Need to design the compensation logic carefully to avoid infinite correction loops
Option D — Heartbeat / polling model
AgentEngine publishes progress to a lightweight side-channel (e.g., an in-memory progress dict or MessageBus topic), while TaskEngine state is only updated at start and end.
Pros:
- Minimal coupling between execution and task state machine
- Fast — no queue round-trips during execution
Cons:
- TaskEngine state is still stale (no intermediate transitions)
- Progress data is ephemeral — no persistent audit trail
- Observers must subscribe to a different channel than task state changes
Option E — State machine relaxation
Relax the Task.with_transition() FSM to allow ASSIGNED → COMPLETED or IN_PROGRESS → COMPLETED in one hop.
Pros:
- Simplest implementation — existing
_report_to_task_enginejust works - No additional
transition_task()calls needed inAgentEngine
Cons:
- Loses all real-time observability (same as current)
- Audit trail only shows start/end, not what happened in between
- Does not satisfy the "smart connected system" requirement
Open Research Questions
-
What intermediate states does
AgentEngineactually traverse? Enumerate all lifecycle points where atransition_task()call would be semantically meaningful. Does every agent execution follow the same state path, or does it vary by execution loop (ReactLoop,PlanExecuteLoop, etc.)? -
Version tracking design:
Taskhas noversionfield. Shouldversionbe added toTaskfor transparency, or should it remain internal toTaskEngine._versions? If internal, how doesAgentEnginetrack the last confirmed version across multiple transitions? -
Failure semantics: If an intermediate
transition_task()call fails (e.g., FSM rejects the transition), should the agent abort, retry, or continue and log a warning? What is the correct recovery path? -
Concurrency: If multiple sub-agents are writing back to the same task concurrently (parallel execution), how does optimistic locking interact with incremental sync? Can the version sequence become non-linear?
-
FSM completeness: Does the current
Task.with_transition()state machine support all needed intermediate transitions (IN_PROGRESS → IN_REVIEW → IN_PROGRESS → COMPLETED)? Are any new edges needed? -
Subscriber impact: What downstream subscribers (dashboard, other agents, HR module) currently consume
TaskStateChangedevents? What events do they need to react correctly to partial execution stages? -
Performance: How many
transition_task()calls does a typical agent execution generate under Option A? Is there a risk of queue saturation for long-running agents?
Relevant Files
src/ai_company/engine/agent_engine.py—_report_to_task_engine(), execution lifecyclesrc/ai_company/engine/task_engine.py—transition_task(),_versions, mutation loopsrc/ai_company/engine/task_engine_models.py—TransitionTaskMutation,TaskMutationResult.versionsrc/ai_company/core/task.py—Task.with_transition(), FSM edgessrc/ai_company/core/enums.py—TaskStatusenum valuessrc/ai_company/engine/react_loop.py— execution loop lifecyclesrc/ai_company/engine/plan_execute_loop.py— plan-and-execute lifecycle
Decision Criteria
The chosen approach must satisfy:
- Real-time visibility into in-flight task state (other agents can observe current stage)
- MessageBus events at each meaningful lifecycle transition
- Crash-safe: last known state is recoverable, not forever
IN_PROGRESS - Correctness: no version conflicts under concurrent sub-agent execution
- Auditability: full transition history reconstructible from events
- Simplicity: minimal coupling between
AgentEngineandTaskEngineinternals