research: AgentEngine → TaskEngine real-time status sync strategy

## ⚠️ NOT DECIDED — Requires Deep Research

This issue tracks a critical architectural decision that has **not been made yet**. It must be fully researched and reviewed before any implementation begins. The choice here significantly affects observability, correctness, and long-term system reliability.

---

## Problem

`AgentEngine._report_to_task_engine` currently calls `transition_task(COMPLETED | FAILED | CANCELLED)` directly on the `TaskEngine`. This works for the final status, but the task's `TaskEngine` state is frozen at `ASSIGNED` or `IN_PROGRESS` for the entire duration of the agent's execution — meaning:

- Other agents querying `list_tasks()` see stale state (no in-flight visibility)
- The MessageBus receives no intermediate events during execution (no `IN_REVIEW`, no `AWAITING_APPROVAL`)
- If a crash happens mid-execution, the task stays in `IN_PROGRESS` indefinitely with no recovery hint
- There is no version tracking thread across the lifecycle, so a later catch-up write has no safe anchor point

The deeper issue: `Task` itself has **no `version` field** — version lives only in `TaskEngine._versions` (an in-memory dict). So any catch-up loop must use `TaskMutationResult.version` returned by each `submit()` call.

---

## Current Behavior

```
[AgentEngine]                     [TaskEngine]          [MessageBus]
start execution          ────►   IN_PROGRESS            event: IN_PROGRESS
...running for 5 minutes...      IN_PROGRESS            (nothing)
...review gate...                 IN_PROGRESS            (nothing)
finish                   ────►   COMPLETED              event: COMPLETED
```

## Desired Behavior (example — not decided)

```
[AgentEngine]                     [TaskEngine]          [MessageBus]
start execution          ────►   IN_PROGRESS            event: IN_PROGRESS
...working...                     IN_PROGRESS            (internal only)
enter review gate        ────►   IN_REVIEW              event: IN_REVIEW
review complete          ────►   IN_PROGRESS            event: IN_PROGRESS
finish                   ────►   COMPLETED              event: COMPLETED
```

---

## Options Under Consideration

### Option A — Incremental sync (real-time transitions)
`AgentEngine` calls `transition_task()` at each meaningful lifecycle stage:
1. `IN_PROGRESS` at execution start
2. `IN_REVIEW` when entering code review gate
3. `AWAITING_APPROVAL` when blocked on human approval
4. Final status (`COMPLETED` / `FAILED` / `CANCELLED`) at end

**Pros:**
- Real-time visibility into every stage
- MessageBus events fire at each transition — subscribers react immediately
- Crash recovery: last known state is the actual last stage reached
- Aligns with event-sourcing / audit trail patterns

**Cons:**
- `AgentEngine` must know the full task state machine (tight coupling to `TaskStatus` FSM)
- Every lifecycle point needs an explicit `transition_task()` call — easy to forget new states
- If a transition fails (invalid state machine path), execution may be partially logged

---

### Option B — Staged catch-up with optimistic locking
`AgentEngine` tracks transitions locally, then replays them through `TaskEngine` after the fact using the version number from `TaskMutationResult`:

```python
local_journal = [IN_PROGRESS, IN_REVIEW, COMPLETED]
for step in local_journal:
    result = await task_engine.submit(TransitionTaskMutation(..., expected_version=last_version))
    last_version = result.version
```

**Pros:**
- `AgentEngine` stays decoupled from `TaskEngine` during hot execution path
- Version locking prevents replay conflicts if another writer touched the task

**Cons:**
- No real-time visibility — state is only correct after catch-up replay
- If catch-up replay fails mid-way, TaskEngine ends up in a partial state
- `Task` has no `version` field — must track version from `TaskMutationResult` manually
- Adds significant complexity (journal + replay machinery) with no real-time benefit

---

### Option C — Hybrid: incremental sync + correctness safety net
Combine A and B:
- Use **incremental sync** (Option A) for real-time transitions
- On final completion, use **version-anchored verification** to confirm TaskEngine state matches expectation
- If verification fails (e.g., stale version, missed intermediate transition), emit a compensating mutation

**Pros:**
- Real-time observability AND correctness guarantee
- Crash recovery works at the last confirmed transition
- Best of both worlds

**Cons:**
- More implementation surface area
- Need to design the compensation logic carefully to avoid infinite correction loops

---

### Option D — Heartbeat / polling model
`AgentEngine` publishes progress to a lightweight side-channel (e.g., an in-memory progress dict or MessageBus topic), while `TaskEngine` state is only updated at start and end.

**Pros:**
- Minimal coupling between execution and task state machine
- Fast — no queue round-trips during execution

**Cons:**
- TaskEngine state is still stale (no intermediate transitions)
- Progress data is ephemeral — no persistent audit trail
- Observers must subscribe to a different channel than task state changes

---

### Option E — State machine relaxation
Relax the `Task.with_transition()` FSM to allow `ASSIGNED → COMPLETED` or `IN_PROGRESS → COMPLETED` in one hop.

**Pros:**
- Simplest implementation — existing `_report_to_task_engine` just works
- No additional `transition_task()` calls needed in `AgentEngine`

**Cons:**
- Loses all real-time observability (same as current)
- Audit trail only shows start/end, not what happened in between
- Does not satisfy the "smart connected system" requirement

---

## Open Research Questions

1. **What intermediate states does `AgentEngine` actually traverse?** Enumerate all lifecycle points where a `transition_task()` call would be semantically meaningful. Does every agent execution follow the same state path, or does it vary by execution loop (`ReactLoop`, `PlanExecuteLoop`, etc.)?

2. **Version tracking design**: `Task` has no `version` field. Should `version` be added to `Task` for transparency, or should it remain internal to `TaskEngine._versions`? If internal, how does `AgentEngine` track the last confirmed version across multiple transitions?

3. **Failure semantics**: If an intermediate `transition_task()` call fails (e.g., FSM rejects the transition), should the agent abort, retry, or continue and log a warning? What is the correct recovery path?

4. **Concurrency**: If multiple sub-agents are writing back to the same task concurrently (parallel execution), how does optimistic locking interact with incremental sync? Can the version sequence become non-linear?

5. **FSM completeness**: Does the current `Task.with_transition()` state machine support all needed intermediate transitions (`IN_PROGRESS → IN_REVIEW → IN_PROGRESS → COMPLETED`)? Are any new edges needed?

6. **Subscriber impact**: What downstream subscribers (dashboard, other agents, HR module) currently consume `TaskStateChanged` events? What events do they need to react correctly to partial execution stages?

7. **Performance**: How many `transition_task()` calls does a typical agent execution generate under Option A? Is there a risk of queue saturation for long-running agents?

---

## Relevant Files

- `src/ai_company/engine/agent_engine.py` — `_report_to_task_engine()`, execution lifecycle
- `src/ai_company/engine/task_engine.py` — `transition_task()`, `_versions`, mutation loop
- `src/ai_company/engine/task_engine_models.py` — `TransitionTaskMutation`, `TaskMutationResult.version`
- `src/ai_company/core/task.py` — `Task.with_transition()`, FSM edges
- `src/ai_company/core/enums.py` — `TaskStatus` enum values
- `src/ai_company/engine/react_loop.py` — execution loop lifecycle
- `src/ai_company/engine/plan_execute_loop.py` — plan-and-execute lifecycle

---

## Decision Criteria

The chosen approach must satisfy:
- [ ] Real-time visibility into in-flight task state (other agents can observe current stage)
- [ ] MessageBus events at each meaningful lifecycle transition
- [ ] Crash-safe: last known state is recoverable, not forever `IN_PROGRESS`
- [ ] Correctness: no version conflicts under concurrent sub-agent execution
- [ ] Auditability: full transition history reconstructible from events
- [ ] Simplicity: minimal coupling between `AgentEngine` and `TaskEngine` internals


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research: AgentEngine → TaskEngine real-time status sync strategy #323

⚠️ NOT DECIDED — Requires Deep Research

Problem

Current Behavior

Desired Behavior (example — not decided)

Options Under Consideration

Option A — Incremental sync (real-time transitions)

Option B — Staged catch-up with optimistic locking

Option C — Hybrid: incremental sync + correctness safety net

Option D — Heartbeat / polling model

Option E — State machine relaxation

Open Research Questions

Relevant Files

Decision Criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

research: AgentEngine → TaskEngine real-time status sync strategy #323

Description

⚠️ NOT DECIDED — Requires Deep Research

Problem

Current Behavior

Desired Behavior (example — not decided)

Options Under Consideration

Option A — Incremental sync (real-time transitions)

Option B — Staged catch-up with optimistic locking

Option C — Hybrid: incremental sync + correctness safety net

Option D — Heartbeat / polling model

Option E — State machine relaxation

Open Research Questions

Relevant Files

Decision Criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions