Implement graceful shutdown with cooperative timeout strategy (DESIGN_SPEC §6.7) #130
Closed
Labels
- `prio:high`: Important, should be prioritized
- `scope:medium`: 1-3 days of work
- `spec:agent-system`: DESIGN_SPEC Section 3 - Agent System
- `spec:task-workflow`: DESIGN_SPEC Section 6 - Task & Workflow Engine
- `type:feature`: New feature implementation
Description
Context
When the process receives SIGTERM or SIGINT (user Ctrl+C, `docker stop`, systemd shutdown), the framework needs to stop cleanly without losing work or leaking costs. The MVP implements the "cooperative with timeout" strategy behind a `ShutdownStrategy` protocol.
Acceptance Criteria
ShutdownStrategy Protocol
- `ShutdownStrategy` protocol defined
- Protocol is pluggable: new strategies can be registered via config
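A minimal sketch of how a pluggable protocol and config-driven registration could fit together, assuming Python's structural `typing.Protocol`. The `execute()` method, the `register_strategy`/`build_strategy` helpers, the `strategy` config key, and the `immediate` strategy are all hypothetical names chosen to illustrate the pattern; none of them come from DESIGN_SPEC §6.7.

```python
from __future__ import annotations

import asyncio
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class ShutdownStrategy(Protocol):
    """Structural interface a shutdown strategy satisfies (hypothetical shape)."""

    async def execute(self, agent_tasks: set[asyncio.Task]) -> None:
        """Bring the given agent tasks to a stop according to this strategy."""
        ...


StrategyFactory = Callable[[dict], ShutdownStrategy]

# Hypothetical registry: config name -> factory that builds the strategy from its config block.
_REGISTRY: dict[str, StrategyFactory] = {}


def register_strategy(name: str) -> Callable[[StrategyFactory], StrategyFactory]:
    """Decorator that makes a strategy selectable by name from config."""

    def decorator(factory: StrategyFactory) -> StrategyFactory:
        if name in _REGISTRY:
            raise ValueError(f"shutdown strategy {name!r} is already registered")
        _REGISTRY[name] = factory
        return factory

    return decorator


def build_strategy(config: dict) -> ShutdownStrategy:
    """Instantiate the strategy named by the (hypothetical) `strategy` config key."""
    name = config["strategy"]
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown shutdown strategy: {name!r}") from None
    return factory(config)


@register_strategy("immediate")
class ImmediateCancelStrategy:
    """Illustrative alternative strategy: cancel every agent at once, no grace period."""

    def __init__(self, config: dict) -> None:
        self.config = config

    async def execute(self, agent_tasks: set[asyncio.Task]) -> None:
        for task in agent_tasks:
            task.cancel()
        # Wait until the cancellations are processed; return_exceptions=True
        # collects each CancelledError instead of re-raising it here.
        await asyncio.gather(*agent_tasks, return_exceptions=True)
```

Because the protocol is structural, a strategy does not have to inherit from anything: any class with a matching `execute()` can be registered, and the config only needs to name it.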
Cooperative with Timeout Strategy (Default / MVP)
- `shutdown_event` (`asyncio.Event`) set on signal receipt; agents check it at turn boundaries
- Stop accepting new tasks (drain gate closes)
- Wait up to `grace_seconds` (default: 30) for agents to exit cooperatively
- Force-cancel remaining agents (`task.cancel()`) after the grace period
- Cleanup phase (`cleanup_seconds`, default: 5): persist cost records, close provider connections, flush logs
- Configurable via the `graceful_shutdown:` YAML config block
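The cooperative-with-timeout sequence above could look roughly like the sketch below. It assumes agents run as `asyncio` tasks and that cleanup is injected as a coroutine; `GracefulShutdownConfig`, `CooperativeTimeoutShutdown`, `accepting_tasks`, `request_shutdown()` and the `cleanup` callable are hypothetical names. Only `shutdown_event`, `grace_seconds`, `cleanup_seconds` and their defaults come from this issue.

```python
from __future__ import annotations

import asyncio
import logging
from dataclasses import dataclass
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class GracefulShutdownConfig:
    """Values that would be read from the `graceful_shutdown:` YAML block."""

    grace_seconds: float = 30.0
    cleanup_seconds: float = 5.0


class CooperativeTimeoutShutdown:
    """Sketch of the cooperative-with-timeout strategy (hypothetical class)."""

    def __init__(self, config: GracefulShutdownConfig, cleanup: Callable[[], Awaitable[None]]) -> None:
        self.config = config
        self.shutdown_event = asyncio.Event()  # agents check this at turn boundaries
        self.accepting_tasks = True  # the drain gate; a scheduler would check it before dispatching
        self._cleanup = cleanup  # e.g. persist cost records, close providers, flush logs

    def request_shutdown(self) -> None:
        """Close the drain gate and signal agents; safe to call more than once."""
        self.accepting_tasks = False
        self.shutdown_event.set()

    async def execute(self, agent_tasks: set[asyncio.Task]) -> set[asyncio.Task]:
        """Run the shutdown sequence; return the tasks that had to be force-cancelled."""
        self.request_shutdown()
        forced: set[asyncio.Task] = set()
        if agent_tasks:
            # Phase 1: give agents up to grace_seconds to notice the event and exit.
            _done, pending = await asyncio.wait(agent_tasks, timeout=self.config.grace_seconds)
            # Phase 2: force-cancel stragglers and wait for their cancellation handlers.
            for task in pending:
                task.cancel()
            await asyncio.gather(*pending, return_exceptions=True)
            forced = set(pending)
        # Phase 3: cleanup, bounded by cleanup_seconds.
        try:
            await asyncio.wait_for(self._cleanup(), timeout=self.config.cleanup_seconds)
        except asyncio.TimeoutError:
            logger.warning("cleanup did not finish within %.1fs", self.config.cleanup_seconds)
        return forced
```

`asyncio.wait(..., timeout=...)` returns the still-running tasks without cancelling them, so the force-cancel step has to call `task.cancel()` explicitly. The follow-up `gather(..., return_exceptions=True)` lets each cancelled agent run its `except CancelledError` or `finally` blocks (for example, to mark its task `INTERRUPTED`) before cleanup starts.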
New TaskStatus: INTERRUPTED
- `INTERRUPTED` added to the `TaskStatus` enum as a non-terminal state
- Valid transitions updated: `IN_PROGRESS → INTERRUPTED`, `INTERRUPTED → ASSIGNED` (reassignment on restart)
- `INTERRUPTED` indicates process shutdown, distinct from `FAILED` (agent error) and `CANCELLED` (user action)
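One way to express the new status and its transitions, as a sketch. Only the statuses named in this issue are included. The two `INTERRUPTED` edges come from this issue; the `ASSIGNED → IN_PROGRESS` and `IN_PROGRESS → FAILED`/`CANCELLED` edges are assumptions standing in for the existing lifecycle from #21, and `validate_transition()` is a hypothetical helper.

```python
from __future__ import annotations

from enum import Enum


class TaskStatus(Enum):
    # Only the statuses mentioned in this issue; the real enum likely has more.
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    INTERRUPTED = "interrupted"  # new: non-terminal, set when the process shuts down mid-task
    FAILED = "failed"
    CANCELLED = "cancelled"


# Partial transition table. The INTERRUPTED edges come from this issue;
# the other edges are assumed placeholders for the existing lifecycle.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    TaskStatus.ASSIGNED: frozenset({TaskStatus.IN_PROGRESS}),
    TaskStatus.IN_PROGRESS: frozenset(
        {TaskStatus.INTERRUPTED, TaskStatus.FAILED, TaskStatus.CANCELLED}
    ),
    TaskStatus.INTERRUPTED: frozenset({TaskStatus.ASSIGNED}),  # reassignment on restart
}


def validate_transition(current: TaskStatus, target: TaskStatus) -> None:
    """Raise ValueError unless current -> target is an allowed edge (hypothetical helper)."""
    if target not in VALID_TRANSITIONS.get(current, frozenset()):
        raise ValueError(f"invalid transition {current.name} -> {target.name}")
```

Routing a restart through `ASSIGNED` rather than straight back to `IN_PROGRESS` lets the scheduler choose a (possibly different) agent for the interrupted task.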
Signal Handling
- SIGINT (Ctrl+C) handled cross-platform
- SIGTERM handled on Unix; `signal.signal()` fallback on Windows (no `loop.add_signal_handler()` support)
- In-flight LLM calls: log the request start (input token count) before each provider call for cost audit
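A sketch of the cross-platform signal wiring, assuming the shutdown callback (for example, a method that sets `shutdown_event`) is passed in. `install_shutdown_handlers()` is a hypothetical helper; `loop.add_signal_handler()`, `signal.signal()` and `loop.call_soon_threadsafe()` are standard-library APIs.

```python
from __future__ import annotations

import asyncio
import signal
from typing import Callable


def install_shutdown_handlers(
    loop: asyncio.AbstractEventLoop, on_shutdown: Callable[[], None]
) -> list[signal.Signals]:
    """Route SIGINT and SIGTERM to `on_shutdown` (hypothetical helper)."""
    installed: list[signal.Signals] = []
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            # Unix: the callback runs inside the event loop, so it may use asyncio objects directly.
            loop.add_signal_handler(sig, on_shutdown)
        except NotImplementedError:
            # Windows event loops do not implement add_signal_handler(); fall back to
            # signal.signal() and hand the callback to the loop in a thread-safe way.
            signal.signal(sig, lambda signum, frame: loop.call_soon_threadsafe(on_shutdown))
        installed.append(sig)
    return installed
```

On Windows, asyncio event loops raise `NotImplementedError` from `add_signal_handler()`, which is what triggers the fallback. A `signal.signal()` handler runs in the main thread outside the loop's control, so it schedules the callback with `call_soon_threadsafe()`, which also wakes a loop that is blocked waiting for I/O.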
Testing
- Unit tests for ShutdownStrategy protocol
- Unit tests for cooperative timeout with simulated agents
- Integration test: send SIGINT → agents stop → tasks marked INTERRUPTED → cleanup runs
Dependencies
- Implement agent engine core with ExecutionLoop protocol integration (DESIGN_SPEC §3.1, §6.1, §6.5) #11 — Agent engine core
- Implement single-task execution lifecycle (assign, execute, complete) #21 — Task execution lifecycle (status transitions)
- Crash recovery issue (the `FAILED` status must exist first, or both statuses can be implemented together)
Design Spec Reference
- §6.7 — Graceful Shutdown Protocol (Strategy 1: Cooperative with Timeout)